DEV Community
#evaluation
Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces
Saurav Bhattacharya
Jun 17
#ai #evaluation #observability #typescript
5 min read
Put Your Agent Evals in CI or Stop Calling Them Evals
Saurav Bhattacharya
Jun 16
#ai #agents #evaluation #devops
1 reaction
5 min read
An LLM benchmark is only useful for as long as it's hard
Arthur
Jun 11
#llm #evaluation #benchmarks #humaneval
2 reactions
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.
Saurav Bhattacharya
Jun 9
#ai #agents #safety #evaluation
2 reactions
11 min read
Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation
Saurav Bhattacharya
Jun 5
#ai #testing #agents #evaluation
4 min read
Monitoring vs Evaluation — What's the Difference (and Why It Matters)
Phylis Korir
Jun 3
#monitoring #evaluation #projectmanagement #beginners
5 reactions
6 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks
Saurav Bhattacharya
Jun 7
#ai #security #evaluation #agents
1 reaction
5 min read
The First Psychiatric Evaluation of AI Agents (第一次对AI Agent的精神病学评估)
guangda
Jun 6
#ai #agents #psychology #evaluation
1 reaction
1 min read
Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions
Bala Madhusoodhanan
May 25
#aibuilder #powerplatform #evaluation #powerfuldevs
5 reactions
4 min read
The First Psychiatric Evaluation of AI Agents
guangda
Jun 5
#ai #agents #psychology #evaluation
3 min read
Why I used three different critic roles instead of one (and what the eval taught me)
Bohyeon Jang
May 31
#llm #python #ai #evaluation
2 comments
6 min read
Building a domain-specific LLM evaluation set from scratch
Tech_Nuggets
Jun 4
#llm #ai #evaluation #opensource
1 reaction
8 min read
What is an LLM evaluation harness? A deep dive into lm-eval-harness
Tech_Nuggets
Jun 3
#llm #ai #evaluation #opensource
1 reaction
7 min read
Evaluating LLM code reviewers: an offline harness for precision, recall, and routing
Prakhar Singh
May 13
#llm #codereview #evaluation #ai
2 reactions
5 min read
How do you eval LLM output that isn't code?
ur-grue
May 29
#ai #llm #evaluation #writing
1 comment
3 min read