DEV Community

# evaluation

Posts

Your Eval Suite Is Grading Fiction: Stop Inventing Test Cases and Mine Your Traces

Comments
5 min read
Put Your Agent Evals in CI or Stop Calling Them Evals

1
Comments
5 min read
An LLM benchmark is only useful for as long as it's hard

2
Comments
10 min read
I Tested Claude Opus 4, GPT-4.1, GPT-4o, Sonnet 4, and Gemini 2.5 Pro on 10 Adversarial Scenarios. They All Broke on the Same One.

2
Comments
11 min read
Deterministic Checks vs Model-as-Judge: A Tiered Approach to Agent Evaluation

Comments
4 min read
Monitoring vs Evaluation — What's the Difference (and Why It Matters)

5
Comments
6 min read
Evals Are Alignment Enforcement: Why Your Safety Strategy Needs Runtime Checks

1
Comments
5 min read
The First Psychiatric Evaluation of AI Agents (第一次对AI Agent的精神病学评估)

1
Comments
1 min read
Prompt Engineering for Automated Evaluation: Making LLMs the Judge in AI Builder Solutions

5
Comments
4 min read
The First Psychiatric Evaluation of AI Agents

Comments
3 min read
Why I used three different critic roles instead of one (and what the eval taught me)

Comments 2
6 min read
Building a domain-specific LLM evaluation set from scratch

1
Comments
8 min read
What is an LLM evaluation harness? A deep dive into lm-eval-harness

1
Comments
7 min read
Evaluating LLM code reviewers: an offline harness for precision, recall, and routing

2
Comments
5 min read
How do you eval LLM output that isn't code?

Comments 1
3 min read