Quality & Testing · 12 min read · March 2026
AI Agent Framework Specialists

How to Evaluate AI Agent Performance: A Practical Framework

A practical guide to evaluating AI agents: why standard metrics fail, task success rate vs trajectory eval, LLM-as-judge, Ragas/PromptFoo/Langfuse/Braintrust, golden test sets, and what to put on a production dashboard.

Why Agent Evaluation Is Hard (And Different From Model Evaluation)

Evaluating a language model is hard. Evaluating an AI agent is harder by an order of magnitude, for three compounding reasons.

Non-determinism: agents call LLMs internally, and LLM outputs are stochastic. Run the same agent on the same input twice and you may get two different tool call sequences, two different intermediate reasoning steps, and two different final answers — both of which could be correct. Standard unit test assertions break down when the correct path through a task isn't unique.

Long action chains: an agent completing a five-step task can fail in many ways — wrong tool called in step 2, correct tool but wrong parameters in step 4, correct actions but incorrect synthesis in the final answer. Detecting where a chain went wrong requires logging and evaluating intermediate states, not just final outputs.

Tool use makes outputs non-self-contained: the agent's answer depends on what the tools returned at runtime. The same query in a different environment (different database state, different API response) may have a different correct answer. Eval sets built against one environment don't automatically transfer.

These challenges mean that borrowing evaluation methodology from software testing (pass/fail assertions) or NLP benchmarking (BLEU, ROUGE) will produce misleading results. Agent evaluation needs its own framework, and deploying an agent without one is essentially releasing untested code.
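To make the non-determinism problem concrete, here is a minimal pytest-style sketch. The `run_agent` harness, the tool names, and the result fields are hypothetical stand-ins for whatever wrapper you put around your agent; the point is the contrast between asserting on one specific path and asserting on the outcome plus hard constraints.

```python
# Hypothetical harness: run_agent() returns an object with .answer and .tool_calls.
ALLOWED_TOOLS = {"get_order", "get_refund"}

def test_refund_status_naive(run_agent):
    result = run_agent("What is the refund status for order 1042?")
    # Brittle: a second run may legitimately call get_order before get_refund,
    # or skip a lookup the model decides it doesn't need.
    assert [c.tool for c in result.tool_calls] == ["get_refund"]

def test_refund_status_outcome_based(run_agent):
    result = run_agent("What is the refund status for order 1042?")
    # More robust: assert on the outcome and on hard constraints,
    # not on one specific path through the task.
    assert "1042" in result.answer
    assert "refund" in result.answer.lower()
    assert all(c.tool in ALLOWED_TOOLS for c in result.tool_calls)
```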

Task Success Rate vs Trajectory Evaluation

Task success rate is the bluntest evaluation instrument: did the agent produce the correct final output? It's easy to measure, easy to understand, and necessary for any evaluation framework — but it's insufficient on its own. An agent that gets the right answer for the wrong reasons (lucky tool call sequence, LLM that bypassed the intended reasoning path) will score 100% on task success but will fail on the next distribution shift.

Trajectory evaluation scores the sequence of actions the agent took to reach its answer, not just the answer itself. A trajectory eval records every tool call (name, parameters, return value), every LLM reasoning step, and every state transition, and compares this against a reference trajectory or a set of trajectory acceptance criteria.

Trajectory eval catches failure modes that final-answer eval misses: an agent that calls an expensive tool unnecessarily (cost concern), an agent that made correct final tool calls but for incorrect reasons documented in its chain-of-thought (reliability concern), or an agent that skipped a required safety check step (compliance concern).

The practical combination: use task success rate as your headline metric and SLA gate (below 90% is not shippable), and use trajectory eval for root cause analysis and regression detection. When task success rate drops, trajectory diffs (comparing passing vs failing trajectories for the same input) pinpoint which step in the chain introduced the failure. LangSmith's trace comparison feature and Braintrust's span-level evaluation both support trajectory diff analysis in production.
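As a sketch of what trajectory recording and diffing can look like, the snippet below defines a minimal trajectory structure and reports the first step where a failing run diverges from a passing one on the same input. The field names and diff logic are illustrative, not the schema of LangSmith or Braintrust.

```python
# Sketch: a minimal trajectory record and a step-level diff. Field names are illustrative.
from dataclasses import dataclass
from typing import Any

@dataclass
class Step:
    tool: str
    params: dict[str, Any]
    result: Any

@dataclass
class Trajectory:
    input: str
    steps: list[Step]
    final_answer: str
    success: bool  # outcome of the task success eval

def diff_trajectories(passing: Trajectory, failing: Trajectory) -> list[str]:
    """Report the first point where a failing run diverges from a passing one."""
    findings = []
    for i, (p, f) in enumerate(zip(passing.steps, failing.steps)):
        if p.tool != f.tool:
            findings.append(f"step {i}: called {f.tool!r} instead of {p.tool!r}")
            break
        if p.params != f.params:
            findings.append(f"step {i}: same tool {p.tool!r}, different params")
            break
    if not findings and len(passing.steps) != len(failing.steps):
        findings.append(f"step count differs: {len(passing.steps)} vs {len(failing.steps)}")
    return findings
```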

LLM-as-Judge: The Practical Standard for Open-Ended Tasks

For many agent tasks — answer quality, response tone, explanation clarity, factual grounding — there is no single correct output that a string comparison can verify. The practical solution is LLM-as-judge: use a separate, more capable LLM to evaluate the agent's output against a rubric.

A well-designed LLM judge prompt specifies the evaluation dimension (e.g., factual accuracy, task completion, safety), a scoring rubric with concrete examples per score level, the agent's output, and any reference material (the retrieved context, or the ground truth answer if available). The judge returns a score (1–5 or 0–1) and a brief explanation of its reasoning.

LLM-as-judge has known limitations: judge models have their own biases, such as preferring longer, more confident-sounding outputs and correlating score with writing quality rather than factual accuracy. Mitigations: use a judge from a different model family than the agent (avoid judging GPT-4o output with GPT-4o), calibrate judge scores against a set of human-labeled examples to detect systematic biases, and require the judge to output its reasoning before the score (chain-of-thought reasoning improves judge accuracy significantly). LLM-as-judge is also non-deterministic — run each evaluation multiple times and average the scores when precision matters.

For high-stakes domains (medical, legal, financial), LLM-as-judge should be one signal among several, not the only quality gate. The AI Readiness Assessment on AgentList includes a checklist for establishing eval coverage before going to production.
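A minimal judge implementation, assuming the OpenAI Python SDK, might look like the sketch below. The model name, rubric wording, and 1-5 scale are placeholders; in practice the judge should come from a different model family than the agent under test and be calibrated against human labels.

```python
# Sketch of an LLM-as-judge scorer (OpenAI SDK assumed; model and rubric are placeholders).
import json
from openai import OpenAI

client = OpenAI()

JUDGE_PROMPT = """You are grading an AI agent's answer for factual accuracy.
Rubric: 5 = fully correct and grounded in the reference; 3 = partially correct
with minor unsupported claims; 1 = contradicts the reference or fabricates facts.

Reference material:
{reference}

Agent answer:
{answer}

Explain your reasoning in 2-3 sentences, then output a JSON object on the last
line of your reply, e.g. {{"score": 4}}"""

def judge_factual_accuracy(answer: str, reference: str, runs: int = 3) -> float:
    """Average several judge runs, since the judge itself is non-deterministic."""
    scores = []
    for _ in range(runs):
        resp = client.chat.completions.create(
            model="gpt-4.1",  # placeholder; prefer a different family than the agent
            messages=[{
                "role": "user",
                "content": JUDGE_PROMPT.format(reference=reference, answer=answer),
            }],
        )
        last_line = resp.choices[0].message.content.strip().splitlines()[-1]
        scores.append(json.loads(last_line)["score"])
    return sum(scores) / len(scores)
```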

Specialized Eval Tools: Ragas, PromptFoo, Langfuse, Braintrust

The eval tooling ecosystem has matured significantly. Each tool occupies a distinct niche.

Ragas is purpose-built for RAG evaluation: it computes context precision, context recall, faithfulness, and answer relevancy using LLM-based scoring without requiring human-labeled ground truth. Its primary limitation is that it's RAG-specific — it doesn't evaluate general agent behavior.

PromptFoo is a prompt-level testing and comparison tool. It excels at A/B testing prompt variants, running assertions across a test set with customizable eval functions, and detecting regressions when prompts are updated. For teams doing frequent prompt iteration, PromptFoo integrates into CI/CD pipelines with a simple YAML configuration and provides diff views between prompt versions.

Langfuse is a full-stack observability and evaluation platform: it provides trace logging (capturing every LLM call, tool invocation, and output), annotation workflows for human labeling, and LLM-as-judge scoring applied to logged traces. Its open-source core is self-hostable, which matters for teams with data residency requirements.

Braintrust takes a developer-first approach: evaluations are written as TypeScript or Python code, run against datasets stored in Braintrust, and produce experiment results with score distributions, examples, and diffs against baseline experiments. It integrates well with CI and has a strong span-level evaluation model for complex multi-step agents.

The recommendation: use Langfuse for production trace collection and monitoring, Braintrust for structured offline experiments and regression testing, PromptFoo for prompt-level A/B testing, and Ragas for RAG-specific pipeline evaluation.
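As one example from this list, here is a sketch of scoring a RAG pipeline with Ragas. It follows the widely documented 0.1-style API (question/answer/contexts/ground_truth columns on a Hugging Face Dataset); the interface has changed across versions, so check the current docs before copying it.

```python
# Sketch: RAG evaluation with Ragas (0.1-style API; treat as illustrative).
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

eval_data = Dataset.from_dict({
    "question": ["What is the refund window for annual plans?"],
    "answer": ["Annual plans can be refunded within 30 days of purchase."],
    "contexts": [["Refund policy: annual subscriptions are refundable for 30 days."]],
    "ground_truth": ["Annual plans are refundable within 30 days."],
})

result = evaluate(
    eval_data,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)
print(result)
```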

Building a Golden Test Set

A golden test set is the foundation of all reliable agent evaluation. It's a curated set of inputs with verified correct outputs (and optionally, reference trajectories) that represents the distribution of tasks your agent will encounter in production. Building a useful golden set requires discipline across five dimensions.

Coverage: the test set must cover the full range of task types the agent handles, including edge cases and failure modes you've already identified. A golden set that only covers happy paths will miss the regressions that matter.

Ground truth quality: each expected output must be verified by a domain expert, not just generated by the agent and assumed correct. Bootstrapping is fine — generate agent outputs, then have experts review and correct them — but do not skip the expert review step.

Distribution balance: if 80% of your production queries are a single intent type, your golden set should reflect that distribution, not be uniformly distributed across intent types. This ensures your headline metrics reflect what users actually experience.

Difficulty stratification: tag each example as easy, medium, or hard, and track eval metrics by difficulty tier separately. Regressions on hard examples are more concerning than regressions on easy ones.

Regular refresh: golden sets go stale as user behavior evolves. Schedule a quarterly review to add new examples from recent production failures and remove examples that are no longer representative.

A starting target for a production golden set is 200–500 examples. Below 100, metric variance makes it hard to distinguish real regressions from noise. Above 2,000, the cost of maintaining ground truth quality becomes burdensome — invest in infrastructure for semi-automated label curation instead.
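A golden set is ultimately just structured data. The sketch below shows one plausible record layout and how to report success rate per difficulty tier; the field names are illustrative, not a specific tool's schema.

```python
# Sketch: a golden test set record and per-difficulty success rates.
from dataclasses import dataclass, field
from collections import defaultdict

@dataclass
class GoldenExample:
    id: str
    input: str
    expected_output: str          # verified by a domain expert
    intent: str                   # used to check distribution balance
    difficulty: str               # "easy", "medium", or "hard"
    reference_trajectory: list[str] = field(default_factory=list)  # optional tool sequence

def success_rate_by_difficulty(examples: list[GoldenExample], results: dict[str, bool]) -> dict[str, float]:
    """results maps example id -> task success (True/False)."""
    totals, passed = defaultdict(int), defaultdict(int)
    for ex in examples:
        totals[ex.difficulty] += 1
        passed[ex.difficulty] += int(results.get(ex.id, False))
    return {tier: passed[tier] / totals[tier] for tier in totals}
```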

Regression Testing in CI

Agent eval should run on every pull request that modifies prompts, tools, agent logic, or upstream dependencies — the same discipline as unit and integration tests for regular software. The CI eval pipeline has three components.

A fast eval subset: run a curated 50–100 example subset of the golden set on every PR. This should complete in under 5 minutes to keep the feedback loop tight. Use cached tool responses (recorded from real API calls in a fixtures file) so the eval is deterministic and doesn't depend on external service availability.

A full eval run on main: run the complete golden set on every merge to the main branch. This is the authoritative quality signal. Score thresholds are enforced as CI gates: a merge that drops task success rate below 88% or faithfulness below 0.82 fails the CI check and requires explicit sign-off to merge.

A nightly extended eval: run additional adversarial and stress test scenarios that are too slow for PR-level gates but important to catch over time — prompt injection attempts, edge cases from the previous week's production failures, and load tests of tool call latency.

The CI integration requires that your agent be invocable in a test harness that can inject fixture data, record outputs, and compare against the golden set. Design your agent's tool layer with dependency injection in mind from the start — this is nearly impossible to retrofit onto a tightly coupled implementation. LangSmith, Braintrust, and Langfuse all provide CI integration hooks; PromptFoo has native GitHub Actions support with a published action.
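A minimal sketch of the PR-level gate, assuming a hypothetical `run_agent` harness that accepts an injected tool backend, fixture files recorded from real API calls, and the 88% threshold mentioned above. The grading here is a simple containment check for illustration; in practice it would be a rubric or judge call.

```python
# Sketch: deterministic PR-level eval gate with cached tool fixtures.
# run_agent and the file paths are hypothetical; the threshold mirrors the text above.
import json

with open("fixtures/tool_responses.json") as f:
    CACHED_TOOLS = json.load(f)   # recorded real API responses, keyed by tool + args

with open("eval/golden_subset.json") as f:
    GOLDEN_SUBSET = json.load(f)  # the 50-100 example fast subset

def cached_tool_call(tool: str, **kwargs):
    """Deterministic stand-in for live tool calls during CI eval."""
    return CACHED_TOOLS[f"{tool}:{json.dumps(kwargs, sort_keys=True)}"]

def test_fast_eval_gate(run_agent):
    passed = 0
    for ex in GOLDEN_SUBSET:
        result = run_agent(ex["input"], tool_backend=cached_tool_call)
        # Naive grading for illustration; swap in a rubric or LLM judge here.
        passed += int(ex["expected_output"].lower() in result.answer.lower())
    success_rate = passed / len(GOLDEN_SUBSET)
    assert success_rate >= 0.88, f"task success rate {success_rate:.1%} is below the CI gate"
```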

Production Dashboard: What Metrics to Track

A production agent dashboard has two layers: real-time operational metrics and slower-moving quality metrics.

Operational metrics (tracked per minute): total invocations, error rate (failed completions), p50/p95/p99 latency, token consumption rate (a leading indicator of cost), and tool call failure rate per tool.

Quality metrics (tracked daily, derived from sampled eval): task success rate on sampled production queries, LLM-as-judge scores by task type, context faithfulness score for RAG-backed answers, hallucination detection rate (any answer flagged by automated groundedness checks), and user feedback signal (explicit thumbs up/down ratio, or implicit signals like follow-up question rate as a proxy for unsatisfactory responses).

Anomaly detection: set alert thresholds at 2 standard deviations from the 30-day rolling mean for all quality metrics. A sudden spike in tool call failure rate often precedes a quality degradation — it's a leading indicator. Track cost per successful task, not just total cost — this normalizes for volume changes and makes cost efficiency visible. A cost-per-task spike while task success rate holds steady means the agent is taking more steps to complete the same tasks — an efficiency regression that won't show up in accuracy metrics alone.

The Performance Scorecard on AgentList provides a structured template for presenting these metrics to engineering leadership and business stakeholders, with the appropriate framing for each audience.
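The two calculations called out above (the 2-sigma alert band and cost per successful task) are straightforward to compute from a daily metrics table. This sketch assumes pandas and illustrative column names.

```python
# Sketch: anomaly band and cost-per-successful-task from a daily metrics table.
import pandas as pd

daily = pd.read_csv("daily_quality_metrics.csv", parse_dates=["date"]).set_index("date")
# expected columns include: task_success_rate, total_cost_usd, successful_tasks

# Alert when today's value sits more than 2 standard deviations away from
# the 30-day rolling mean.
rolling_mean = daily["task_success_rate"].rolling("30D").mean()
rolling_std = daily["task_success_rate"].rolling("30D").std()
latest = daily["task_success_rate"].iloc[-1]
if abs(latest - rolling_mean.iloc[-1]) > 2 * rolling_std.iloc[-1]:
    print("ALERT: task success rate outside the 2-sigma band")

# Cost per successful task normalizes for volume changes and surfaces
# efficiency regressions that accuracy metrics alone will miss.
daily["cost_per_successful_task"] = daily["total_cost_usd"] / daily["successful_tasks"]
print(daily["cost_per_successful_task"].tail())
```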

Related Resources

Find agencies that specialize in the frameworks and use cases covered in this article.
