What it is
Test-driven development is a workflow popularized by Kent Beck in the early 2000s: write a small test that describes the behavior you want, watch it fail, write the smallest amount of production code to make it pass, refactor, repeat. The discipline is summarized as "red, green, refactor" and produces tightly specified code with a built-in regression suite as a byproduct.

In 2025–2026, the same discipline is being adapted to AI systems through "evals-as-tests" or "TDD for agents": before prompting or fine-tuning, define a small set of input/output expectations or quality rubrics; run the agent; grade against them; iterate. Because LLM outputs are nondeterministic, the "tests" are often graded probabilistically (pass-rate over N runs) or against rubrics (quality score above a threshold) rather than exact-match assertions, but the discipline of "specify behavior first, observe failures, then change the system" carries through.
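A minimal sketch of that loop, under stated assumptions: `run_agent()` is a hypothetical stand-in for the prompt or agent under test, and the cases, the string-containment grader, and the 80% pass-rate threshold are illustrative placeholders rather than a standard.

```python
# Evals-as-tests sketch: behavior is specified first, then graded over N runs,
# because a single LLM completion can pass or fail by chance.

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to the prompt/model/agent under test.
    # Left unwired so the eval starts "red", as in classic TDD.
    return ""

EVAL_CASES = [
    {"input": "Cancel my subscription", "must_contain": "cancel"},
    {"input": "What's my account balance?", "must_contain": "balance"},
]

def passes(case: dict, output: str) -> bool:
    """Grade a single run; a string-containment check stands in for a real grader."""
    return case["must_contain"].lower() in output.lower()

def pass_rate(case: dict, runs: int = 10) -> float:
    """Pass-rate over N runs instead of a single exact-match assertion."""
    hits = sum(passes(case, run_agent(case["input"])) for _ in range(runs))
    return hits / runs

def test_agent_behavior():
    # Stays "red" until the real agent is wired in and clears the threshold on every case.
    for case in EVAL_CASES:
        assert pass_rate(case) >= 0.8, f"Regression on: {case['input']!r}"
```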
Why it matters
For traditional code, TDD compresses the design loop, surfaces edge cases early, and produces a regression suite that prevents future breakage. For AI agents, the same loop is even more valuable: outputs are nondeterministic, prompts are brittle, model swaps shift behavior in ways unit tests cannot catch, and quality regressions in production are expensive to detect after the fact. Teams that build evals first — before shipping the prompt or the agent — gain a regression boundary that survives prompt edits, model upgrades, and vendor swaps. Teams that ship without evals discover regressions through customer complaints or cost spikes. As LLM vendors release new models monthly, an eval-first agent codebase compounds in value the way a unit test suite did for traditional codebases in the 2000s.
Key components
- Red-green-refactor — the core TDD loop applied to traditional code
- Eval definition — input/output examples or quality rubrics that describe expected behavior
- Probabilistic grading — pass-rate over N runs, suitable for nondeterministic LLM outputs
- Rubric-based scoring — graded quality scores against criteria rather than exact-match assertions (see the sketch after this list)
- Continuous evaluation — evals run on every prompt change, model swap, and deployment
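A sketch of rubric-based scoring, assuming a hypothetical per-criterion `judge()` grader (in practice often an LLM-as-judge call or a handwritten heuristic); the criteria and the 0.7 threshold are illustrative.

```python
# Rubric-based scoring sketch: grade an output against named criteria and
# compare the averaged quality score to a threshold, instead of exact-match.
from statistics import mean

RUBRIC = [
    "answers the user's question",
    "uses only facts from the provided context",
    "keeps a polite, concise tone",
]

def judge(criterion: str, output: str) -> float:
    # Hypothetical per-criterion grader returning a 0.0-1.0 score.
    return 1.0 if output.strip() else 0.0

def rubric_score(output: str) -> float:
    """Average per-criterion scores into a single quality score."""
    return mean(judge(c, output) for c in RUBRIC)

def meets_bar(output: str, threshold: float = 0.7) -> bool:
    """Pass/fail used by the eval suite: quality score above a threshold."""
    return rubric_score(output) >= threshold
```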
Related terms
Prompt Engineering
The practice of crafting precise instructions that guide an AI model's behavior within its capabilities and limitations.
Agent Observability
The practice of inspecting, debugging, and understanding AI agent behavior at runtime by consuming agent telemetry — traces, metrics, logs, and events — through dashboards, alerts, and evaluation tools.
Agent Operations
The discipline of running AI agents in production — capturing what they do, attributing what it costs, evaluating what they produce, and intervening when something goes wrong. The operational layer above agent observability and orchestration.
Vendor-Neutral AI
An architecture pattern where AI capabilities — skills, agents, evaluations — are defined separately from the LLM vendor that runs them, so the same capability can execute on Anthropic, OpenAI, xAI, Gemini, or local models without rewriting.