What it is
Test-driven development is a workflow popularized by Kent Beck in the early 2000s: write a small test that describes the behavior you want, watch it fail, write the smallest amount of production code to make it pass, refactor, repeat. The discipline is summarized as "red, green, refactor" and produces tightly specified code with a built-in regression suite as a byproduct.

In 2025–2026, the same discipline is being adapted to AI systems through "evals-as-tests" or "TDD for agents": before prompting or fine-tuning, define a small set of input/output expectations or quality rubrics; run the agent; grade against them; iterate. Because LLM outputs are nondeterministic, the "tests" are often graded probabilistically (pass-rate over N runs) or against rubrics (quality score above a threshold) rather than exact-match assertions, but the discipline of "specify behavior first, observe failures, then change the system" carries through.
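A minimal sketch of that loop, under stated assumptions: `run_agent()` is a hypothetical stand-in for the prompt or agent under test, and the cases, the string-containment grader, and the 80% pass-rate threshold are illustrative placeholders rather than a standard.

```python
# Evals-as-tests sketch: behavior is specified first, then graded over N runs,
# because a single LLM completion can pass or fail by chance.

def run_agent(prompt: str) -> str:
    # Hypothetical stand-in: replace with a call to the prompt/model/agent under test.
    # Left unwired so the eval starts "red", as in classic TDD.
    return ""

EVAL_CASES = [
    {"input": "Cancel my subscription", "must_contain": "cancel"},
    {"input": "What's my account balance?", "must_contain": "balance"},
]

def passes(case: dict, output: str) -> bool:
    """Grade a single run; a string-containment check stands in for a real grader."""
    return case["must_contain"].lower() in output.lower()

def pass_rate(case: dict, runs: int = 10) -> float:
    """Pass-rate over N runs instead of a single exact-match assertion."""
    hits = sum(passes(case, run_agent(case["input"])) for _ in range(runs))
    return hits / runs

def test_agent_behavior():
    # Stays "red" until the real agent is wired in and clears the threshold on every case.
    for case in EVAL_CASES:
        assert pass_rate(case) >= 0.8, f"Regression on: {case['input']!r}"
```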
Why it matters
For traditional code, TDD compresses the design loop, surfaces edge cases early, and produces a regression suite that prevents future breakage. For AI agents, the same loop is even more valuable: outputs are nondeterministic, prompts are brittle, model swaps shift behavior in ways unit tests cannot catch, and quality regressions in production are expensive to detect after the fact. Teams that build evals first — before shipping the prompt or the agent — gain a regression boundary that survives prompt edits, model upgrades, and vendor swaps. Teams that ship without evals discover regressions through customer complaints or cost spikes. As LLM vendors release new models monthly, an eval-first agent codebase compounds in value the way a unit test suite did for traditional codebases in the 2000s.
Key components
- Red-green-refactor — the core TDD loop applied to traditional code
- Eval definition — input/output examples or quality rubrics that describe expected behavior
- Probabilistic grading — pass-rate over N runs, suitable for nondeterministic LLM outputs
- Rubric-based scoring — graded quality scores against criteria rather than exact-match assertions (see the sketch after this list)
- Continuous evaluation — evals run on every prompt change, model swap, and deployment
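A sketch of rubric-based scoring, assuming a hypothetical per-criterion `judge()` grader (in practice often an LLM-as-judge call or a handwritten heuristic); the criteria and the 0.7 threshold are illustrative.

```python
# Rubric-based scoring sketch: grade an output against named criteria and
# compare the averaged quality score to a threshold, instead of exact-match.
from statistics import mean

RUBRIC = [
    "answers the user's question",
    "uses only facts from the provided context",
    "keeps a polite, concise tone",
]

def judge(criterion: str, output: str) -> float:
    # Hypothetical per-criterion grader returning a 0.0-1.0 score.
    return 1.0 if output.strip() else 0.0

def rubric_score(output: str) -> float:
    """Average per-criterion scores into a single quality score."""
    return mean(judge(c, output) for c in RUBRIC)

def meets_bar(output: str, threshold: float = 0.7) -> bool:
    """Pass/fail used by the eval suite: quality score above a threshold."""
    return rubric_score(output) >= threshold
```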
Related terms
Prompt Engineering
The practice of crafting precise instructions that guide an AI model's behavior within its capabilities and limitations.
Agent Observability
The practice of inspecting, debugging, and understanding AI agent behavior at runtime by consuming agent telemetry — traces, metrics, logs, and events — through dashboards, alerts, and evaluation tools.
Agent Operations
The discipline of running AI agents in production — capturing what they do, attributing what it costs, evaluating what they produce, and intervening when something goes wrong. The operational layer above agent observability and orchestration.
Vendor-Neutral AI
An architecture pattern where AI capabilities — skills, agents, evaluations — are defined separately from the LLM vendor that runs them, so the same capability can execute on Anthropic, OpenAI, xAI, Gemini, or local models without rewriting.