Test-Driven Development (TDD)

A software discipline where you write a failing test before writing the code that makes it pass — used for decades in traditional development and now extending to AI agents through "evals-as-tests."

What it is

Test-driven development is a workflow popularized by Kent Beck in the early 2000s: write a small test that describes the behavior you want, watch it fail, write the smallest amount of production code to make it pass, refactor, repeat. The discipline is summarized as "red, green, refactor" and produces tightly specified code with a built-in regression suite as a byproduct.

In 2025–2026, the same discipline is being adapted to AI systems through "evals-as-tests" or "TDD for agents": before prompting or fine-tuning, define a small set of input/output expectations or quality rubrics; run the agent; grade against them; iterate. Because LLM outputs are nondeterministic, the "tests" are often graded probabilistically (pass-rate over N runs) or against rubrics (quality score above a threshold) rather than exact-match assertions, but the discipline of "specify behavior first, observe failures, then change the system" carries through.
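The classic loop can be sketched in a few lines. This is an illustrative example, not from any particular codebase: `slugify` is a hypothetical helper invented here to show the order of operations.

```python
# Step 1 (red): write the test first, describing the behavior you want.
# Running it before slugify exists fails — that failure is the starting point.
def test_slugify():
    assert slugify("Hello World") == "hello-world"
    assert slugify("  TDD  Rocks ") == "tdd-rocks"

# Step 2 (green): write the smallest implementation that makes the test pass.
def slugify(text: str) -> str:
    return "-".join(text.lower().split())

# Step 3 (refactor): clean up with the passing test as a safety net, then repeat.
test_slugify()
```

The point is the ordering: the test exists and fails before the production code does, so every behavior in `slugify` is pinned by an assertion from the moment it is written.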

Why it matters

For traditional code, TDD compresses the design loop, surfaces edge cases early, and produces a regression suite that prevents future breakage. For AI agents, the same loop is even more valuable: outputs are nondeterministic, prompts are brittle, model swaps shift behavior in ways unit tests cannot catch, and quality regressions in production are expensive to detect after the fact. Teams that build evals first — before shipping the prompt or the agent — own a regression boundary that survives prompt edits, model upgrades, and vendor swaps. Teams that ship without evals have to discover regressions through customer complaints or cost spikes. As LLM vendors release new models monthly, the value of having an eval-first agent codebase compounds the way the value of a unit test suite compounded for traditional codebases in the 2000s.

Key components

  • Red-green-refactor — the core TDD loop applied to traditional code
  • Eval definition — input/output examples or quality rubrics that describe expected behavior
  • Probabilistic grading — pass-rate over N runs, suitable for nondeterministic LLM outputs
  • Rubric-based scoring — graded quality scores against criteria, vs exact-match assertions
  • Continuous evaluation — evals run on every prompt change, model swap, and deployment
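Probabilistic grading can be sketched as a thin harness around whatever the agent call looks like. Everything below is hypothetical: `agent` stands in for a real LLM call (simulated here with a deterministic cycle of outputs so the example is runnable), and the grading function and 70% threshold are placeholder choices.

```python
from itertools import cycle

# Stand-in for a nondeterministic agent. In real use this would call an LLM;
# here a repeating cycle of canned outputs simulates run-to-run variation.
_outputs = cycle(["4", "4", "4", "four"])

def agent(prompt: str) -> str:
    return next(_outputs)

# One eval case: a grading function instead of an exact-match assertion.
def grade(output: str) -> bool:
    return output.strip() == "4"

def pass_rate(prompt: str, n: int = 20) -> float:
    """Run the agent n times and return the fraction of passing runs."""
    passes = sum(grade(agent(prompt)) for _ in range(n))
    return passes / n

# Gate the change like a unit test: fail if quality drops below the threshold.
THRESHOLD = 0.7
rate = pass_rate("What is 2 + 2?")
assert rate >= THRESHOLD, f"pass rate {rate:.0%} below {THRESHOLD:.0%}"
```

Wired into CI, the same assertion runs on every prompt edit and model swap, turning a quality regression into a failing build rather than a production surprise.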
