Test-Time Compute

The principle that an AI model's output quality scales with the amount of compute it spends at inference time — not just with the size of the model. The architectural shift behind reasoning models like o1, Claude with extended thinking, and DeepSeek R1.

What it is

Test-time compute (also "inference-time compute") names the observation that letting an LLM spend more compute when answering a single question — through hidden reasoning chains, multi-step deliberation, repeated sampling, or self-correction — produces meaningfully better answers on hard problems than asking a bigger model once. OpenAI's o1 (2024) and o3, Anthropic's extended-thinking modes, DeepSeek R1, and Google's Gemini Deep Think all operationalize this in different ways: the model produces a long internal reasoning trace before emitting the user-facing answer, and the length of that trace is tunable. The principle generalizes beyond reasoning models to any pattern that spends more inference compute per query: agentic tool-use loops, multi-shot voting, deep research patterns. The economic shift is that quality is no longer purely a training-time investment — runtime spend now directly buys answer quality.
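One of the simplest forms of test-time compute mentioned above is repeated sampling with a vote: ask the same question several times at nonzero temperature and return the most common answer. A minimal sketch — `stub_model` is a hypothetical stand-in for a real LLM call, and the 60% per-sample accuracy is an invented illustration, not a benchmark:

```python
import random
from collections import Counter

def majority_vote(answers):
    """Return the most frequent answer among independent samples."""
    return Counter(answers).most_common(1)[0][0]

def sample_answers(ask, prompt, n, temperature=0.8):
    """Spend n model calls on one question instead of one call."""
    return [ask(prompt, temperature) for _ in range(n)]

# Hypothetical stub standing in for an LLM API: right on ~60% of
# individual samples, so a single call is unreliable.
rng = random.Random(0)
def stub_model(prompt, temperature):
    return "42" if rng.random() < 0.6 else rng.choice(["41", "43"])

votes = sample_answers(stub_model, "What is 6 * 7?", n=15)
print(majority_vote(votes))
```

The point of the sketch: each extra sample costs inference compute, and the vote converts that spend into reliability — the per-query quality knob the surrounding text describes.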

Why it matters

For two decades, the software industry's intuition was that "smarter answers" came from training bigger or better models. Test-time compute breaks that frame: a smaller, faster, cheaper model that's allowed to think longer can outperform a larger model answering immediately. For agent operations, this changes the product surface — the right question is no longer "which model do we pick?" but "how much inference budget do we spend per query, and on which queries?" That's a cost-attribution and routing problem (see capability registry, LLM cost attribution). For SEO and AEO, interest in the term has accelerated rapidly through 2025–2026 as reasoning models became mainstream; teams that understand it can write content that ranks against generic "what is GPT-4" pages.

Key components

  • Reasoning models — models specifically designed to spend hidden compute before answering (o1, o3, R1, extended thinking)
  • Agentic loops — agents that run many tool calls and self-corrections per task, also a form of test-time compute
  • Inference compute as a quality lever — separate from training, separately tunable, separately billable
  • Token-maxxing — the deliberate strategy of spending generous test-time budgets on hard queries
  • Token-thrift — the opposing discipline for high-volume, low-stakes workloads

Need Help Implementing This?

We specialize in putting AI and Agentforce to work for Salesforce customers. Let's talk about your use case.

Book Intro Call