What it is
Test-time compute (also "inference-time compute") is the practice of letting an LLM spend more compute when answering a single question — through hidden reasoning chains, multi-step deliberation, repeated sampling, or self-correction. The key finding is that this produces meaningfully better answers on hard problems than simply asking a bigger model once. OpenAI's o1 (2024) and o3, Anthropic's extended-thinking modes, DeepSeek's R1, and Google's Gemini Deep Think all operationalize this in different ways: the model produces a long internal reasoning trace before emitting the user-facing answer, and the length of that trace is tunable. The principle generalizes beyond reasoning models to any pattern that spends more inference compute per query: agentic tool-use loops, multi-shot voting, deep-research patterns. The economic shift is that quality is no longer purely a training-time investment — runtime spend now directly buys answer quality.
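One of the simplest forms of test-time compute is repeated sampling with a majority vote (often called self-consistency): ask the same question several times and return the most common answer. The sketch below is illustrative only — `sample_answer` is a hypothetical stub standing in for a stochastic model API call, with canned outputs so the example runs offline.

```python
from collections import Counter

def sample_answer(question: str, draw: int) -> str:
    # Hypothetical stand-in for one stochastic LLM call; a real version
    # would hit a model API with temperature > 0 and get varied answers.
    simulated = ["17", "21", "17", "17", "21", "17", "17"]
    return simulated[draw % len(simulated)]

def self_consistency(question: str, n_samples: int) -> str:
    """Repeated sampling + majority vote: spend n_samples inference calls
    on one question and return the most common answer. Quality tends to
    rise with n_samples, at a cost of roughly n_samples x the tokens."""
    votes = Counter(sample_answer(question, i) for i in range(n_samples))
    return votes.most_common(1)[0][0]
```

The point is the dial: `n_samples` converts extra runtime spend directly into answer reliability, with no change to the underlying model.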
Why it matters
For years, the software industry's intuition was that "smarter answers" came from training bigger or better models. Test-time compute breaks that frame: a smaller, faster, cheaper model that is allowed to think longer can outperform a larger model answering immediately. For agent operations, this changes the product surface — the right question is no longer "which model do we pick?" but "how much inference budget do we spend per query, and on which queries?" That is a cost-attribution and routing problem (see capability registry, LLM cost attribution). For SEO and AEO, search interest in the term has accelerated through 2025–2026 as reasoning models went mainstream; teams that understand it can write content that ranks against generic "what is GPT-4" pages.
Key components
- Reasoning models — models specifically designed to spend hidden compute before answering (o1, o3, R1, extended thinking)
- Agentic loops — agents that run many tool calls and self-corrections per task, also a form of test-time compute
- Inference compute as a quality lever — separate from training, separately tunable, separately billable
- Token-maxxing — the deliberate strategy of spending generous test-time budgets on hard queries
- Token-thrift — the opposing discipline for high-volume, low-stakes workloads
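The components above combine into a routing decision: per query, pick a budget tier rather than a model. A minimal sketch, assuming a hypothetical three-tier scheme — the tier names echo this glossary, but the token and sample numbers are illustrative, not vendor settings:

```python
from dataclasses import dataclass

@dataclass
class InferenceBudget:
    max_reasoning_tokens: int  # hidden "thinking" budget per query
    samples: int               # how many independent attempts to make

# Illustrative tiers only; real limits are set per vendor and workload.
TIERS = {
    "token-thrift": InferenceBudget(max_reasoning_tokens=0, samples=1),
    "standard": InferenceBudget(max_reasoning_tokens=4_000, samples=1),
    "token-maxxing": InferenceBudget(max_reasoning_tokens=32_000, samples=5),
}

def route(query_stakes: str, query_volume_per_day: int) -> InferenceBudget:
    """Spend generously on hard, high-stakes queries; take the cheapest
    path for high-volume, low-stakes traffic."""
    if query_stakes == "high":
        return TIERS["token-maxxing"]
    if query_volume_per_day > 10_000:
        return TIERS["token-thrift"]
    return TIERS["standard"]
```

The design choice worth noting: the router's inputs are properties of the query (stakes, volume), not of the model — inference budget becomes a per-query product decision.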
Related terms
LLM (Large Language Model)
The AI technology behind ChatGPT, Claude, and the intelligence in Agentforce. Trained on massive amounts of text to understand and generate human language.
Agent Operations
The discipline of running AI agents in production — capturing what they do, attributing what it costs, evaluating what they produce, and intervening when something goes wrong. The operational layer above agent observability and orchestration.
LLM Cost Attribution
The practice of tying every LLM call back to the task, agent, process, or skill that triggered it — across every vendor — so AI spend can be measured against outcomes, not just tokens.
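In code, this attribution amounts to tagging every call with the task and agent that triggered it, then rolling spend up by that key. A minimal sketch under assumed names — `record_call`, `spend_by_task`, and the sample figures are all hypothetical, not a real billing API:

```python
from collections import defaultdict

def record_call(ledger, task_id, agent, vendor, tokens, usd):
    # Tag each LLM call with the task/agent that triggered it, across
    # vendors, so spend can be rolled up against outcomes.
    ledger[(task_id, agent)].append(
        {"vendor": vendor, "tokens": tokens, "usd": usd}
    )

def spend_by_task(ledger):
    # Roll token spend up to the unit of work, not the vendor invoice.
    return {key: sum(c["usd"] for c in calls) for key, calls in ledger.items()}

ledger = defaultdict(list)
record_call(ledger, "ticket-42", "triage-agent", "openai", 1_200, 0.018)
record_call(ledger, "ticket-42", "triage-agent", "anthropic", 8_500, 0.140)
```

With test-time compute in play, this ledger is what reveals whether a generous reasoning budget on "ticket-42" actually paid for itself.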
Token-Maxxing
The deliberate strategy of spending generous amounts of inference tokens — through extended thinking, deep research loops, or multi-shot agentic chains — to maximize output quality. The "more tokens equals better answers" doctrine that emerged with reasoning models. Also spelled "token-maxing."
Reasoning Model
A class of large language model trained to spend hidden internal "thinking" tokens before producing a user-facing answer — often dramatically improving performance on math, code, science, and complex multi-step problems compared to non-reasoning models of similar size.