What it is
A reasoning model is an LLM that is post-trained (usually with reinforcement learning) to emit a structured internal reasoning trace before its final answer. The user typically does not see the trace; what they see is a higher-quality response that took longer to produce. Examples include OpenAI's o1 and o3 families, Anthropic's extended-thinking modes (in Claude Opus 4.x and Sonnet 4.x), DeepSeek R1, Qwen QwQ, Google Gemini Deep Think, and xAI's reasoning variants. Reasoning models are usually slower and more expensive per answer than non-reasoning models of similar parameter count, but they deliver measurably better performance on benchmarks involving multi-step deduction (AIME math, GPQA science, SWE-bench code, ARC-AGI puzzles). They are the most prominent commercial application of the broader "test-time compute" principle.
Why it matters
Reasoning models inverted the assumption that "answer in one shot" was the only inference pattern. For complex work — debugging code across files, writing a multi-section research report, planning a multi-step agent workflow — they produce qualitatively different output than fast chat models. For routine work — extracting fields from a document, classifying support tickets, simple Q&A — they're slower and pricier without delivering any benefit. The operational decision in 2026 is no longer "which one model do we use" but "which workloads warrant a reasoning model and which don't." That decision is the routing layer's job; a substrate that captures cost-per-task and quality-per-task lets that layer learn rather than guess.
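A routing layer like the one described above can be sketched as a toy policy. Everything here is an assumption for illustration: the model names, task categories, quality scores, and the 0.1 quality-edge threshold are hypothetical, not any real product's logic. The idea is simply that recorded quality-per-task replaces guesswork once enough history exists.

```python
from dataclasses import dataclass, field
from statistics import mean

# Hypothetical model identifiers -- illustrative only.
REASONING_MODEL = "reasoning-large"
FAST_MODEL = "chat-fast"

@dataclass
class Router:
    """Routes a task category to a reasoning or fast model based on
    observed quality, falling back to a keyword heuristic when no
    history exists for that category."""
    # category -> list of (model, quality score in [0, 1])
    history: dict = field(default_factory=dict)

    def record(self, category: str, model: str, quality: float) -> None:
        """Log one completed task's quality for later routing."""
        self.history.setdefault(category, []).append((model, quality))

    def route(self, category: str) -> str:
        by_model: dict = {}
        for model, q in self.history.get(category, []):
            by_model.setdefault(model, []).append(q)
        if REASONING_MODEL in by_model and FAST_MODEL in by_model:
            # Enough evidence: prefer the cheaper model unless the
            # reasoning model's quality edge is substantial.
            edge = mean(by_model[REASONING_MODEL]) - mean(by_model[FAST_MODEL])
            return REASONING_MODEL if edge > 0.1 else FAST_MODEL
        # Cold start: crude heuristic on the task category name.
        multi_step = {"debugging", "research-report", "agent-planning"}
        return REASONING_MODEL if category in multi_step else FAST_MODEL

router = Router()
print(router.route("ticket-classification"))  # cold start -> chat-fast
router.record("extraction", REASONING_MODEL, 0.91)
router.record("extraction", FAST_MODEL, 0.90)
print(router.route("extraction"))  # 0.01 quality edge -> chat-fast
```

In practice the "quality" signal would come from evals or user feedback, and the threshold would itself be tuned against cost data rather than hard-coded.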
Key components
- Internal reasoning traces — structured thinking tokens emitted before the user-facing answer
- RL post-training — the technique that produces reasoning behavior from a base model
- Benchmark uplift — disproportionate gains on math, code, science, multi-step deduction
- Cost and latency tradeoff — slower and pricier per answer, but higher quality on the right workloads
- Routing implications — workload classification becomes a first-class operational concern
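The cost-and-latency tradeoff in the list above can be made concrete with back-of-envelope arithmetic. All prices and token counts below are hypothetical; real pricing varies by vendor. The key mechanic is that thinking tokens typically bill at the output-token rate, so a long internal trace dominates the cost of a short visible answer.

```python
def cost_per_task(in_tokens: int, out_tokens: int,
                  price_in: float, price_out: float,
                  thinking_tokens: int = 0) -> float:
    """Dollar cost of one task, with prices in $ per million tokens.
    Thinking tokens are billed as output tokens."""
    billable_out = out_tokens + thinking_tokens
    return (in_tokens * price_in + billable_out * price_out) / 1e6

# Same task, two hypothetical models: a cheap fast model and a pricier
# reasoning model that spends 8,000 thinking tokens before answering.
fast = cost_per_task(2_000, 500, price_in=0.25, price_out=1.25)
reasoning = cost_per_task(2_000, 500, price_in=3.00, price_out=15.00,
                          thinking_tokens=8_000)
print(f"fast: ${fast:.4f}  reasoning: ${reasoning:.4f}  "
      f"ratio: {reasoning / fast:.0f}x")
```

Under these made-up numbers the reasoning model costs two orders of magnitude more per task, which is why per-task attribution matters before routing everything to the strongest model.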
Related terms
LLM (Large Language Model)
The AI technology behind ChatGPT, Claude, and Agentforce: models trained on massive amounts of text to understand and generate human language.
Agent Operations
The discipline of running AI agents in production — capturing what they do, attributing what it costs, evaluating what they produce, and intervening when something goes wrong. The operational layer above agent observability and orchestration.
Capability Registry
A structured catalog that maps AI capabilities (reasoning, structured output, tool use, vision, long context) to the models that can serve them — the substrate that makes skills portable across LLM vendors.
Token-Maxxing
The deliberate strategy of spending generous amounts of inference tokens — through extended thinking, deep research loops, or multi-shot agentic chains — to maximize output quality. The "more tokens equals better answers" doctrine that emerged with reasoning models. Also spelled "token-maxing."
Test-Time Compute
The principle that an AI model's output quality scales with the amount of compute it spends at inference time — not just with the size of the model. The architectural shift behind reasoning models like o1, Claude with extended thinking, and DeepSeek R1.
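One simple form of test-time compute is repeated sampling with a verifier: draw n candidate answers and keep one the checker accepts. Under idealized assumptions (independent samples, a perfect verifier), per-task success scales as 1 − (1 − p)^n. This is a sketch of the scaling principle, not a claim about how any specific reasoning model spends its compute.

```python
def best_of_n_success(p_single: float, n: int) -> float:
    """Probability that at least one of n independent samples is
    correct, assuming a verifier that always recognizes a correct one."""
    return 1 - (1 - p_single) ** n

# More inference-time samples -> higher task success, model unchanged.
for n in (1, 4, 16, 64):
    print(n, round(best_of_n_success(0.30, n), 3))
```

The curve flattens quickly, which is one reason spending ever more test-time compute yields diminishing returns on any fixed task.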