Reasoning
Overview
Reasoning models like OpenAI o1 and o3, Anthropic’s Claude 3.7 Sonnet, Google’s Gemini 2.0 Flash Thinking, and DeepSeek’s R1 have additional options that can be used to tailor their behaviour. In some cases they also make available full or partial reasoning traces for the chains of thought that led to their response.
In this article we’ll first cover the basics of Reasoning Content and Reasoning Options, then cover the usage and options supported by various reasoning models.
Reasoning Content
Many reasoning models allow you to see their underlying chain of thought in a special “thinking” or reasoning block. While reasoning is presented in different ways depending on the model, in the Inspect API it is normalised into ContentReasoning blocks which are parallel to ContentText, ContentImage, etc.
Reasoning blocks are presented in their own region in both Inspect View and in terminal conversation views.
While reasoning content isn’t made available in a standard fashion across models, Inspect does attempt to capture it using several heuristics, including responses that include a `reasoning` or `reasoning_content` field in the assistant message, assistant content that includes `<think></think>` tags, as well as using explicit APIs for models that support them (e.g. Claude 3.7).
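For example, here is a minimal sketch of reading reasoning blocks from a response (this assumes that `ContentReasoning` can be imported from `inspect_ai.model` and exposes its text via a `reasoning` attribute):

```python
import asyncio

from inspect_ai.model import ContentReasoning, GenerateConfig, get_model

async def show_reasoning() -> None:
    # extended thinking must be requested explicitly for Claude 3.7
    model = get_model(
        "anthropic/claude-3-7-sonnet-latest",
        config=GenerateConfig(reasoning_tokens=4096),
    )
    output = await model.generate("What is 16 * 17?")

    # assistant content is either a plain string or a list of content blocks
    content = output.message.content
    if not isinstance(content, str):
        for block in content:
            if isinstance(block, ContentReasoning):
                print("REASONING:", block.reasoning)

asyncio.run(show_reasoning())
```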
Reasoning Options
The following reasoning options are available from the CLI and within GenerateConfig:
| Option | Description | Default | Models |
|---|---|---|---|
| `reasoning_effort` | Constrains effort on reasoning for reasoning models (`low`, `medium`, or `high`) | `medium` | OpenAI o1/o3 |
| `reasoning_tokens` | Maximum number of tokens to use for reasoning. | (none) | Claude 3.7 |
| `reasoning_history` | Include reasoning in message history sent to model (`none`, `all`, `last`, or `auto`) | `auto` | All models |
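From Python, these options can also be set when obtaining a model directly; a minimal sketch (the model name and values here are purely illustrative):

```python
from inspect_ai.model import GenerateConfig, get_model

# configure reasoning options for a specific model instance
model = get_model(
    "openai/o3-mini",
    config=GenerateConfig(
        reasoning_effort="high",
        reasoning_history="last",
    ),
)
```

From the CLI, the same options are passed as flags (e.g. `--reasoning-effort high`), as shown in the Claude example later in this article.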
As you can see from above, models have different means of specifying the tokens to allocate for reasoning (`reasoning_effort` and `reasoning_tokens`). The two options don’t map precisely onto each other, so if you are doing an evaluation with multiple reasoning models you should specify both. For example:
eval(
    task,
    model=["openai/o3-mini", "anthropic/claude-3-7-sonnet-20250219"],
    reasoning_effort="medium",
    reasoning_tokens=4096
)
The `reasoning_history` option lets you control how much of the model’s previous reasoning is presented in the message history sent to generate(). The default is `auto`, which uses a provider-specific recommended default (normally `all`). Use `last` if you want to prevent previous reasoning from overwhelming the context window.
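For example, an illustrative run against the hosted DeepSeek-R1 model (listed later in this article) that replays only the most recent reasoning block:

```python
from inspect_ai import eval

# keep only the last reasoning block in the replayed message history
eval(task, model="together/deepseek-ai/DeepSeek-R1", reasoning_history="last")
```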
OpenAI o1/o3
OpenAI has several reasoning models available including the o1 and o3 series (`openai/o1`, `openai/o1-mini`, and `openai/o3-mini`). Learn more about the specific models available in the OpenAI Models documentation.
You can condition the amount of reasoning done via the `reasoning_effort` option, which can be set to `low`, `medium`, or `high` (the default is `medium` if not specified).
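For example, a sketch that requests more reasoning effort from o3-mini (the `task` variable is assumed to be defined elsewhere):

```python
from inspect_ai import eval

# ask the model to spend additional effort on its chain of thought
eval(task, model="openai/o3-mini", reasoning_effort="high")
```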
OpenAI models currently do not have provision for displaying reasoning content or replaying it to the model.
Claude 3.7 Sonnet
Anthropic’s Claude 3.7 Sonnet model includes optional support for extended thinking. Unlike other reasoning models, Claude 3.7 Sonnet is a hybrid model that supports both normal and reasoning modes. This means that you need to explicitly request reasoning by specifying the `reasoning_tokens` option, for example:
inspect eval math.py \
--model anthropic/claude-3-7-sonnet-latest \
--reasoning-tokens 4096
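The same thing can be done from Python; a sketch mirroring the CLI example above:

```python
from inspect_ai import eval

# enable extended thinking with a 4096 token budget
eval("math.py", model="anthropic/claude-3-7-sonnet-latest", reasoning_tokens=4096)
```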
Tokens
The `max_tokens` for any given request is determined as follows:

- If you only specify `reasoning_tokens`, then the `max_tokens` will be set to `4096 + reasoning_tokens` (as 4096 is the standard Inspect default for Anthropic max tokens).
- If you explicitly specify a `max_tokens`, that value will be used as the max tokens without modification (so it should accommodate sufficient space for both your `reasoning_tokens` and normal output).
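For instance, under the first rule a run with `--reasoning-tokens 8192` and no explicit max tokens results in a `max_tokens` of 4096 + 8192 = 12288. To leave more room for visible output you can pass `max_tokens` explicitly; a sketch:

```python
from inspect_ai import eval

# 8192 tokens of thinking plus up to 8192 tokens of normal output
eval(
    "math.py",
    model="anthropic/claude-3-7-sonnet-latest",
    reasoning_tokens=8192,
    max_tokens=16384,
)
```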
Inspect will automatically use response streaming whenever extended thinking is enabled to mitigate against networking issues that can occur for long-running requests.
History
Note that Anthropic requests that all reasoning blocks be played back to the model in chat conversations (although it will only use the last reasoning block and will not bill for tokens on previous ones). Consequently, the `reasoning_history` option has no effect for Claude 3.7 models (it effectively always uses `last`).
Tools
When using tools, you should read Anthropic’s documentation on extended thinking with tool use. In short, thinking occurs on the first assistant turn and then the normal tool loop is run without additional thinking. Thinking is re-triggered when the tool loop is exited (i.e. when a user message without a tool result is received).
Google Flash Thinking
Google currently makes available a single experimental reasoning model (Gemini Flash Thinking) which you can access using the model name `google/gemini-2.0-flash-thinking-exp`.
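For example, a minimal sketch of running an evaluation against this model:

```python
from inspect_ai import eval

eval("math.py", model="google/gemini-2.0-flash-thinking-exp")
```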
There aren’t currently options for reasoning effort or reasoning tokens. By default, Gemini includes all reasoning in the model history and recommends that it all be included in subsequent requests in a conversation.
Inspect captures reasoning blocks from Gemini using the “Final Answer:” delimiter currently used by Gemini 2.0 Flash Thinking (the API has a separate field for `thinking` but it is not currently used in responses).
DeepSeek-R1
DeepSeek-R1 is an open-weights reasoning model from DeepSeek. It is generally available either in its original form or as one of several distillations of R1 onto smaller open-weights models (e.g. Qwen or Llama based models).
DeepSeek models can be accessed directly using their OpenAI interface. Further, a number of model hosting providers supported by Inspect make DeepSeek available, for example:
| Provider | Model |
|---|---|
| Together AI | `together/deepseek-ai/DeepSeek-R1` (docs) |
| Groq | `groq/deepseek-r1-distill-llama-70b` (docs) |
| Ollama | `ollama/deepseek-r1:<tag>` (docs) |
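For example, a sketch running an evaluation against the Together AI hosted R1 (this assumes the relevant provider credentials are already configured):

```python
from inspect_ai import eval

eval("math.py", model="together/deepseek-ai/DeepSeek-R1")
```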
There isn’t currently a way to customise the `reasoning_effort` of DeepSeek models, although they have indicated that this will be available soon.
Reasoning content from DeepSeek models is captured using either the `reasoning_content` field made available by the hosted DeepSeek API or the `<think>` tags used by various hosting providers.
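As an illustration of the latter heuristic (not Inspect’s actual implementation), a minimal sketch of splitting `<think>` tags out of raw assistant text:

```python
import re

# hypothetical helper: separate reasoning from the visible answer
def split_think_tags(text: str) -> tuple[str | None, str]:
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    if match is None:
        return None, text
    reasoning = match.group(1).strip()
    answer = (text[: match.start()] + text[match.end():]).strip()
    return reasoning, answer

reasoning, answer = split_think_tags("<think>2 + 2 = 4</think>The answer is 4.")
print(reasoning)  # "2 + 2 = 4"
print(answer)     # "The answer is 4."
```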