Reasoning
Overview
Reasoning models like OpenAI o1 and o3, Google’s Gemini 2.0 Flash Thinking, and DeepSeek’s r1 have some additional options that can be used to tailor their behaviour. In some cases they also make available full or partial reasoning traces for the chains of thought that led to their responses.
This article covers using these models with Inspect—note that the state of support for reasoning models is very early and uneven across providers. As such we’ll note below which providers are known to support various options and behaviours.
Reasoning Effort
OpenAI o1 and o3 models support a `reasoning_effort` field that can be set to `low`, `medium`, or `high`.
Gemini 2.0 Flash Thinking does not yet support an option to configure reasoning effort.
DeepSeek has also indicated that support for the `reasoning_effort` option will be available soon. Presumably this option will also become available from other services hosting r1 models over time.
To support OpenAI today and other providers in the future, Inspect now includes the `reasoning_effort` field when using the OpenAI provider (as many services, including DeepSeek and Together AI, are accessed using the OpenAI provider).
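To make the plumbing concrete, here is a minimal sketch of what an OpenAI-compatible chat completions request body looks like with `reasoning_effort` set (the model name and prompt are illustrative, not from the article):

```python
import json

# Sketch of an OpenAI-compatible request payload that sets reasoning_effort.
# Services accessed via the OpenAI provider (e.g. DeepSeek, Together AI)
# would receive a payload shaped like this.
request = {
    "model": "o3-mini",  # illustrative model name
    "messages": [
        {"role": "user", "content": "Prove that the square root of 2 is irrational."}
    ],
    "reasoning_effort": "high",  # one of "low", "medium", "high"
}

print(json.dumps(request, indent=2))
```
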
Reasoning Traces
In some cases reasoning models provide traces of their chain of thought as part of assistant responses. When available, these traces are provided by Inspect in a new `reasoning` field of `ChatMessageAssistant`.
Reasoning traces are currently captured in two ways:
1. From OpenAI compatible provider responses that include a `reasoning` or `reasoning_content` field in the assistant message (the latter is currently provided by DeepSeek).
2. From `<think></think>` tags included in the main assistant `content` (this is how Together, Groq, and Ollama currently present reasoning traces).
Gemini 2.0 Flash Thinking currently includes its reasoning inline with response content (there is currently no structured way to extract it, although this seems likely to change in the future).
We have confirmed that `reasoning` traces can be extracted from models using the `together`, `groq`, and `ollama` providers.
We would like to confirm this for other providers (e.g. `bedrock` and `azureai`) but have not yet been able to access reasoning models on those services (we’d very much welcome others contributing here, either to confirm that things work or to provide PRs which make the required changes).
Reasoning History
Model APIs do not yet have fields for representing reasoning content in chat history, so it isn’t possible to replay previous reasoning traces in a structured way. Nevertheless, replaying this content is likely useful (e.g. the Gemini Flash Thinking docs encourage replay of reasoning history).
To enable models to see their previous reasoning traces, Inspect will by default include reasoning in `<think></think>` tags when replaying chat history to models. This behaviour can be disabled by a new `reasoning_history` option on `GenerateConfig`.
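The replay behaviour can be sketched as follows. This is a simplified illustration of the idea (not Inspect’s actual code); it assumes a dict-shaped assistant message with a `reasoning` field:

```python
def render_history_message(message: dict, reasoning_history: bool = True) -> dict:
    """Fold a prior reasoning trace back into message content for replay.

    When reasoning_history is True (the default), prior reasoning is
    wrapped in <think></think> tags so the model can see its earlier
    chain of thought. Simplified sketch for illustration only.
    """
    reasoning = message.get("reasoning")
    content = message.get("content", "")
    if reasoning and reasoning_history:
        content = f"<think>\n{reasoning}\n</think>\n{content}"
    return {"role": message.get("role", "assistant"), "content": content}
```
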