


Get an instance of a model.

Calls to get_model() are memoized (i.e. a call with the same arguments will return an existing instance of the model rather than creating a new one). You can disable this with memoize=False.

If you prefer to close models immediately after use (and also prevent caching), you can use the async context manager built into the Model class. For example:

async with get_model("openai/gpt-4o") as model:
    response = await model.generate("Say hello")

In this case, the model client will be closed at the end of the context manager and will not be available in the get_model() cache.

def get_model(
    model: str | Model | None = None,
    config: GenerateConfig = GenerateConfig(),
    base_url: str | None = None,
    api_key: str | None = None,
    memoize: bool = True,
    **model_args: Any,
) -> Model
model str | Model | None

Model specification. If a Model is passed it is returned unmodified; if None is passed, the model currently being evaluated is returned (or, if there is no evaluation, the model referred to by INSPECT_EVAL_MODEL).

config GenerateConfig

Configuration for model.

base_url str | None

Optional. Alternate base URL for model.

api_key str | None

Optional. API key for model.

memoize bool

Use/store a cached version of the model based on the parameters to get_model().

**model_args Any

Additional args to pass to model constructor.
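
The memoization described for get_model() can be pictured with a small stand-alone sketch. The names SketchModel, get_sketch_model, and _model_cache are hypothetical and are not part of the library; the way the cache key is built here is an assumption made for illustration only.

```python
from typing import Any


class SketchModel:
    """Hypothetical stand-in for a model client; not the library's Model."""

    def __init__(self, name: str, **model_args: Any) -> None:
        self.name = name
        self.model_args = model_args


# Cache keyed by a string built from the call arguments (assumed scheme).
_model_cache: dict[str, SketchModel] = {}


def get_sketch_model(
    name: str, memoize: bool = True, **model_args: Any
) -> SketchModel:
    # Sort keyword arguments so that their order does not affect the key.
    key = repr((name, sorted(model_args.items())))
    if memoize and key in _model_cache:
        return _model_cache[key]
    model = SketchModel(name, **model_args)
    if memoize:
        # Only memoized calls store the new instance.
        _model_cache[key] = model
    return model
```

Two calls with the same arguments return the same object, while a call with memoize=False always builds a fresh instance and leaves the cache untouched.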


Model interface.

Use get_model() to get an instance of a model. Model provides an async context manager for closing the connection to it after use. For example:

async with get_model("openai/gpt-4o") as model:
    response = await model.generate("Say hello")
class Model


api ModelAPI

Model API.

config GenerateConfig

Generation config.

name str

Model name.



Create a model.

def __init__(self, api: ModelAPI, config: GenerateConfig) -> None
api ModelAPI

Model API provider.

config GenerateConfig

Model configuration.


Generate output from the model.

async def generate(
    input: str | list[ChatMessage],
    tools: list[Tool]
    | list[ToolDef]
    | list[ToolInfo]
    | list[Tool | ToolDef | ToolInfo] = [],
    tool_choice: ToolChoice | None = None,
    config: GenerateConfig = GenerateConfig(),
    cache: bool | CachePolicy = False,
) -> ModelOutput
input str | list[ChatMessage]

Chat message input (if a str is passed it is converted to a ChatMessageUser).

tools list[Tool] | list[ToolDef] | list[ToolInfo] | list[Tool | ToolDef | ToolInfo]

Tools available for the model to call.

tool_choice ToolChoice | None

Directives to the model as to which tools to prefer.

config GenerateConfig

Model configuration.

cache bool | CachePolicy

Caching behavior for generate responses (defaults to no caching).
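
As a rough illustration of the input parameter, the sketch below shows one way a plain string could be wrapped as a single user message before generation. SketchMessage and normalize_input are hypothetical names used only here; they are not the library's ChatMessageUser or its internal conversion code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class SketchMessage:
    """Hypothetical minimal chat message (role plus text content)."""

    role: str
    content: str


def normalize_input(input: Union[str, list[SketchMessage]]) -> list[SketchMessage]:
    # A bare string becomes a one-element conversation with a user message.
    if isinstance(input, str):
        return [SketchMessage(role="user", content=input)]
    # A list of messages is passed through (copied so callers keep their list).
    return list(input)
```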


Model generation options.

class GenerateConfig(BaseModel)


max_retries int | None

Maximum number of times to retry request (defaults to 5).

timeout int | None

Request timeout (in seconds).

max_connections int | None

Maximum number of concurrent connections to Model API (default is model specific).

system_message str | None

Override the default system message.

max_tokens int | None

The maximum number of tokens that can be generated in the completion (default is model specific).

top_p float | None

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.

temperature float | None

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

stop_seqs list[str] | None

Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

best_of int | None

Generates best_of completions server-side and returns the ‘best’ (the one with the highest log probability per token). vLLM only.

frequency_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, and vLLM only.

presence_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. OpenAI, Google, Grok, Groq, and vLLM only.

logit_bias dict[int, float] | None

Map token Ids to an associated bias value from -100 to 100 (e.g. “42=10,43=-10”). OpenAI and Grok only.

seed int | None

Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.

top_k int | None

Randomly sample the next word from the top_k most likely next words. Anthropic, Google, HuggingFace, and vLLM only.

num_choices int | None

How many chat completion choices to generate for each input message. OpenAI, Grok, Google, TogetherAI, and vLLM only.

logprobs bool | None

Return log probabilities of the output tokens. OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, and vLLM only.

top_logprobs int | None

Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Grok, HuggingFace, and vLLM only.

parallel_tool_calls bool | None

Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.

internal_tools bool | None

Whether to automatically map tools to model internal implementations (e.g. ‘computer’ for anthropic).

max_tool_output int | None

Maximum tool output (in bytes). Defaults to 16 * 1024.

cache_prompt Literal['auto'] | bool | None

Whether to cache the prompt prefix. Defaults to “auto”, which will enable caching for requests with tools. Anthropic only.

reasoning_effort Literal['low', 'medium', 'high'] | None

Constrains effort on reasoning for reasoning models. OpenAI o1 models only.

reasoning_history bool | None

Include reasoning in chat message history sent to generate.
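
The sampling options temperature, top_k, and top_p listed above are applied by the model provider, not by this library. The following is a generic sketch of what these three controls conventionally do to a next-token distribution; the function names are hypothetical and the details of any particular provider may differ.

```python
import math


def apply_temperature(logits: list[float], temperature: float) -> list[float]:
    """Softmax over logits / temperature (this sketch requires temperature > 0)."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def _ranked(probs: list[float]) -> list[int]:
    # Token indices ordered from most to least likely.
    return sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)


def top_k_indices(probs: list[float], k: int) -> list[int]:
    """Keep only the k most likely tokens."""
    return _ranked(probs)[:k]


def top_p_indices(probs: list[float], top_p: float) -> list[int]:
    """Keep the smallest set of tokens whose cumulative mass reaches top_p."""
    kept: list[int] = []
    mass = 0.0
    for i in _ranked(probs):
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return kept
```

Lower temperatures concentrate probability on the most likely token, which is why low values give more focused output.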



Merge another model configuration into this one.

def merge(
    self, other: Union["GenerateConfig", GenerateConfigArgs]
) -> "GenerateConfig"
other Union[GenerateConfig, GenerateConfigArgs]

Configuration to merge.
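
A minimal sketch of merge-style semantics follows, assuming that values set in other (i.e. not None) take precedence and that unset values leave the original untouched. SketchConfig is a hypothetical three-field stand-in for illustration; it is not GenerateConfig, and the real implementation may differ.

```python
from dataclasses import dataclass, fields, replace
from typing import Any, Optional, Union


@dataclass(frozen=True)
class SketchConfig:
    """Hypothetical cut-down config with three optional fields."""

    temperature: Optional[float] = None
    max_tokens: Optional[int] = None
    seed: Optional[int] = None

    def merge(self, other: Union["SketchConfig", dict[str, Any]]) -> "SketchConfig":
        # Accept either another config or a dict of keyword-style overrides.
        if isinstance(other, SketchConfig):
            values = {f.name: getattr(other, f.name) for f in fields(other)}
        else:
            values = dict(other)
        # Assumption: only explicitly set (non-None) values override.
        overrides = {k: v for k, v in values.items() if v is not None}
        # replace() returns a new object, so the original stays unchanged.
        return replace(self, **overrides)
```

Note that a value such as temperature=0.0 still counts as set, because the check is against None and not against falsiness.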


Type for kwargs that selectively override GenerateConfig.

class GenerateConfigArgs(TypedDict, total=False)


max_retries int | None

Maximum number of times to retry request (defaults to 5).

timeout int | None

Request timeout (in seconds).

max_connections int | None

Maximum number of concurrent connections to Model API (default is model specific).

system_message str | None

Override the default system message.

max_tokens int | None

The maximum number of tokens that can be generated in the completion (default is model specific).

top_p float | None

An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.

temperature float | None

What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.

stop_seqs list[str] | None

Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.

best_of int | None

Generates best_of completions server-side and returns the ‘best’ (the one with the highest log probability per token). vLLM only.

frequency_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, and vLLM only.

presence_penalty float | None

Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. OpenAI, Google, Grok, Groq, and vLLM only.

logit_bias dict[int, float] | None

Map token Ids to an associated bias value from -100 to 100 (e.g. “42=10,43=-10”). OpenAI and Grok only.

seed int | None

Random seed. OpenAI, Google, Mistral, Groq, HuggingFace, and vLLM only.

top_k int | None

Randomly sample the next word from the top_k most likely next words. Anthropic, Google, and HuggingFace only.

num_choices int | None

How many chat completion choices to generate for each input message. OpenAI, Grok, Google, and TogetherAI only.

logprobs bool | None

Return log probabilities of the output tokens. OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, and vLLM only.

top_logprobs int | None

Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Grok, and HuggingFace only.

parallel_tool_calls bool | None

Whether to enable parallel function calling during tool use (defaults to True). OpenAI and Groq only.

internal_tools bool | None

Whether to automatically map tools to model internal implementations (e.g. ‘computer’ for anthropic).

max_tool_output int | None

Maximum tool output (in bytes). Defaults to 16 * 1024.

cache_prompt Literal['auto'] | bool | None

Whether to cache the prompt prefix. Defaults to “auto”, which will enable caching for requests with tools. Anthropic only.

reasoning_effort Literal['low', 'medium', 'high'] | None

Constrains effort on reasoning for reasoning models. OpenAI o1 models only.

reasoning_history bool | None

Include reasoning in chat message history sent to generate.


Output from model generation.

class ModelOutput(BaseModel)


model str

Model used for generation.

choices list[ChatCompletionChoice]

Completion choices.

usage ModelUsage | None

Model token usage.

time float | None

Time elapsed (in seconds) for call to generate.

metadata dict[str, Any] | None

Additional metadata associated with model output.

error str | None

Error message in the case of content moderation refusals.

stop_reason StopReason

First message stop reason.

message ChatMessageAssistant

First message choice.

completion str

Text of first message choice.
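
The stop_reason, message, and completion attributes are described above as views onto the first entry in choices. A small sketch of that relationship is shown below; the Sketch* classes are hypothetical simplifications, and the handling of an empty choices list is deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class SketchAssistantMessage:
    content: str


@dataclass
class SketchChoice:
    message: SketchAssistantMessage
    stop_reason: str = "stop"


@dataclass
class SketchOutput:
    """Hypothetical stand-in for a model output with convenience properties."""

    model: str
    choices: list[SketchChoice] = field(default_factory=list)

    @property
    def message(self) -> SketchAssistantMessage:
        # First message choice.
        return self.choices[0].message

    @property
    def stop_reason(self) -> str:
        # Stop reason of the first choice.
        return self.choices[0].stop_reason

    @property
    def completion(self) -> str:
        # Text of the first message choice.
        return self.choices[0].message.content
```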



Create ModelOutput from simple text content.

def from_content(
    model: str,
    content: str,
    stop_reason: StopReason = "stop",
    error: str | None = None,
) -> "ModelOutput"
model str

Model name.

content str

Text content from generation.

stop_reason StopReason

Stop reason for generation.

error str | None

Error message.


Returns a ModelOutput for requesting a tool call.

def for_tool_call(
    model: str,
    tool_name: str,
    tool_arguments: dict[str, Any],
    tool_call_id: str | None = None,
    content: str | None = None,
) -> "ModelOutput"
model str

Model name.

tool_name str

The name of the tool.

tool_arguments dict[str, Any]

The arguments passed to the tool.

tool_call_id str | None

Optional ID for the tool call. Defaults to a random UUID.

content str | None

Optional content to include in the message. Defaults to “tool call for tool {tool_name}”.


Token usage for completion.

class ModelUsage(BaseModel)


input_tokens int

Total input tokens used.

output_tokens int

Total output tokens used.

total_tokens int

Total tokens used.

input_tokens_cache_write int | None

Number of tokens written to the cache.

input_tokens_cache_read int | None

Number of tokens retrieved from the cache.


Reason that the model stopped or failed to generate.

StopReason = Literal[


Choice generated for completion.

class ChatCompletionChoice(BaseModel)


message ChatMessageAssistant

Assistant message.

stop_reason StopReason

Reason that the model stopped generating.

logprobs Logprobs | None




Message in a chat conversation

ChatMessage = Union[
    ChatMessageSystem, ChatMessageUser, ChatMessageAssistant, ChatMessageTool
]


Base class for chat messages.

class ChatMessageBase(BaseModel)


role Literal['system', 'user', 'assistant', 'tool']

Conversation role

content str | list[Content]

Content (simple string or list of content objects)

source Literal['input', 'generate'] | None

Source of message.

text str

Get the text content of this message.

ChatMessage content is very general and can contain either a simple text value or a list of content parts (each of which can either be text or an image). Solvers (e.g. for prompt engineering) often need to interact with chat messages with the assumption that they are a simple string. The text property returns either the plain str content or, if the content is a list of text and images, the text items concatenated together (separated by newlines).
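
The behaviour of the text property can be sketched as follows. TextPart, ImagePart, and message_text are hypothetical names standing in for the content classes; this is an illustration of the described rule and not the library's code.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class TextPart:
    text: str


@dataclass
class ImagePart:
    image: str


def message_text(content: Union[str, list[Union[TextPart, ImagePart]]]) -> str:
    # Plain string content is returned unchanged.
    if isinstance(content, str):
        return content
    # Otherwise keep only the text parts and join them with newlines.
    return "\n".join(part.text for part in content if isinstance(part, TextPart))
```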


System chat message.

class ChatMessageSystem(ChatMessageBase)


role Literal['system']

Conversation role.


User chat message.

class ChatMessageUser(ChatMessageBase)


role Literal['user']

Conversation role.

tool_call_id list[str] | None

ID(s) of tool call(s) this message has the content payload for.


Assistant chat message.

class ChatMessageAssistant(ChatMessageBase)


role Literal['assistant']

Conversation role.

tool_calls list[ToolCall] | None

Tool calls made by the model.

reasoning str | None

Reasoning content.


Tool chat message.

class ChatMessageTool(ChatMessageBase)


role Literal['tool']

Conversation role.

tool_call_id str | None

ID of tool call.

function str | None

Name of function called.

error ToolCallError | None

Error which occurred during tool call.



Content sent to or received from a model.

Content = Union[ContentText, ContentImage, ContentAudio, ContentVideo]


Text content.

class ContentText(BaseModel)


type Literal['text']


text str

Text content.


Image content.

class ContentImage(BaseModel)


type Literal['image']


image str

Either a URL of the image or the base64 encoded image data.

detail Literal['auto', 'low', 'high']

Specifies the detail level of the image.

Currently only supported for OpenAI. Learn more in the Vision guide.


Audio content.

class ContentAudio(BaseModel)


type Literal['audio']


audio str

Audio file path or base64 encoded data URL.

format Literal['wav', 'mp3']

Format of audio data (‘mp3’ or ‘wav’)


Video content.

class ContentVideo(BaseModel)


type Literal['video']


video str

Video file path or base64 encoded data URL.

format Literal['mp4', 'mpeg', 'mov']

Format of video data (‘mp4’, ‘mpeg’, or ‘mov’)



Log probability for a token.

class Logprob(BaseModel)


token str

The predicted token represented as a string.

logprob float

The log probability value of the model for the predicted token.

bytes list[int] | None

The predicted token represented as a byte array (a list of integers).

top_logprobs list[TopLogprob] | None

If the top_logprobs argument is greater than 0, this will contain an ordered list of the top K most likely tokens and their log probabilities.


Log probability information for a completion choice.

class Logprobs(BaseModel)


content list[Logprob]

A list of length num_generated_tokens containing the individual log probabilities for each generated token.


List of the most likely tokens and their log probability, at this token position.

class TopLogprob(BaseModel)


token str

The top-kth token represented as a string.

logprob float

The log probability value of the model for the top-kth token.

bytes list[int] | None

The top-kth token represented as a byte array (a list of integers).
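
Log probabilities are often easier to read once converted. The sketch below assumes natural-log values (the usual convention, but check your provider) and uses hypothetical helper names; it works on plain floats such as the logprob values collected from a list of Logprob entries.

```python
import math


def to_probability(logprob: float) -> float:
    """Convert a natural-log probability back to a probability in [0, 1]."""
    return math.exp(logprob)


def sequence_logprob(token_logprobs: list[float]) -> float:
    """Log probability of the whole sequence: the sum of per-token values."""
    return sum(token_logprobs)


def perplexity(token_logprobs: list[float]) -> float:
    """exp of the negative mean log probability; lower means more confident."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))
```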



The CachePolicy is used to define various criteria that impact how model calls are cached.

expiry: Default “1W”. The expiry time for the cache entry, given as a string such as “12h” for 12 hours or “1W” for a week. This is how long the cache entry is kept; if it is accessed after this point it is cleared. Setting to None will cache indefinitely.

per_epoch: Default True. By default we cache responses separately for different epochs. The general use case is that if there are multiple epochs, we should cache each response separately because scorers will aggregate across epochs. However, sometimes a response can be cached regardless of epoch if the call being made isn’t under test as part of the evaluation. If False, this option allows you to bypass that and cache independently of the epoch.

scopes: A dictionary of additional metadata that should be included in the cache key. This allows for more fine-grained control over the cache key generation.

class CachePolicy



Create a CachePolicy.

def __init__(
    expiry: str | None = "1W",
    per_epoch: bool = True,
    scopes: dict[str, str] = {},
) -> None
expiry str | None


per_epoch bool

Whether to cache responses separately for each epoch.

scopes dict[str, str]
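
To make the expiry string format concrete, here is a sketch of a parser for values such as “12h” and “1W”. The function name parse_expiry and the set of accepted unit letters are assumptions for illustration; only the hour and week forms appear in the description above, and the library's own parser may accept other units.

```python
import re
from datetime import timedelta
from typing import Optional

# Assumed unit letters (matched case-insensitively); hypothetical table.
_UNITS = {"h": "hours", "d": "days", "w": "weeks"}


def parse_expiry(expiry: Optional[str]) -> Optional[timedelta]:
    # None means the entry never expires.
    if expiry is None:
        return None
    match = re.fullmatch(r"(\d+)([A-Za-z])", expiry.strip())
    if match is None or match.group(2).lower() not in _UNITS:
        raise ValueError(f"Unrecognised expiry string: {expiry!r}")
    amount = int(match.group(1))
    unit = _UNITS[match.group(2).lower()]
    return timedelta(**{unit: amount})
```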



Calculate the size of various cached directories and files

If neither subdirs nor files are provided, the entire cache directory will be calculated.

def cache_size(
    subdirs: list[str] = [], files: list[Path] = []
) -> list[tuple[str, int]]
subdirs list[str]

List of folders to filter by, which are generally model names. Empty directories will be ignored.

files list[Path]

List of files to filter by explicitly. Note that the return value groups these by their parent directory.
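
The list[tuple[str, int]] return shape, with files grouped by parent directory, can be illustrated with a stand-alone sketch. size_by_parent is a hypothetical helper written with pathlib; it is not the library's cache_size() and the ordering of the result is an assumption.

```python
from collections import defaultdict
from pathlib import Path


def size_by_parent(files: list[Path]) -> list[tuple[str, int]]:
    """Sum file sizes (in bytes) per parent directory."""
    totals: dict[str, int] = defaultdict(int)
    for file in files:
        # Skip anything that is missing or is not a regular file.
        if file.is_file():
            totals[str(file.parent)] += file.stat().st_size
    # Sorted by directory name for a stable, readable result.
    return sorted(totals.items())
```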


Clear the cache directory.

def cache_clear(model: str = "") -> bool
model str

Model to clear cache for.


Returns a list of all the cached files that have passed their expiry time.

def cache_list_expired(filter_by: list[str] = []) -> list[Path]
filter_by list[str]

Default []. List of model names to filter by. If an empty list, this will search the entire cache.


Delete all expired cache entries.

def cache_prune(files: list[Path] = []) -> None
files list[Path]

List of files to prune. If empty, this will search the entire cache.


Path to cache directory.

def cache_path(model: str = "") -> Path
model str

Path to cache directory for specific model.



Decorator for registering model APIs.

def modelapi(name: str) -> Callable[..., type[ModelAPI]]
name str

Name of API
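
A name-based registration decorator of this kind can be sketched in a few lines. Everything below (sketch_modelapi, _registry, CustomAPI, create) is hypothetical and simplified; in particular, splitting a "provider/model" string on the first slash is an assumption based on names such as "openai/gpt-4o", and the real decorator's behaviour may differ.

```python
from typing import Any, Callable

# Hypothetical registry mapping provider names to provider classes.
_registry: dict[str, type] = {}


def sketch_modelapi(name: str) -> Callable[[type], type]:
    def decorator(cls: type) -> type:
        # Record the class under the given provider name and return it unchanged.
        _registry[name] = cls
        return cls

    return decorator


@sketch_modelapi(name="custom")
class CustomAPI:
    def __init__(self, model_name: str, **model_args: Any) -> None:
        self.model_name = model_name
        # Extra user-supplied arguments are kept for client construction.
        self.model_args = model_args


def create(spec: str, **model_args: Any) -> Any:
    # Assumed spec format: "<provider>/<model name>".
    provider, _, model_name = spec.partition("/")
    return _registry[provider](model_name, **model_args)
```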


Model API provider.

If you are implementing a custom ModelAPI provider, your __init__() method will also receive a **model_args parameter that will carry any custom model_args (or -M arguments from the CLI) specified by the user. You can then pass these on to the appropriate place in your model initialisation code, as many of the built-in providers do with the model_args passed to them.

class ModelAPI(abc.ABC)



Create a model API provider.

def __init__(
    model_name: str,
    base_url: str | None = None,
    api_key: str | None = None,
    api_key_vars: list[str] = [],
    config: GenerateConfig = GenerateConfig(),
) -> None
model_name str

Model name.

base_url str | None

Alternate base URL for model.

api_key str | None

API key for model.

api_key_vars list[str]

Environment variables that may contain keys for this provider (used for override)

config GenerateConfig

Model configuration.


Close method for closing any client allocated for the model.

async def close(self) -> None

Generate output from the model.

async def generate(
    input: list[ChatMessage],
    tools: list[ToolInfo],
    tool_choice: ToolChoice,
    config: GenerateConfig,
) -> ModelOutput | tuple[ModelOutput | Exception, ModelCall]
input list[ChatMessage]

Chat message input.

tools list[ToolInfo]

Tools available for the model to call.

tool_choice ToolChoice

Directives to the model as to which tools to prefer.

config GenerateConfig

Model configuration.


Default max_tokens.

def max_tokens(self) -> int | None

Default max_connections.

def max_connections(self) -> int

Scope for enforcement of max_connections.

def connection_key(self) -> str

Is this exception a rate limit error.

def is_rate_limit(self, ex: BaseException) -> bool
ex BaseException

Exception to check for rate limit.
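
A predicate like is_rate_limit() is typically paired with a retry loop. The sketch below shows one generic pattern (exponential backoff, retrying only on rate limit errors); SketchRateLimitError, call_with_retries, and the backoff schedule are all hypothetical and do not describe how the library itself retries.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class SketchRateLimitError(Exception):
    """Hypothetical error type standing in for a provider's rate limit error."""


def is_rate_limit(ex: BaseException) -> bool:
    return isinstance(ex, SketchRateLimitError)


def call_with_retries(
    fn: Callable[[], T],
    max_retries: int = 5,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    attempt = 0
    while True:
        try:
            return fn()
        except Exception as ex:
            # Re-raise anything that is not a rate limit, or when out of retries.
            if not is_rate_limit(ex) or attempt >= max_retries:
                raise
            # Exponential backoff: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * (2**attempt))
            attempt += 1
```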


Collapse consecutive user messages into a single message.

def collapse_user_messages(self) -> bool

Collapse consecutive assistant messages into a single message.

def collapse_assistant_messages(self) -> bool
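
Collapsing consecutive messages of one role might look like the sketch below. SketchMessage and collapse_consecutive are hypothetical, and joining the contents with a newline is an assumption; the library may combine content differently.

```python
from dataclasses import dataclass


@dataclass
class SketchMessage:
    """Hypothetical minimal chat message (role plus text content)."""

    role: str
    content: str


def collapse_consecutive(
    messages: list[SketchMessage], role: str
) -> list[SketchMessage]:
    collapsed: list[SketchMessage] = []
    for msg in messages:
        if collapsed and msg.role == role and collapsed[-1].role == role:
            # Merge into the previous message of the same role.
            previous = collapsed[-1]
            collapsed[-1] = SketchMessage(role, previous.content + "\n" + msg.content)
        else:
            # Copy so the caller's message objects are never mutated.
            collapsed.append(SketchMessage(msg.role, msg.content))
    return collapsed
```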

Any tool use in a message stream means that tools must be passed.

def tools_required(self) -> bool

Tool results can contain images

def tool_result_images(self) -> bool

Chat message assistant messages can include reasoning.

def has_reasoning_history(self) -> bool