Using Models
Overview
Inspect has support for a wide variety of language model APIs and can be extended to support arbitrary additional ones. Support for the following providers is built into Inspect:
| Lab APIs      | OpenAI, Anthropic, Google, Grok, Mistral     |
| Cloud APIs    | AWS Bedrock, Azure AI, Vertex AI             |
| Open (Hosted) | Groq, Together AI, Cloudflare, Goodfire      |
| Open (Local)  | Hugging Face, vLLM, Ollama, llama-cpp-python |
If the provider you are using is not listed above, you may still be able to use it if:
It is available via OpenRouter (see the docs on using OpenRouter with Inspect).
It provides an OpenAI compatible API endpoint. In this scenario, use the Inspect OpenAI interface and set the OPENAI_BASE_URL environment variable to the appropriate value for your provider, as shown in the sketch below.
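For example, a minimal sketch (the endpoint URL, API key, and model name here are placeholders for your provider's actual values):

export OPENAI_BASE_URL=https://api.my-provider.example/v1
export OPENAI_API_KEY=my-provider-key
inspect eval arc.py --model openai/my-model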
You can also create Model API Extensions to add model providers using their native interface.
Below we’ll describe various ways to specify and provide options to models in Inspect evaluations. Review this first, then see the provider-specific sections for additional usage details and available options.
Selecting a Model
To select a model for an evaluation, pass its name on the command line or use the model argument of the eval() function:
inspect eval arc.py --model openai/gpt-4o-mini
inspect eval arc.py --model anthropic/claude-3-5-sonnet-latest
Or:
eval("arc.py", model="openai/gpt-4o-mini")
eval("arc.py", model="anthropic/claude-3-5-sonnet-latest")
Alternatively, you can set the INSPECT_EVAL_MODEL environment variable (either in the shell or a .env file) to select a model externally:
INSPECT_EVAL_MODEL=google/gemini-1.5-pro
Generation Config
There are a variety of configuration options that affect the behaviour of model generation: some affect the generated tokens (temperature, top_p, etc.), while others control the connection to model providers (timeout, max_retries, etc.).
You can specify generation options either on the command line or in direct calls to eval(). For example:
inspect eval arc.py --model openai/gpt-4 --temperature 0.9
inspect eval arc.py --model google/gemini-1.0-pro --max-connections 20
Or:
eval("arc.py", model="openai/gpt-4", temperature=0.9)
eval("arc.py", model="google/gemini-1.0-pro", max_connections=20)
Use inspect eval --help to learn about all of the available generation config options.
Model Args
If there is an additional aspect of a model you want to tweak that isn’t covered by the GenerateConfig, you can use model args to pass additional arguments to model clients. For example, here we specify the transport option for a Google Gemini model:
inspect eval arc.py --model google/gemini-1.0-pro -M transport:grpc
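The same can be expressed in Python via the model_args parameter of eval() (a minimal sketch, assuming the same transport option supported by the Google provider):

eval(
    "arc.py",
    model="google/gemini-1.0-pro",
    model_args=dict(transport="grpc"),
)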
See the documentation for the requisite model provider for information on how model args are passed through to model clients.
Max Connections
Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.
By default, Inspect uses a max_connections value of 10. You can increase this consistent with your account limits. If you are experiencing rate-limit errors you will need to experiment with the max_connections option to find the optimal value that keeps you under the rate limit (the section on Parallelism includes additional documentation on how to do this).
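For example, a minimal sketch of raising the limit (the value 50 is purely illustrative; use whatever your account supports):

inspect eval arc.py --model openai/gpt-4o-mini --max-connections 50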
Model API
The --model which is set for an evaluation is automatically used by the generate() solver, as well as for other solvers and scorers built to use the currently evaluated model. If you are implementing a Solver or Scorer and want to use the currently evaluated model, call get_model() with no arguments:
from inspect_ai.model import get_model

model = get_model()
response = await model.generate("Say hello")
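For instance, inside a custom solver this might look as follows (a sketch only; the solver name is hypothetical, while the solver decorator, TaskState, and Generate come from inspect_ai.solver):

from inspect_ai.model import get_model
from inspect_ai.solver import Generate, TaskState, solver

@solver
def hello_solver():
    async def solve(state: TaskState, generate: Generate):
        # get_model() resolves to the model passed via --model
        model = get_model()
        state.output = await model.generate(state.input_text)
        return state
    return solve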
If you want to use other models in your solvers and scorers, call get_model() with an alternate model name, along with optional generation config. For example:
from inspect_ai.model import GenerateConfig, get_model

model = get_model("openai/gpt-4o")

model = get_model(
    "openai/gpt-4o",
    config=GenerateConfig(temperature=0.9),
)
You can also pass provider specific parameters as additional arguments to get_model(). For example:
= get_model("hf/openai-community/gpt2", device="cuda:0") model
Model Caching
By default, calls to get_model() are memoized, meaning that calls with identical parameters resolve to a cached version of the model. You can disable this by passing memoize=False:
= get_model("openai/gpt-4o", memoize=False) model
Finally, if you prefer to create and fully close model clients at their place of use, you can use the async context manager built into the Model class. For example:
async with get_model("openai/gpt-4o") as model:
    response = await model.generate("Say hello")
Learning More
Providers covers usage details and available options for the various supported providers.
Caching explains how to cache model output to reduce the number of API calls made.
Multimodal describes the APIs available for creating multimodal evaluations (including images, audio, and video).
Reasoning documents the additional options and data available for reasoning models.