Models
Overview
Inspect has built-in support for a variety of language model API providers and can be extended to support arbitrary additional ones. Built-in model API providers, their dependencies, and the environment variables required to use them are as follows:
Model API | Dependencies | Environment Variables |
---|---|---|
OpenAI | pip install openai | OPENAI_API_KEY |
Anthropic | pip install anthropic | ANTHROPIC_API_KEY |
Google | pip install google-generativeai | GOOGLE_API_KEY |
Mistral | pip install mistralai | MISTRAL_API_KEY |
Grok | pip install openai | GROK_API_KEY |
TogetherAI | pip install openai | TOGETHER_API_KEY |
AWS Bedrock | pip install aioboto3 | AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION |
Azure AI | None required | AZUREAI_API_KEY and AZUREAI_BASE_URL |
Groq | pip install groq | GROQ_API_KEY |
Cloudflare | None required | CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN |
Hugging Face | pip install transformers | None required |
vLLM | pip install vllm | None required |
Ollama | pip install openai | None required |
llama-cpp-python | pip install openai | None required |
Vertex | pip install google-cloud-aiplatform | None required |
Using Models
To select a model for use in an evaluation task, you specify it using a model name. Model names include their API provider and the specific model to use (e.g. openai/gpt-4). Here are the supported providers along with example model names and links to documentation on all available models:
Provider | Example | Docs |
---|---|---|
OpenAI | openai/gpt-3.5-turbo | OpenAI Models |
Anthropic | anthropic/claude-2.1 | Anthropic Models |
Google | google/gemini-1.0-pro | Google Models |
Mistral | mistral/mistral-large-latest | Mistral Models |
Grok | grok/grok-beta | Grok Models |
Hugging Face | hf/openai-community/gpt2 | Hugging Face Models |
vLLM | vllm/openai-community/gpt2 | vLLM Models |
Ollama | ollama/llama3 | Ollama Models |
llama-cpp-python | llama-cpp-python/llama3 | llama-cpp-python Models |
TogetherAI | together/google/gemma-7b-it | TogetherAI Models |
AWS Bedrock | bedrock/meta.llama2-70b-chat-v1 | AWS Bedrock Models |
Azure AI | azureai/azure-deployment-name | Azure AI Models |
Vertex | vertex/gemini-1.5-flash | Google Models |
Groq | groq/mixtral-8x7b-32768 | Groq Models |
Cloudflare | cf/meta/llama-2-7b-chat-fp16 | Cloudflare Models |
To select a model for an evaluation, pass its name on the command line or use the model argument of the eval() function:
$ inspect eval security_guide --model openai/gpt-3.5-turbo
$ inspect eval security_guide --model anthropic/claude-instant-1.2
Or:
eval(security_guide, model="openai/gpt-3.5-turbo")
eval(security_guide, model="anthropic/claude-instant-1.2")
Alternatively, you can set the INSPECT_EVAL_MODEL environment variable (either in the shell or a .env file) to select a model externally:
INSPECT_EVAL_MODEL=google/gemini-1.0-pro
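As a rough sketch, the same default can also be set from Python before calling eval() without an explicit model argument (security_guide here refers to the task from the earlier examples):

import os
from inspect_ai import eval

# equivalent to exporting INSPECT_EVAL_MODEL in the shell or adding it to .env
os.environ["INSPECT_EVAL_MODEL"] = "google/gemini-1.0-pro"

# no model argument: the model is resolved from INSPECT_EVAL_MODEL
eval(security_guide)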
If you are using Google, Azure AI, AWS Bedrock, Hugging Face, or vLLM, you should additionally consult the sections below on these providers to learn more about available models and their usage and authentication requirements.
Model Base URL
Each model can also use a different base URL than the default (e.g. if running through a proxy server). The base URL is specified with an environment variable that uses the same prefix as the provider's API_KEY variable; the supported variables are as follows:
Provider | Environment Variable |
---|---|
OpenAI | OPENAI_BASE_URL |
Anthropic | ANTHROPIC_BASE_URL |
Google | GOOGLE_BASE_URL |
Mistral | MISTRAL_BASE_URL |
Grok | GROK_BASE_URL |
TogetherAI | TOGETHER_BASE_URL |
Ollama | OLLAMA_BASE_URL |
llama-cpp-python | LLAMA_CPP_PYTHON_BASE_URL |
AWS Bedrock | BEDROCK_BASE_URL |
Azure AI | AZUREAI_BASE_URL |
Groq | GROQ_BASE_URL |
Cloudflare | CLOUDFLARE_BASE_URL |
In addition, there are separate base URL variables for running various frontier models on Azure and Bedrock:
Provider (Model) | Environment Variable |
---|---|
AzureAI (OpenAI) | AZUREAI_OPENAI_BASE_URL |
AzureAI (Mistral) | AZUREAI_MISTRAL_BASE_URL |
Bedrock (Anthropic) | BEDROCK_ANTHROPIC_BASE_URL |
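Base URLs can also be configured in code. For example, here is a minimal sketch that passes a base_url to get_model() (the proxy URL is a placeholder, not a real endpoint):

from inspect_ai.model import get_model

# route OpenAI requests through a proxy rather than setting OPENAI_BASE_URL
# (the URL below is a placeholder for your own endpoint)
model = get_model("openai/gpt-4", base_url="https://my-proxy.example.com/v1")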
Generation Config
There are a variety of configuration options that affect the behaviour of model generation. Some options affect the generated tokens (temperature, top_p, etc.), while others affect the connection to model providers (timeout, max_retries, etc.). You can specify generation options either on the command line or in direct calls to eval(). For example:
$ inspect eval --model openai/gpt-4 --temperature 0.9
$ inspect eval --model google/gemini-1.0-pro --max-connections 20
Or:
eval(security_guide, model="openai/gpt-4", temperature=0.9)
eval(security_guide, model="google/gemini-1.0-pro", max_connections=20)
Use inspect eval --help to learn about all of the available generation config options.
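Generation options can also be bundled into a GenerateConfig when obtaining a model directly. For example, a minimal sketch (the option values are illustrative only):

from inspect_ai.model import GenerateConfig, get_model

# collect generation options into a single config object
model = get_model(
    "openai/gpt-4",
    config=GenerateConfig(temperature=0.9, top_p=0.95, max_retries=5)
)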
Connections and Rate Limits
Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.
If you are experiencing rate-limit errors you will need to experiment with the max_connections
option to find the optimal value that keeps you under the rate limit (the section on Parallelism includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for max_connections
as well as other generation options.
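For example, here is a rough sketch of the kind of adjustment to try when you hit rate limits (the particular values are illustrative rather than recommendations, and security_guide refers to the task from the earlier examples):

from inspect_ai import eval

# fewer concurrent connections, more retries, and a longer per-request
# timeout (in seconds) to stay under the provider's rate limit
eval(
    security_guide,
    model="openai/gpt-4",
    max_connections=10,
    max_retries=10,
    timeout=120
)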
Provider Notes
This section provides additional documentation on using the Azure AI, AWS Bedrock, Google, Vertex, Hugging Face, and vLLM providers.
Azure AI
Azure AI provides hosting of models from OpenAI and Mistral as well as a wide variety of other open models. One special requirement for models hosted on Azure is that you need to specify a model base URL. You can do this using the AZUREAI_OPENAI_BASE_URL
and AZUREAI_MISTRAL_BASE_URL
environment variables or the --model-base-url
command line parameter. You can find the model base URL for your specific deployment in the Azure model admin interface.
OpenAI
To use OpenAI models on Azure AI, specify an AZUREAI_OPENAI_API_KEY
along with an AZUREAI_OPENAI_BASE_URL
. You can then use the normal openai
provider, but you’ll need to specify a model name that corresponds to the Azure Deployment Name of your model. For example, if your deployed model name was gpt4-1106-preview-ythre:
$ export AZUREAI_OPENAI_API_KEY=key
$ export AZUREAI_OPENAI_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model openai/gpt4-1106-preview-ythre
The complete list of environment variables (and how they map to the parameters of the AzureOpenAI
client) is as follows:
- api_key from AZUREAI_OPENAI_API_KEY
- azure_endpoint from AZUREAI_OPENAI_BASE_URL
- organization from OPENAI_ORG_ID
- api_version from OPENAI_API_VERSION
The OpenAI provider will choose whether to make a connection to the main OpenAI service or Azure based on the presence of environment variables. If the AZUREAI_OPENAI_API_KEY
variable is defined Azure will be used, otherwise OpenAI will be used (via the OPENAI_API_KEY
). You can override this default behaviour using the azure
model argument. For example:
$ inspect eval eval.py -M azure=true # force azure
$ inspect eval eval.py -M azure=false # force no azure
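The azure model argument can also be passed in code; for example, a sketch using get_model() with the example deployment name from above:

from inspect_ai.model import get_model

# force the OpenAI provider to target Azure even if OPENAI_API_KEY is defined
# (gpt4-1106-preview-ythre is the example Azure deployment name from above)
model = get_model("openai/gpt4-1106-preview-ythre", azure=True)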
Mistral
To use Mistral models on Azure AI, specify an AZUREAI_MISTRAL_API_KEY along with an AZUREAI_MISTRAL_BASE_URL
. You can then use the normal mistral
provider, but you’ll need to specify a model name that corresponds to the Azure Deployment Name of your model. For example, if your deployment model name was mistral-large-ctwi:
$ export AZUREAI_MISTRAL_API_KEY=key
$ export AZUREAI_MISTRAL_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model mistral/mistral-large-ctwi
Other Models
Azure AI supports many other model types; you can access these using the azureai
model provider. As with OpenAI and Mistral, you’ll need to specify an AZUREAI_API_KEY
along with an AZUREAI_BASE_URL
, as well as use the Azure Deployment Name of your model as the model name. For example:
$ export AZUREAI_API_KEY=key
$ export AZUREAI_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model azureai/llama-2-70b-chat-wnsnw
AWS Bedrock
AWS Bedrock provides hosting of models from Anthropic as well as a wide variety of other open models. Note that all models on AWS Bedrock require that you request model access before using them in a deployment (in some cases access is granted immediately, in other cases it could take one or more days).
You should be sure that you have the appropriate AWS credentials before accessing models on Bedrock. Once credentials are configured, use the bedrock
provider along with the requisite Bedrock model name. For example, here’s how you would access models from a variety of providers:
$ export AWS_ACCESS_KEY_ID=ACCESSKEY
$ export AWS_SECRET_ACCESS_KEY=SECRETACCESSKEY
$ export AWS_DEFAULT_REGION=us-east-1
$ inspect eval bedrock/anthropic.claude-3-haiku-20240307-v1:0
$ inspect eval bedrock/mistral.mistral-7b-instruct-v0:2
$ inspect eval bedrock/meta.llama2-70b-chat-v1
You aren’t likely to need to, but you can also specify a custom base URL for AWS Bedrock using the BEDROCK_BASE_URL
environment variable.
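Bedrock models can also be selected from Python. A minimal sketch, assuming AWS credentials are configured as above and using the security_guide task from the earlier examples:

from inspect_ai import eval

# evaluate against an Anthropic model hosted on AWS Bedrock (credentials
# are read from the AWS environment variables shown above)
eval(security_guide, model="bedrock/anthropic.claude-3-haiku-20240307-v1:0")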
Google

Google models make available safety settings that you can adjust to determine what sorts of requests will be handled (or refused) by the model. The four categories of safety settings are as follows:
Category | Description |
---|---|
sexually_explicit | Contains references to sexual acts or other lewd content. |
hate_speech | Content that is rude, disrespectful, or profane. |
harassment | Negative or harmful comments targeting identity and/or protected attributes. |
dangerous_content | Promotes, facilitates, or encourages harmful acts. |
For each category, the following block thresholds are available:
Block Threshold | Description |
---|---|
none | Always show regardless of probability of unsafe content |
only_high | Block when high probability of unsafe content |
medium_and_above | Block when medium or high probability of unsafe content |
low_and_above | Block when low, medium or high probability of unsafe content |
By default, Inspect sets all four categories to none
(enabling all content). You can override these defaults by using the safety_settings
model argument. For example:
safety_settings = dict(
    dangerous_content = "medium_and_above",
    hate_speech = "low_and_above"
)

eval(
    "eval.py",
    model_args=dict(safety_settings=safety_settings)
)
This can also be done from the command line:
$ inspect eval eval.py -M "safety_settings={'hate_speech': 'low_and_above'}"
Google Vertex AI
Vertex AI is a different service to Google AI; see a comparison matrix here. Make sure you are using the appropriate model provider.
The core libraries for Vertex AI interact directly with Google Cloud Platform so this provider doesn’t use the standard BASE_URL
/API_KEY
approach that others do. Consequently, you don't need to set these environment variables; instead, you should configure your environment appropriately. Additional configuration can be passed in through the vertex_init_args
parameter if required:
$ inspect eval eval.py -M "vertex_init_args={'project': 'my-project', location: 'eu-west2-b'}"
Vertex AI provides the same safety_settings
outlined in the Google provider.
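The same options can be passed via model_args in a call to eval(). For example, a sketch using the placeholder project and location values from the command above:

from inspect_ai import eval

# pass Vertex initialisation options and safety settings as model args
# ('my-project' and 'eu-west2-b' are placeholder values)
eval(
    "eval.py",
    model="vertex/gemini-1.5-flash",
    model_args=dict(
        vertex_init_args=dict(project="my-project", location="eu-west2-b"),
        safety_settings=dict(hate_speech="low_and_above")
    )
)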
Hugging Face
The Hugging Face provider implements support for local models using the transformers package. You can use any Hugging Face model by specifying it with the hf/
prefix. For example:
$ inspect eval popularity --model hf/openai-community/gpt2
Batching
Concurrency for REST API based models is managed using the max_connections
option. The same option is used for transformers
inference—up to max_connections
calls to generate()
will be batched together (note that batches will proceed at a smaller size if no new calls to generate()
have occurred in the last 2 seconds).
The default batch size for Hugging Face is 32, but you should tune your max_connections
to maximise performance and ensure that batches don’t exceed available GPU memory. The Pipeline Batching section of the transformers documentation is a helpful guide to the ways batch size and performance interact.
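For example, a sketch of tuning the batch size from Python (popularity is the task from the example above; 16 is an arbitrary illustrative value):

from inspect_ai import eval

# batch up to 16 concurrent generate() calls through the transformers model
eval(popularity, model="hf/openai-community/gpt2", max_connections=16)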
Device
The PyTorch cuda
device will be used automatically if CUDA is available (as will the Mac OS mps
device). If you want to override the device used, use the device
model argument. For example:
$ inspect eval popularity --model hf/openai-community/gpt2 -M device=cuda:0
This also works in calls to eval()
:
eval(popularity, model="hf/openai-community/gpt2", model_args=dict(device="cuda:0"))
Or in a call to get_model():
model = get_model("hf/openai-community/gpt2", device="cuda:0")
Local Models
In addition to using models from the Hugging Face Hub, the Hugging Face provider can also use local model weights and tokenizers (e.g. for a locally fine tuned model). Use hf/local
along with the model_path
, and (optionally) tokenizer_path
arguments to select a local model. For example, from the command line, use the -M
flag to pass the model arguments:
$ inspect eval popularity --model hf/local -M model_path=./my-model
Or using the eval()
function:
eval(popularity, model="hf/local", model_args=dict(model_path="./my-model"))
Or in a call to get_model():
model = get_model("hf/local", model_path="./my-model")
vLLM
The vllm
provider also implements support for Hugging Face models using the vllm package. You can access any Hugging Face model by specifying it with the vllm/
prefix. For example:
$ inspect eval popularity --model vllm/openai-community/gpt2
You can also access models from ModelScope rather than Hugging Face; see the vLLM documentation for details.
vLLM is generally much faster than the Hugging Face provider as the library is designed entirely for inference speed whereas the Hugging Face library is more general purpose.
Rather than doing inference locally, you can also connect to a remote vLLM server (see the section below on vLLM Server for details).
Device
The device
option is also available for vLLM models, and you can use it to specify the device(s) to run the model on. For example:
$ inspect eval popularity --model vllm/meta-llama/Meta-Llama-3-8B-Instruct -M device='0,1,2,3'
Batching
vLLM automatically handles batching, so you generally don’t have to worry about selecting the optimal batch size. However, you can still use the max_connections
option to control the number of concurrent requests, which defaults to 32.
Local Models
Similar to the Hugging Face provider, you can also use local models with the vLLM provider. Use vllm/local
along with the model_path
, and (optionally) tokenizer_path
arguments to select a local model. For example, from the command line, use the -M
flag to pass the model arguments:
$ inspect eval popularity --model vllm/local -M model_path=./my-model
vLLM Server
vLLM provides an HTTP server that implements OpenAI’s Chat API. To use this with Inspect, use the OpenAI provider rather than the vLLM provider, setting the model base URL to point to the vLLM server rather than OpenAI. For example:
$ export OPENAI_BASE_URL=http://localhost:8080/v1
$ export OPENAI_API_KEY=<your-server-api-key>
$ inspect eval ctf.py --model openai/meta-llama/Meta-Llama-3-8B-Instruct
You can also use the CLI arguments --model-base-url
and -M api-key=<your-key>
rather than setting environment variables.
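The equivalent configuration can also be expressed in a call to eval(); for example, a sketch where the server URL and API key are placeholders for your own vLLM server:

import os
from inspect_ai import eval

# placeholder API key for your own vLLM server
os.environ["OPENAI_API_KEY"] = "<your-server-api-key>"

# use the OpenAI provider, but point it at the local vLLM server
eval(
    "ctf.py",
    model="openai/meta-llama/Meta-Llama-3-8B-Instruct",
    model_base_url="http://localhost:8080/v1"
)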
See the vLLM documentation on Server Mode for additional details.
Helper Models
Often you'll want to use language models in the implementation of Solvers and Scorers. Inspect includes some critique solvers and model-graded scorers that do this, and you'll often want to do the same in your own solvers and scorers.
Helper models will by default use the same model instance and configuration as the model being evaluated; however, this can be overridden using the model argument.
= "google/gemini-1.0-pro") self_critique(model
You can also pass a fully instantiated Model
object (for example, if you wanted to override its default configuration) by using the get_model()
function. For example, here we’ll provide custom models for both critique and scoring:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.model import GenerateConfig, get_model
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate, self_critique
@task
def theory_of_mind():
= get_model("google/gemini-1.0-pro")
critique_model
= get_model("anthropic/claude-2.1", config = GenerateConfig(
grader_model = 0.9,
temperature = 10
max_connections
))
return Task(
=json_dataset("theory_of_mind.jsonl"),
dataset=[
solver
chain_of_thought(),
generate(),= critique_model)
self_critique(model
],=model_graded_fact(model = grader_model),
scorer )
Model Args
The section above illustrates passing model-specific arguments to local models on the command line, in eval()
, and in get_model()
. This actually works for all model types, so if there is an additional aspect of a model you want to tweak that isn’t covered by the GenerateConfig
, you can use this method to do it. For example, here we specify the transport
option for a Google Gemini model:
$ inspect eval popularity --model google/gemini-1.0-pro -M transport=grpc
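The equivalent from Python is a model_args dict passed to eval() (or get_model()); for example, a sketch mirroring the command above:

from inspect_ai import eval

# forward the transport option to the underlying Google client configuration
eval(popularity, model="google/gemini-1.0-pro", model_args=dict(transport="grpc"))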
The additional model_args
are forwarded as follows for the various providers:
Provider | Forwarded to |
---|---|
OpenAI | AsyncOpenAI |
Anthropic | AsyncAnthropic |
Google | genai.configure |
Mistral | Mistral |
Hugging Face | AutoModelForCausalLM.from_pretrained |
vLLM | SamplingParams |
Ollama | AsyncOpenAI |
llama-cpp-python | AsyncOpenAI |
TogetherAI | AsyncOpenAI |
Groq | AsyncGroq |
AzureAI | Chat HTTP Post Body |
Cloudflare | Chat HTTP Post Body |
See the documentation for the requisite model provider for more information on the additional model options that can be passed to these functions and classes.
Custom Models
If you want to support another model hosting service or local model source, you can add a custom model API. See the documentation on Model API Extensions for additional details.