Models

Overview

Inspect has built in support for a variety of language model API providers and can be extended to support arbitrary additions ones. Built-in model API providers, their dependencies, and environment variables required to use them are as follows:

Model API Dependencies Environment Variables
OpenAI pip install openai OPENAI_API_KEY
Anthropic pip install anthropic ANTHROPIC_API_KEY
Google pip install google-generativeai GOOGLE_API_KEY
Mistral pip install mistralai MISTRAL_API_KEY
TogetherAI pip install openai TOGETHER_API_KEY
AWS Bedrock pip install boto3 AWS_ACCESS_KEY_ID, AWS_SECRET_ACCESS_KEY, and AWS_DEFAULT_REGION
Azure AI None required AZURE_API_KEY and INSPECT_EVAL_MODEL_BASE_URL
Cloudflare None required CLOUDFLARE_ACCOUNT_ID and CLOUDFLARE_API_TOKEN
Hugging Face pip install transformers None required
vLLM pip install vllm None required
Ollama pip install openai None required

Note that some providers (Ollama and TogetherAI) support the OpenAI Python package as a client, which is why you need to pip install openai for these providers even though you aren’t actually interacting with the OpenAI service when you use them.

Using Models

To select a model for use in an evaluation task you specify it using a model name. Model names include their API provider and the specific model to use (e.g. openai/gpt-4) Here are the supported providers along with example model names and links to documentation on all available models:

Provider Model Name Docs
OpenAI openai/gpt-3.5-turbo OpenAI Models
Anthropic anthropic/claude-2.1 Anthropic Models
Google google/gemini-1.0-pro Google Models
Mistral mistral/mistral-large-latest Mistral Models
Hugging Face hf/openai-community/gpt2 Hugging Face Models
vLLM vllm/openai-community/gpt2 vLLM Models
Ollama ollama/llama3 Ollama Models
TogetherAI together/lmsys/vicuna-13b-v1.5 TogetherAI Models
AWS Bedrock bedrock/meta.llama2-70b-chat-v1 AWS Bedrock Models
Azure AI azureai/azure-deployment-name Azure AI Models
Cloudflare cf/meta/llama-2-7b-chat-fp16 Cloudflare Models

To select a model for an evaluation, pass it’s name on the command line or use the model argument of the eval() function:

$ inspect eval security_guide --model openai/gpt-3.5-turbo
$ inspect eval security_guide --model anthropic/claude-instant-1.2

Or:

eval(security_guide, model="openai/gpt-3.5-turbo")
eval(security_guide, model="anthropic/claude-instant-1.2")

Alternatively, you can set the INSPECT_EVAL_MODEL environment variable (either in the shell or a .env file) to select a model externally:

INSPECT_EVAL_MODEL=google/gemini-1.0-pro

If are using Azure AI, AWS Bedrock, Hugging Face, or vLLM you should additionally consult the sections below on using the Azure AI, AWS Bedrock, Hugging Face, and vLLM providers to learn more about available models and their usage and authentication requirements.

Model Base URL

Each model also can use a different base URL than the default (e.g. if running through a proxy server). The base URL can be specified with the same prefix as the API_KEY, for example, the following are all valid base URLs:

Provider Environment Variable
OpenAI OPENAI_BASE_URL
Anthropic ANTHROPIC_BASE_URL
Google GOOGLE_BASE_URL
Mistral MISTRAL_BASE_URL
TogetherAI TOGETHER_BASE_URL
Ollama OLLAMA_BASE_URL
AWS Bedrock BEDROCK_BASE_URL
Azure AI AZUREAI_BASE_URL
Cloudflare CLOUDFLARE_BASE_URL

In addition, there are separate base URL variables for running various frontier models on Azure and Bedrock:

Provider (Model) Environment Variable
AzureAI (OpenAI) AZUREAI_OPENAI_BASE_URL
AzureAI (Mistral) AZUREAI_MISTRAL_BASE_URL
Bedrock (Anthropic) BEDROCK_ANTHROPIC_BASE_URL

Generation Config

There are a variety of configuration options that affect the behaviour of model generation. There are options which affect the generated tokens (temperature, top_p, etc.) as well as the connection to model providers (timeout, max_retries, etc.)

You can specify generation options either on the command line or in direct calls to eval(). For example:

$ inspect eval --model openai/gpt-4 --temperature 0.9
$ inspect eval --model google/gemini-1.0-pro --max-connections 20

Or:

eval(security_guide, model="openai/gpt-4", temperature=0.9)
eval(security_guide, model="google/gemini-1.0-pro", max_connections=20)

Use inspect eval --help to learn about all of the available generation config options. |

Connections and Rate Limits

Inspect uses an asynchronous architecture to run task samples in parallel. If your model provider can handle 100 concurrent connections, then Inspect can utilise all of those connections to get the highest possible throughput. The limiting factor on parallelism is therefore not typically local parallelism (e.g. number of cores) but rather what the underlying rate limit is for your interface to the provider.

If you are experiencing rate-limit errors you will need to experiment with the max_connections option to find the optimal value that keeps you under the rate limit (the section on Parallelism includes additional documentation on how to do this). Note that the next section describes how you can set a model-provider specific value for max_connections as well as other generation options.

Model Specific Configuration

In some cases you’ll want to vary generation configuration options by model provider. You can do this by adding a model argument to your task function. You can use the model in a pattern matching statement to condition on different models. For example:

@task
def popularity(model):
    # condition temperature on model
    config = GenerateConfig()
    match model:
        case "gpt" | "gemini":
            config.temperature = 0.9
        case "claude":
            config.temperature = 0.8

    return Task(
        dataset=json_dataset("popularity.jsonl"),
        plan=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=match(),
        config=config,
    )

Provider Notes

This section provides additional documentation on using the Azure AI, AWS Bedrock, Hugging Face, and vLLM providers.

Azure AI

Azure AI provides hosting of models from OpenAI and Mistral as well as a wide variety of other open models. One special requirement for models hosted on Azure is that you need to specify a model base URL. You can do this using the AZUREAI_OPENAI_BASE_URL and AZUREAI_MISTRAL_BASE_URL environment variables or the --model-base-url command line parameter. You can find the model base URL for your specific deployment in the Azure model admin interface.

OpenAI

To use OpenAI models on Azure AI, specify an AZUREAI_OPENAI_API_KEY along with an AZUREAI_OPENAI_BASE_URL. You can then use the normal openai provider, but you’ll need to specify a model name that corresponds to the Azure Deployment Name of your model. For example, if your deployed model name was gpt4-1106-preview-ythre:

$ export AZUREAI_OPENAI_API_KEY=key
$ export AZUREAI_OPENAI_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model openai/gpt4-1106-preview-ythre

The complete list of environment variables (and how they map to the parameters of the AzureOpenAI client) is as follows:

  • api_key from AZUREAI_OPENAI_API_KEY
  • azure_endpoint from AZUREAI_OPENAI_BASE_URL
  • organization from OPENAI_ORG_ID
  • api_version from OPENAI_API_VERSION

Mistral

To use Mistral models on Azure AI, specify an AZURE_MISTRAL_API_KEY along with an INSPECT_EVAL_MODEL_BASE_URL. You can then use the normal mistral provider, but you’ll need to specify a model name that corresponds to the Azure Deployment Name of your model. For example, if your deployment model name was mistral-large-ctwi:

$ export AZUREAI_MISTRAL_API_KEY=key
$ export AZUREAI_MISTRAL_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model mistral/mistral-large-ctwi

Other Models

Azure AI supports many other model types, you can access these using the azureai model provider. As with OpenAI and Mistral, you’ll need to specify an AZUREAI_API_KEY along with an AZUREAI_BASE_URL, as well as use the Azure Deployment Name of your model as the model name. For example:

$ export AZUREAI_API_KEY=key
$ export AZUREAI_BASE_URL=https://your-url-at.azure.com
$ inspect eval --model azureai/llama-2-70b-chat-wnsnw

AWS Bedrock

AWS Bedrock provides hosting of models from Anthropic as well as a wide variety of other open models. Note that all models on AWS Bedrock require that you request model access before using them in a deployment (in some cases access is granted immediately, in other cases it could one or more days).

You should be sure that you have the appropriate AWS credentials before accessing models on Bedrock. Once credentials are configured, use the bedrock provider along with the requisite Bedrock model name. For example, here’s how you would access models from a variety of providers:

$ export AWS_ACCESS_KEY_ID=ACCESSKEY
$ export AWS_SECRET_ACCESS_KEY=SECRETACCESSKEY
$ export AWS_DEFAULT_REGION=us-east-1

$ inspect eval bedrock/anthropic.claude-3-haiku-20240307-v1:0
$ inspect eval bedrock/mistral.mistral-7b-instruct-v0:2
$ inspect eval bedrock/meta.llama2-70b-chat-v1

You aren’t likely to need to, but you can also specify a custom base URL for AWS Bedrock using the BEDROCK_BASE_URL environment variable.

Hugging Face

The Hugging Face provider implements support for local models using the transformers package. You can use any Hugging Face model by specifying it with the hf/ prefix. For example:

$ inspect eval popularity --model hf/openai-community/gpt2

Batching

Concurrency for REST API based models is managed using the max_connections option. The same option is used for transformers inference—up to max_connections calls to generate() will be batched together (note that batches will proceed at a smaller size if no new calls to generate() have occurred in the last 2 seconds).

The default batch size for Hugging Face is 32, but you should tune your max_connections to maximise performance and ensure that batches don’t exceed available GPU memory. The Pipeline Batching section of the transformers documentation is a helpful guide to the ways batch size and performance interact.

Device

The PyTorch cuda device will be used automatically if CUDA is available (as will the Mac OS mps device). If you want to override the device used, use the device model argument. For example:

$ inspect eval popularity --model hf/openai-community/gpt2 -M device=cuda:0

This also works in calls to eval():

eval(popularity, model="hf/openai-community/gpt2", model_args=dict(device="cuda:0"))

Or in a call to get_model()

model = get_model("hf/openai-community/gpt2", device="cuda:0")

Local Models

In addition to using models from the Hugging Face Hub, the Hugging Face provider can also use local model weights and tokenizers (e.g. for a locally fine tuned model). Use hf/local along with the model_path, and (optionally) tokenizer_path arguments to select a local model. For example, from the command line, use the -M flag to pass the model arguments:

$ inspect eval popularity --model hf/local -M model_path=./my-model

Or using the eval() function:

eval(popularity, model="hf/local", model_args=dict(model_path="./my-model"))

Or in a call to get_model()

model = get_model("hf/local", model_path="./my-model")

vLLM

The vLLM provider is currently available only in the development version of Inspect. You can intstall the development version with:

$ pip install git+https://github.com/ukgovernmentbeis/inspect_ai.git

The vllm provider also implements support for Hugging Face models using the vllm package. You can access any Hugging Face model by specifying it with the vllm/ prefix. For example:

$ inspect eval popularity --model vllm/openai-community/gpt2

You can also access models from ModelScope rather than Hugging Face, see the vLLM documentation for details on this.

vLLM is generally much faster than the Hugging Face provider as the library is designed entirely for inference speed whereas the Hugging Face library is more general purpose.

Rather than doing inference locally, you can also connect to a remote vLLM server. See the section below on vLLM Server for details).

Device

The device option is also available for vLLM models, and you can use it to specify the device(s) to run the model on. For example:

$ inspect eval popularity --model vllm/meta-llama/Meta-Llama-3-8B-Instruct -M device='0,1,2,3'

Batching

vLLM automatically handles batching, so you generally don’t have to worry about selecting the optimal batch size. However, you can still use the max_connections option to control the number of concurrent requests which defaults to 32.

Local Models

Similar to the Hugging Face provider, you can also use local models with the vLLM provider. Use vllm/local along with the model_path, and (optionally) tokenizer_path arguments to select a local model. For example, from the command line, use the -M flag to pass the model arguments:

$ inspect eval popularity --model vllm/local -M model_path=./my-model

vLLM Server

vLLM provides an HTTP server that implements OpenAI’s Chat API. To use this with Inspect, use the OpenAI provider rather than the vLLM provider, setting the model base URL to point to the vLLM server rather than OpenAI. For example:

$ export OPENAI_BASE_URL=http://localhost:8080/v1
$ export OPENAI_API_KEY=<your-server-api-key>
$ inspect eval ctf.py --model openai/meta-llama/Meta-Llama-3-8B-Instruct

You can also use the CLI arguments --model-base-url and -M api-key=<your-key> rather than setting environment variables.

See the vLLM documentation on Server Mode for additional details.

Helper Models

Often you’ll want to use language models in the implementation of Solvers and Scorers. Inspect includes some critique solvers and model graded scorers that do this, and you’ll often want to do the same in your own.

Helper models will by default use the same model instance and configuration as the model being evaluated, however this can be overridden using the model argument.

self_critique(model = "google/gemini-1.0-pro")

You can also pass a fully instantiated Model object (for example, if you wanted to override its default configuration) by using the get_model() function. For example, here we’ll provide custom models for both critique and scoring:

from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.model import GenerateConfig, get_model
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate, self_critique

@task
def theory_of_mind():

  critique_model = get_model("google/gemini-1.0-pro")

  grader_model = get_model("anthropic/claude-2.1", config = GenerateConfig(
    temperature = 0.9,
    max_connections = 10
  ))

  return Task(
     dataset=json_dataset("theory_of_mind.jsonl"),
     plan=[
         chain_of_thought(),
         generate(),
         self_critique(model = critique_model)
     ],
     scorer=model_graded_fact(model = grader_model),
  )

Model Args

The section above illustrates passing model specific arguments to local models on the command line, in eval(), and in get_model(). This actually works for all model types, so if there is an additional aspect of a model you want to tweak that isn’t covered by the GenerateConfig, you can use this method to do it. For example, here we specify the transport option for a Google Gemini model:

inspect eval popularity --model google/gemini-1.0-pro -M transport:grpc

The additional model_args are forwarded as follows for the various providers:

Provider Forwarded to
OpenAI AsyncOpenAI
Anthropic AsyncAnthropic
Google genai.configure
Mistral MistralAsyncClient
Hugging Face AutoModelForCausalLM.from_pretrained
vLLM SamplingParams
Ollama AsyncOpenAI
TogetherAI AsyncOpenAI
AzureAI Chat HTTP Post Body
Cloudflare Chat HTTP Post Body

See the OpenAI, Anthropic, Google, Mistral, Hugging Face, vLLM, Ollama, TogetherAI, Azure AI, and Cloudflare provider documentation for more information on the additional options available.

Custom Models

If you want to support another model hosting service or local model source, you can add a custom model API. See the documentation on Model API Extensions for additional details.