Inspect evaluations have a large number of options available for logging, tuning, diagnostics, and model interactions. These options fall into roughly two categories:
1. Options that you want to set on a more durable basis (for a project or session).
2. Options that you want to tweak per-eval to accommodate particular scenarios.
For the former, we recommend you specify these options in a .env file within your project directory, which is covered in the section below. See Eval Options for details on all available options.
.env Files
While we can include all required options on the inspect eval command line, it’s generally easier to use environment variables for commonly repeated options. To facilitate this, the inspect CLI will automatically read and process .env files located in the current working directory (also searching in parent directories if a .env file is not found in the working directory). This is done using the python-dotenv package.
For example, here’s a .env file that makes API keys for several providers available and sets some defaults for a working session:
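The values below are placeholders; substitute your own keys and preferred defaults (the INSPECT_EVAL_ variables follow the prefix rule described next, and INSPECT_LOG_DIR is covered in the note on relative paths below).

```
# provider API keys
OPENAI_API_KEY=your-openai-api-key
ANTHROPIC_API_KEY=your-anthropic-api-key
GOOGLE_API_KEY=your-google-api-key

# session defaults (INSPECT_EVAL_ prefix maps to eval options)
INSPECT_EVAL_MODEL=openai/gpt-4o
INSPECT_EVAL_MAX_CONNECTIONS=20
INSPECT_EVAL_MAX_RETRIES=5

# log directory (resolved relative to this .env file)
INSPECT_LOG_DIR=./logs
```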
All command line options can also be set via environment variables using the INSPECT_EVAL_ prefix (for example, --max-connections becomes INSPECT_EVAL_MAX_CONNECTIONS).
Note that .env files are searched for in parent directories, so if you run an Inspect command from a subdirectory of a parent that has a .env file, it will still be read and resolved. If you assign a relative path to INSPECT_LOG_DIR in a .env file, its location will always be resolved relative to that .env file (rather than relative to whatever your current working directory is when you run inspect eval).
.env files should never be checked into version control, as they nearly always contain either secret API keys or machine-specific paths. A best practice is to check a .env.example file into version control that provides an outline (e.g. keys only, not values) of the variables required by the current project.
Specifying Options
Below are sections for the various categories of options supported by inspect eval. Note that all of these options are also available for the eval() function and can be set via environment variables. For example:
| CLI | eval() | Environment |
|-------------|-------------|--------------------------|
| --model | model | INSPECT_EVAL_MODEL |
| --sample-id | sample_id | INSPECT_EVAL_SAMPLE_ID |
| --limit | limit | INSPECT_EVAL_LIMIT |
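As a sketch of how the same options look in each form (the task file and model below are placeholders):

```python
from inspect_ai import eval

# equivalent to: inspect eval ctf.py --model openai/gpt-4o --limit 10
# or to setting INSPECT_EVAL_MODEL and INSPECT_EVAL_LIMIT in the environment
eval("ctf.py", model="openai/gpt-4o", limit=10)
```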
Model Provider
--model
Model used to evaluate tasks.
--model-base-url
Base URL for model API
--model-config
Model specific arguments (JSON or YAML file)
-M
Model specific arguments (key=value).
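For instance, a provider-specific model argument can be passed with -M; in this sketch the task file is a placeholder and device is a HuggingFace provider argument (available arguments vary by provider):

```bash
inspect eval ctf.py --model hf/openai-community/gpt2 -M device=cuda:0
```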
Model Generation
--max-tokens
The maximum number of tokens that can be generated in the completion (default is model specific)
--system-message
Override the default system message.
--temperature
What sampling temperature to use, between 0 and 2. Higher values like 0.8 will make the output more random, while lower values like 0.2 will make it more focused and deterministic.
--top-p
An alternative to sampling with temperature, called nucleus sampling, where the model considers the results of the tokens with top_p probability mass.
--top-k
Randomly sample the next word from the top_k most likely next words. Anthropic, Google, HuggingFace, and vLLM only.
--frequency-penalty
Number between -2.0 and 2.0. Positive values penalize new tokens based on their existing frequency in the text so far, decreasing the model’s likelihood to repeat the same line verbatim. OpenAI, Google, Grok, Groq, llama-cpp-python and vLLM only.
--presence-penalty
Number between -2.0 and 2.0. Positive values penalize new tokens based on whether they appear in the text so far, increasing the model’s likelihood to talk about new topics. OpenAI, Google, Grok, Groq, llama-cpp-python and vLLM only.
--logit-bias
Map token IDs to an associated bias value from -100 to 100 (e.g. “42=10,43=-10”). OpenAI and Grok only.
--seed
Random seed. OpenAI, Google, Groq, Mistral, HuggingFace, and vLLM only.
--stop-seqs
Sequences where the API will stop generating further tokens. The returned text will not contain the stop sequence.
--num-choices
How many chat completion choices to generate for each input message. OpenAI, Grok, Google, TogetherAI, and vLLM only.
--best-of
Generates best_of completions server-side and returns the ‘best’ (the one with the highest log probability per token). OpenAI only.
--log-probs
Return log probabilities of the output tokens. OpenAI, Grok, TogetherAI, HuggingFace, llama-cpp-python, and vLLM only.
--top-logprobs
Number of most likely tokens (0-20) to return at each token position, each with an associated log probability. OpenAI, Grok, TogetherAI, HuggingFace, and vLLM only.
--cache-prompt
Values: auto, true, or false. Cache prompt prefix (Anthropic only). Defaults to “auto”, which will enable caching for requests with tools.
--reasoning-effort
Values: low, medium, or high. Constrains effort on reasoning for reasoning models. OpenAI o1 models only.
--parallel-tool-calls
Whether to enable calling multiple functions during tool use (defaults to True). OpenAI and Groq only.
--max-tool-output
Maximum size of tool output (in bytes). Defaults to 16 * 1024.
--internal-tools
Whether to automatically map tools to model internal implementations (e.g. ‘computer’ for Anthropic).
--max-retries
Maximum number of times to retry request (defaults to 5)
--timeout
Request timeout (in seconds).
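Several of these options can be combined on a single run, for example (the task file and values here are placeholders):

```bash
inspect eval ctf.py \
  --temperature 0.7 \
  --max-tokens 1024 \
  --top-p 0.9 \
  --max-retries 10 \
  --timeout 120
```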
Tasks and Solvers
--task-config
Task arguments (JSON or YAML file)
-T
Task arguments (key=value)
--solver
Solver to execute (overrides task default solver)
--solver-config
Solver arguments (JSON or YAML file)
-S
Solver arguments (key=value)
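Task and solver arguments can be passed individually or collected in config files; a hypothetical example (the task file, argument names, and solver reference are placeholders):

```bash
inspect eval ctf.py \
  -T dataset=challenge-set -T shuffle=true \
  --solver my_custom_solver \
  -S max_attempts=3
```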
Sample Selection
--limit
Limit samples to evaluate by specifying a maximum (e.g. 10) or range (e.g. 10-20)
--sample-id
Evaluate a specific sample (e.g. 44) or list of samples (e.g. 44,63,91)
--epochs
Number of times to repeat each sample (defaults to 1)
--epochs-reducer
Method for reducing per-epoch sample scores into a single score. Built-in reducers include mean, median, mode, max, at_least_{n}, and pass_at_{k}.
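For example, to run a subset of samples four times each and reduce scores with pass_at_2 (the task file is a placeholder):

```bash
inspect eval ctf.py --limit 10-20 --epochs 4 --epochs-reducer pass_at_2
```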
Parallelism
--max-connections
Maximum number of concurrent connections to Model provider (defaults to 10)
--max-samples
Maximum number of samples to run in parallel (default is --max-connections)
--max-subprocesses
Maximum number of subprocesses to run in parallel (default is os.cpu_count())
--max-sandboxes
Maximum number of sandboxes (per-provider) to run in parallel (default is 2 * os.cpu_count())
--max-tasks
Maximum number of tasks to run in parallel (default is 1)
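These are often set per-session via environment variables using the INSPECT_EVAL_ prefix described above, for example (the task file and values are illustrative):

```bash
export INSPECT_EVAL_MAX_CONNECTIONS=20
export INSPECT_EVAL_MAX_SAMPLES=40
inspect eval ctf.py
```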
Errors and Limits
--fail-on-error
Threshold of sample errors to tolerate (by default, evals fail when any error occurs). A value between 0 and 1 sets a proportion; a value greater than 1 sets a count.
--no-fail-on-error
Do not fail the eval if errors occur within samples (instead, continue running other samples)
--message-limit
Limit on total messages used for each sample.
--token-limit
Limit on total tokens used for each sample.
--time-limit
Limit on total execution time for each sample.
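Following the CLI-to-eval() name mapping shown earlier, the same controls can also be set from Python; the task file and limit values below are placeholders:

```python
from inspect_ai import eval

# tolerate errors in up to 10% of samples; cap per-sample tokens and time
eval("ctf.py", fail_on_error=0.1, token_limit=500_000, time_limit=900)
```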
Eval Logs
--log-dir
Directory for log files (defaults to ./logs)
--no-log-samples
Do not log sample details.
--no-log-images
Do not log images and other media.
--log-buffer
Number of samples to buffer before writing log file. If not specified, an appropriate default for the format and filesystem is chosen (10 for most cases, 100 for JSON logs on remote filesystems).
--log-format
Values: eval, json. Format for writing log files (defaults to eval).
--log-level
Python logger level for console. Values: debug, trace, http, info, warning, error, critical (defaults to warning)
--log-level-transcript
Python logger level for eval log transcript (values same as --log-level, defaults to info).
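For example, to write JSON-format logs to a custom directory with more verbose console output (the task file and paths are placeholders):

```bash
inspect eval ctf.py --log-dir ./logs-exp1 --log-format json --log-level info
```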
Scoring
--no-score
Do not score model output (use the inspect score command to score output later)
--no-score-display
Do not display realtime scoring information.
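For instance, you might defer scoring to a later pass with the inspect score command; the task file and log file path below are placeholders, and the exact arguments to inspect score may differ:

```bash
inspect eval ctf.py --no-score
inspect score ./logs/my-eval-log.eval
```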
Sandboxes
--sandbox
Sandbox environment type (with optional config file). e.g. ‘docker’ or ‘docker:compose.yml’
--no-sandbox-cleanup
Do not clean up sandbox environments after the task completes
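For example, to run with a Docker sandbox defined by a compose file and keep the environment around afterwards for inspection (the task and compose file are placeholders):

```bash
inspect eval ctf.py --sandbox docker:compose.yml --no-sandbox-cleanup
```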
Debugging
--debug
Wait for the debugger to attach before running
--debug-port
Port number for debugger
--debug-errors
Raise task errors (rather than logging them) so they can be debugged.
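For example, to pause until a debugger attaches on a chosen port, or to have sample errors raised immediately rather than logged (the task file and port are placeholders):

```bash
inspect eval ctf.py --debug --debug-port 5678
inspect eval ctf.py --debug-errors
```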