Eval Sets
Overview
Most of the examples in the documentation run a single evaluation task by either passing a script name to inspect eval or by calling the eval() function directly. While this is a good workflow for developing single evaluations, you’ll often want to run several evaluations together as a set. This might be for the purpose of exploring hyperparameters, evaluating on multiple models at one time, or running a full benchmark suite.
The inspect eval-set command and eval_set() function provide several facilities for running sets of evaluations, including:
- Automatically retrying failed evaluations (with a configurable retry strategy).
- Re-using samples from failed tasks so that work is not repeated during retries.
- Cleaning up log files from failed runs after a task is successfully completed.
- The ability to re-run the command multiple times, with work picking up where the last invocation left off.
Below we’ll cover the various tools and techniques available for creating eval sets.
Running Eval Sets
Run a set of evaluations using the inspect eval-set command or eval_set() function. For example:
$ inspect eval-set mmlu.py mathematics.py \
--model openai/gpt-4o,anthropic/claude-3-5-sonnet-20240620 \
--log-dir logs-run-42
Or equivalently:
from inspect_ai import eval_set

eval_set(
    tasks=["mmlu.py", "mathematics.py"],
    model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet-20240620"],
    log_dir="logs-run-42"
)
Note that in both cases we specified a custom log directory. This is a requirement for eval sets, as it provides a scope where completed work can be tracked.
Dynamic Tasks
In the above examples tasks are read from the filesystem. It is also possible to dynamically create a set of tasks and pass them to the eval_set() function. For example:
from inspect_ai import Task, eval_set, task
from inspect_ai.dataset import csv_dataset

@task
def create_task(dataset: str):
    return Task(dataset=csv_dataset(dataset))

# each call records its arguments, which distinguishes the two tasks
mmlu = create_task("mmlu.csv")
maths = create_task("maths.csv")

eval_set(
    [mmlu, maths],
    model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet-20240620"],
    log_dir="logs-run-42"
)
Notice that we create our tasks from a function decorated with @task. Doing this is a critical requirement because it enables Inspect to capture the arguments to create_task() and use them to distinguish the two tasks (in turn used to pair tasks to log files for retries).
There are two fundamental requirements for dynamic tasks used with eval_set():
- They are created using an @task function as described above.
- Their parameters use ordinary Python types (like str, int, list, etc.) as opposed to custom objects (which are hard to serialise consistently).
Note that you can pass a solver to an @task function, so long as it was created by a function decorated with @solver.
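To see why ordinary Python types matter, consider the following minimal sketch. It is not Inspect's implementation: task_key() is a hypothetical helper that builds an identifier from a task name and its arguments, which is one plausible way arguments could be used to tell tasks apart. CustomConfig is likewise a made-up stand-in for a custom object.

```python
import json


def task_key(name: str, args: dict) -> str:
    """Hypothetical helper: build a stable identifier from a task name and args."""
    # sort_keys makes the result independent of argument order
    return f"{name}:{json.dumps(args, sort_keys=True)}"


class CustomConfig:
    """Stand-in for a custom object passed as a task parameter."""


mmlu_key = task_key("create_task", {"dataset": "mmlu.csv"})
maths_key = task_key("create_task", {"dataset": "maths.csv"})

# A custom object has no canonical serialised form, so no key can be built
try:
    task_key("create_task", {"dataset": CustomConfig()})
    custom_ok = True
except TypeError:
    custom_ok = False
```

Here the two string arguments yield distinct, repeatable keys, while the custom object cannot be serialised at all without extra work.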
Retry Options
There are a number of options that control the retry behaviour of eval sets:
Option | Description |
---|---|
--retry-attempts | Maximum number of retry attempts (defaults to 10). |
--retry-wait | Time in seconds to wait between attempts, increased exponentially (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.). |
--retry-connections | Reduce max connections at this rate with each retry (defaults to 0.5). |
--no-retry-cleanup | Do not clean up failed log files after retries. |
For example, here we specify a base wait time of 120 seconds:
$ inspect eval-set mmlu.py mathematics.py \
--log-dir logs-run-42 \
--retry-wait 120
Or with the eval_set() function:
eval_set(
    ["mmlu.py", "mathematics.py"],
    log_dir="logs-run-42",
    retry_wait=120
)
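The arithmetic behind these options can be sketched as follows. This is an illustration under stated assumptions rather than Inspect's actual scheduling code: retry_schedule() is a hypothetical helper, the wait is assumed to double on each attempt (matching the 30, 60, 120, 240 sequence described above), and the rounding down and floor of one connection are assumptions made for the example.

```python
def retry_schedule(
    retry_attempts: int = 10,
    retry_wait: int = 30,
    retry_connections: float = 0.5,
    max_connections: int = 10,
) -> list[tuple[int, int]]:
    """Return (wait_seconds, max_connections) for each retry attempt."""
    schedule = []
    connections = max_connections
    for attempt in range(retry_attempts):
        # wait doubles with every attempt: base, 2x base, 4x base, ...
        wait = retry_wait * 2**attempt
        # assumed: connections are scaled by the rate and never drop below 1
        connections = max(1, int(connections * retry_connections))
        schedule.append((wait, connections))
    return schedule


default_schedule = retry_schedule()
slow_schedule = retry_schedule(retry_wait=120)
```

With the defaults, the first four entries have waits of 30, 60, 120, and 240 seconds; with a base wait of 120 the sequence starts at 120, 240, 480.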
Publishing
You can bundle a standalone version of the log viewer for an eval set using the bundling options:
Option | Description |
---|---|
--bundle-dir | Directory to write standalone log viewer files to. |
--bundle-overwrite | Overwrite existing bundle directory (defaults to not overwriting). |
The bundle directory can then be deployed to any static web server (GitHub Pages, S3 buckets, or Netlify, for example) to provide a standalone version of the log viewer for the eval set. See the section on Log Viewer Publishing for additional details.
Logging Context
We mentioned above that you need to specify a dedicated log directory for each eval set that you run. This requirement exists for a couple of reasons:
- The log directory provides a durable record of which tasks are completed so that you can run the eval set as many times as is required to finish all of the work. For example, you might get halfway through a run and then encounter provider rate limit errors. You’ll want to be able to restart the eval set later (potentially even many hours later) and the dedicated log directory enables you to do this.
- It enables you to enumerate and analyse all of the eval logs in the suite as a cohesive whole (rather than having them intermixed with the results of other runs).
Once all of the tasks in an eval set are complete, re-running inspect eval-set or eval_set() on the same log directory will be a no-op as there is no more work to do. At this point you can use the list_eval_logs() function to collect up logs for analysis:
results = list_eval_logs("logs-run-42")
If you are calling the eval_set() function it will return a tuple of bool and list[EvalLog], where the bool indicates whether all tasks were completed:
success, logs = eval_set(...)

if success:
    # analyse logs
    ...
else:
    # will need to run eval_set again
    ...
Note that eval_set() retries quite aggressively (up to 10 times by default), so success=False reflects the case where, even after all of the retries, the tasks were still not completed (this might occur due to a service outage or perhaps bugs in eval code raising runtime errors).
Sample Preservation
When retrying a log file, Inspect will attempt to re-use completed samples from the original task. This can result in substantial time and cost savings compared to starting over from the beginning.
IDs and Shuffling
An important constraint on the ability to re-use completed samples is matching them up correctly with samples in the new task. To do this, Inspect requires stable unique identifiers for each sample. This can be achieved in one of two ways:
- Samples can have an explicit id field which contains the unique identifier; or
- You can rely on Inspect’s assignment of an auto-incrementing id for samples, however this will not work correctly if your dataset is shuffled. Inspect will log a warning and not re-use samples if it detects that the dataset.shuffle() method was called, however if you are shuffling by some other means this automatic safeguard won’t be applied.
If dataset shuffling is important to your evaluation and you want to preserve samples for retried tasks, then you should include an explicit id field in your dataset.
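The difference between the two approaches can be illustrated with a small, self-contained sketch. It does not use Inspect itself: the samples are plain dictionaries, with_auto_ids() is a hypothetical helper that numbers samples by position (starting at 1, an assumption for the example), and reversing the list stands in for any reordering of the dataset.

```python
samples = [
    {"id": "q-add", "input": "What is 2 + 2?"},
    {"id": "q-capital", "input": "What is the capital of France?"},
    {"id": "q-planet", "input": "Which planet is the largest?"},
    {"id": "q-water", "input": "What is the chemical formula of water?"},
]


def with_auto_ids(items: list[dict]) -> dict[int, str]:
    """Number samples by position, as a stand-in for auto-incrementing ids."""
    return {i: item["input"] for i, item in enumerate(items, start=1)}


def with_explicit_ids(items: list[dict]) -> dict[str, str]:
    """Key samples by their explicit id field."""
    return {item["id"]: item["input"] for item in items}


reordered = list(reversed(samples))  # stand-in for a shuffle

# positional ids now point at different samples than on the first run
mismatched = [
    i
    for i, text in with_auto_ids(samples).items()
    if with_auto_ids(reordered)[i] != text
]

# explicit ids still identify the same samples regardless of order
explicit_stable = with_explicit_ids(samples) == with_explicit_ids(reordered)
```

After reordering, every positional id refers to a different sample, so completed results could not be matched safely, whereas the explicit ids are unaffected.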
Max Samples
Another consideration is max_samples, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently fewer total recoverable samples in the case of an interrupted task.
By default, Inspect sets the value of max_samples to max_connections + 1 (note that it would rarely make sense to set it lower than max_connections). The default max_connections is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large max_connections (e.g. 100 max_connections for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.
If your task involves tool calls and/or sandboxes, then you will likely want to set max_samples to greater than max_connections, as your samples will sometimes be calling the model (using up concurrent connections) and sometimes be executing code in the sandbox (using up concurrent subprocess calls). While running tasks you can see the utilization of connections and subprocesses in realtime and tune your max_samples accordingly.
Task Enumeration
When running eval sets, tasks can be specified either individually (as in the examples above) or can be enumerated from the filesystem. You can organise tasks in many different ways; below we cover some of the more common options.
Multiple Tasks in a File
The simplest possible organisation would be multiple tasks defined in a single source file. Consider this source file (ctf.py) with two tasks in it:
@task
def jeopardy():
    return Task(
        ...
    )

@task
def attack_defense():
    return Task(
        ...
    )
We can run both of these tasks with the following command (note that for this and the remaining examples we’ll assume that you have set an INSPECT_EVAL_MODEL environment variable so you don’t need to pass the --model argument explicitly):
$ inspect eval-set ctf.py --log-dir logs-run-42
Or equivalently:
eval_set("ctf.py", log_dir="logs-run-42")
Note that during development and debugging we can also run the tasks individually:
$ inspect eval ctf.py@jeopardy
Multiple Tasks in a Directory
Next, let’s consider multiple tasks in a directory. Imagine you have the following directory structure, where jeopardy.py and attack_defense.py each have one or more @task functions defined:
security/
  import.py
  analyze.py
  jeopardy.py
  attack_defense.py
Here is the listing of all the tasks in the suite:
$ inspect list tasks security
jeopardy.py@crypto
jeopardy.py@decompile
jeopardy.py@packet
jeopardy.py@heap_trouble
attack_defense.py@saar
attack_defense.py@bank
attack_defense.py@voting
attack_defense.py@dns
You can run this eval set as follows:
$ inspect eval-set security --log-dir logs-security-02-09-24
Note that some of the files in this directory don’t contain evals (e.g. import.py and analyze.py). These files are not read or executed by inspect eval-set (which only executes files that contain @task definitions).
If we wanted to run more than one directory we could do so by just passing multiple directory names. For example:
$ inspect eval-set security persuasion --log-dir logs-suite-42
Or equivalently:
eval_set(["security", "persuasion"], log_dir="logs-suite-42")
Listing and Filtering
Recursive Listings
Note that directories or expanded globs of directory names passed to eval-set are recursively scanned for tasks. So you could have a very deep hierarchy of directories, with a mix of task and non-task scripts, and the eval-set command or function will discover all of the tasks automatically.
There are some rules for how recursive directory scanning works that you should keep in mind:
- Source files and directories that start with . or _ are not scanned for tasks.
- Directories named env, venv, and tests are not scanned for tasks.
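As a rough illustration of these two rules (and not Inspect's own scanning code), the sketch below applies them to a list of relative paths. The is_scanned() helper and the example paths are hypothetical, and each path is assumed to refer to a file, so every component except the last is treated as a directory.

```python
from pathlib import PurePosixPath

# directory names that are skipped, per the rules above
SKIPPED_DIR_NAMES = {"env", "venv", "tests"}


def is_scanned(relative_path: str) -> bool:
    """Return True if a file at this relative path would be scanned for tasks."""
    parts = PurePosixPath(relative_path).parts
    directories = parts[:-1]  # assumed: the final component is a file name
    # rule 1: any file or directory starting with "." or "_" is skipped
    if any(name.startswith((".", "_")) for name in parts):
        return False
    # rule 2: anything beneath env, venv, or tests directories is skipped
    return not any(name in SKIPPED_DIR_NAMES for name in directories)


candidates = [
    "security/jeopardy.py",
    "security/_helpers.py",
    "security/.cache/ctf.py",
    "security/tests/test_ctf.py",
    "security/venv/lib/site.py",
    "security/web/attack_defense.py",
]
scanned = [path for path in candidates if is_scanned(path)]
```

Only security/jeopardy.py and security/web/attack_defense.py pass both rules in this example.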
Attributes and Filters
Eval suites will sometimes be defined purely by directory structure, but there will be cross-cutting concerns that are also used to filter what is run. For example, you might want to define some tasks as part of a “light” suite that is less expensive and time consuming to run. This is supported by adding attributes to task decorators. For example:
@task(light=True)
def jeopardy():
    return Task(
        ...
    )
Given this, you could list all of the light tasks in security and pass them to eval_set() as follows:
light_suite = list_tasks(
    "security",
    filter=lambda task: task.attribs.get("light") is True
)
success, logs = eval_set(light_suite, log_dir="logs-light-42")
Note that the inspect list tasks command can also be used to enumerate tasks in plain text or JSON (use one or more -F options if you want to filter tasks):
$ inspect list tasks security
$ inspect list tasks security --json
$ inspect list tasks security --json -F light=true
You can feed the results of inspect list tasks into inspect eval-set using xargs as follows:
$ inspect list tasks security | xargs \
inspect eval-set --log-dir logs-security-42
One important thing to keep in mind when using attributes to filter tasks is that neither inspect list tasks nor the underlying list_tasks() function executes code when scanning for tasks (they parse it instead). This means that if you want to use a task attribute in a filtering expression it needs to be a constant (rather than the result of a function call). For example:
# this is valid for filtering expressions
@task(light=True)
def jeopardy():
    ...

# this is NOT valid for filtering expressions
@task(light=light_enabled("ctf"))
def jeopardy():
    ...
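The reason constants work and function calls do not can be shown with Python's standard ast module. The sketch below is only an illustration of parsing without executing, not a description of how Inspect reads attributes; constant_task_attribs() is a hypothetical helper, and the second task is renamed to attack_defense so the two results can be told apart.

```python
import ast

SOURCE = '''
@task(light=True)
def jeopardy():
    ...

@task(light=light_enabled("ctf"))
def attack_defense():
    ...
'''


def constant_task_attribs(source: str) -> dict[str, dict[str, object]]:
    """Collect @task keyword attributes that are literal constants."""
    attribs: dict[str, dict[str, object]] = {}
    for node in ast.parse(source).body:  # parsing never runs the code
        if not isinstance(node, ast.FunctionDef):
            continue
        for decorator in node.decorator_list:
            is_task_call = (
                isinstance(decorator, ast.Call)
                and getattr(decorator.func, "id", None) == "task"
            )
            if not is_task_call:
                continue
            found = {}
            for keyword in decorator.keywords:
                if keyword.arg is None:  # skip **kwargs style arguments
                    continue
                try:
                    found[keyword.arg] = ast.literal_eval(keyword.value)
                except ValueError:
                    pass  # not a literal, so its value is unknown without execution
            attribs[node.name] = found
    return attribs


attribs = constant_task_attribs(SOURCE)
```

The constant light=True is recovered for jeopardy, while the value of light_enabled("ctf") cannot be known without running the code, so attack_defense ends up with no usable attributes.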