Eval Sets

Overview

Most of the examples in the documentation run a single evaluation task by either passing a script name to inspect eval or by calling the eval() function directly. While this is a good workflow for developing single evaluations, you’ll often want to run several evaluations together as a set. This might be for the purpose of exploring hyperparameters, evaluating on multiple models at one time, or running a full benchmark suite.

The inspect eval-set command and eval_set() function and provide several facilities for running sets of evaluations, including:

  1. Automatically retrying failed evaluations (with a configurable retry strategy)
  2. Re-using samples from failed tasks so that work is not repeated during retries.
  3. Cleaning up log files from failed runs after a task is successfully completed.
  4. The ability to re-run the command multiple times, with work picking up where the last invocation left off.

Below we’ll cover the various tools and techniques available for creating eval sets.

Running Eval Sets

Run a set of evaluations using the inspect eval-set command or eval_set() function. For example:

$ inspect eval-set mmlu.py mathematics.py \
   --model openai/gpt-4o,anthropic/claude-3-5-sonnet-20240620 \
   --log-dir logs-run-42

Or equivalently:

from inspect_ai import eval_set

eval_set(
   tasks=["mmlu.py", "mathematics.py"],
   model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet-20240620"],
   log_dir="logs-run-42"      
)

Note that in both cases we specified a custom log directory—this is actually a requirement for eval sets, as it provides a scope where completed work can be tracked.

Dynamic Tasks

In the above examples tasks are ready from the filesystem. It is also possible to dynamically create a set of tasks and pass them to the eval_set() function. For example:

from inspect_ai import eval_set

@task
def create_task(dataset: str):
  return Task(dataset=csv_dataset(dataset))

mmlu = create_task("mmlu.csv")
maths = create_task("maths.csv")

eval_set(
   [mmlu, maths],
   model=["openai/gpt-4o", "anthropic/claude-3-5-sonnet-20240620"],
   log_dir="logs-run-42"      
)

Notice that we create our tasks from a function decorated with @task. Doing this is a critical requirement because it enables Inspect to capture the arguments to create_task() and use that to distinguish the two tasks (in turn used to pair tasks to log files for retries).

There are two fundamental requirements for dynamic tasks used with eval_set():

  1. They are created using an @task function as described above.
  2. Their parameters use ordinary Python types (like str, int, list, etc.) as opposed to custom objects (which are hard to serialise consistently).

Note that you can pass a solver to an @task function, so long as it was created by a function decorated with @solver.

Retry Options

There are a number of options that control the retry behaviour of eval sets:

Option Description
--retry-attempts Maximum number of retry attempts (defaults to 10)
--retry-wait Time to wait between attempts, increased exponentially. (defaults to 30, resulting in waits of 30, 60, 120, 240, etc.)
--retry-connections Reduce max connections at this rate with each retry (defaults to 0.5)
--no-retry-cleanup Do not cleanup failed log files after retries.

For example, here we specify a base wait time of 120 seconds:

inspect eval-set mmlu.py mathematics.py \
   --log-dir logs-run-42
   --retry-wait 120

Or with the eval_set() function:

eval_set(
   ["mmlu.py", "mathematics.py"],
   log_dir="logs-run-42",
   retry_wait=120
)

Publishing

You can bundle a standalone version of the log viewer for an eval set using the bundling options:

Option Description
--bundle-dir Directory to write standalone log viewer files to.
--bundle-overwrite Overwrite existing bundle directory (defaults to not overwriting).

The bundle directory can then be deployed to any static web server (GitHub Pages, S3 buckets, or Netlify, for example) to provide a standalone version of the log viewer for the eval set. See the section on Log Viewer Publishing for additional details.

Logging Context

We mentioned above that you need to specify a dedicated log directory for each eval set that you run. This requirement exists for a couple of reasons:

  1. The log directory provides a durable record of which tasks are completed so that you can run the eval set as many times as is required to finish all of the work. For example, you might get halfway through a run and then encounter provider rate limit errors. You’ll want to be able to restart the eval set later (potentially even many hours later) and the dedicated log directory enables you to do this.

  2. This enables you to enumerate and analyse all of the eval logs in the suite as a cohesive whole (rather than having them intermixed with the results of other runs).

Once all of the tasks in an eval set are complete, re-running inspect eval-set or eval_set() on the same log directory will be a no-op as there is no more work to do. At this point you can use the list_eval_logs() function to collect up logs for analysis:

results = list_eval_logs("logs-run-42")

If you are calling the eval_set() function it will return a tuple of bool and list[EvalLog], where the bool indicates whether all tasks were completed:

success, logs = eval_set(...)
if success:
    # analyse logs
else:
    # will need to run eval_set again

Note that eval_set() does by default do quite a bit of retrying (up to 10 times by default) so success=False reflects the case where even after all of the retries the tasks were still not completed (this might occur due to a service outage or perhaps bugs in eval code raising runtime errors).

Sample Preservation

When retrying a log file, Inspect will attempt to re-use completed samples from the original task. This can result in substantial time and cost savings compared to starting over from the beginning.

IDs and Shuffling

An important constraint on the ability to re-use completed samples is matching them up correctly with samples in the new task. To do this, Inspect requires stable unique identifiers for each sample. This can be achieved in 1 of 2 ways:

  1. Samples can have an explicit id field which contains the unique identifier; or

  2. You can rely on Inspect’s assignment of an auto-incrementing id for samples, however this will not work correctly if your dataset is shuffled. Inspect will log a warning and not re-use samples if it detects that the dataset.shuffle() method was called, however if you are shuffling by some other means this automatic safeguard won’t be applied.

If dataset shuffling is important to your evaluation and you want to preserve samples for retried tasks, then you should include an explicit id field in your dataset.

Max Samples

Another consideration is max_samples, which is the maximum number of samples to run concurrently within a task. Larger numbers of concurrent samples will result in higher throughput, but will also result in completed samples being written less frequently to the log file, and consequently less total recovable samples in the case of an interrupted task.

By default, Inspect sets the value of max_samples to max_connections + 1 (note that it would rarely make sense to set it lower than max_connections). The default max_connections is 10, which will typically result in samples being written to the log frequently. On the other hand, setting a very large max_connections (e.g. 100 max_connections for a dataset with 100 samples) may result in very few recoverable samples in the case of an interruption.

If your task involves tool calls and/or sandboxes, then you will likely want to set max_samples to greater than max_connections, as your samples will sometimes be calling the model (using up concurrent connections) and sometimes be executing code in the sandbox (using up concurrent subprocess calls). While running tasks you can see the utilization of connections and subprocesses in realtime and tune your max_samples accordingly.

Task Enumeration

When running eval sets tasks can be specified either individually (as in the examples above) or can be enumerated from the filesystem. You can organise tasks in many different ways, below we cover some of the more common options.

Multiple Tasks in a File

The simplest possible organisation would be multiple tasks defined in a single source file. Consider this source file (ctf.py) with two tasks in it:

@task
def jeopardy():
  return Task(
    ...
  )

@task
def attack_defense():
  return Task(
    ...
  )

We can run both of these tasks with the following command (note for this and the remainder of examples we’ll assume that you have let an INSPECT_EVAL_MODEL environment variable so you don’t need to pass the --model argument explicitly):

$ inspect eval-set ctf.py --log-dir logs-run-42

Or equivalently:

eval_set("ctf.py", log_dir="logs-run-42")

Note that during development and debugging we can also run the tasks individually:

$ inspect eval ctf.py@jeopardy

Multiple Tasks in a Directory

Next, let’s consider a multiple tasks in a directory. Imagine you have the following directory structure, where jeopardy.py and attack_defense.py each have one or more @task functions defined:

security/
  import.py
  analyze.py
  jeopardy.py
  attack_defense.py

Here is the listing of all the tasks in the suite:

$ inspect list tasks security
jeopardy.py@crypto
jeopardy.py@decompile
jeopardy.py@packet
jeopardy.py@heap_trouble
attack_defense.py@saar
attack_defense.py@bank
attack_defense.py@voting
attack_defense.py@dns

You can run this eval set as follows:

$ inspect eval-set security --log-dir logs-security-02-09-24

Note that some of the files in this directory don’t contain evals (e.g. import.py and analyze.py). These files are not read or executed by inspect eval-set (which only executes files that contain @task definitions).

If we wanted to run more than one directory we could do so by just passing multiple directory names. For example:

$ inspect eval-set security persuasion --log-dir logs-suite-42

Or equivalently:

eval_set(["security", "persuasion"], log_dir="logs-suite-42")

Listing and Filtering

Recursive Listings

Note that directories or expanded globs of directory names passed to eval-set are recursively scanned for tasks. So you could have a very deep hierarchy of directories, with a mix of task and non task scripts, and the eval-set command or function will discover all of the tasks automatically.

There are some rules for how recursive directory scanning works that you should keep in mind:

  1. Sources files and directories that start with . or _ are not scanned for tasks.
  2. Directories named env, venv, and tests are not scanned for tasks.

Attributes and Filters

Eval suites will sometimes be defined purely by directory structure, but there will be cross-cutting concerns that are also used to filter what is run. For example, you might want to define some tasks as part of a “light” suite that is less expensive and time consuming to run. This is supported by adding attributes to task decorators. For example:

@task(light=True)
def jeopardy():
  return Task(
    ...
  )

Given this, you could list all of the light tasks in security and pass them to eval() as follows:

light_suite = list_tasks(
  "security", 
  filter = lambda task: task.attribs.get("light") is True
)
logs = eval_set(light_suite, log_dir="logs-light-42")

Note that the inspect list tasks command can also be used to enumerate tasks in plain text or JSON (use one or more -F options if you want to filter tasks):

$ inspect list tasks security
$ inspect list tasks security --json
$ inspect list tasks security --json -F light=true

You can feed the results of inspect list tasks into inspect eval-set using xargs as follows:

$ inspect list tasks security | xargs \
   inspect eval-set --log-dir logs-security-42

One important thing to keep in mind when using attributes to filter tasks is that both inspect list tasks (and the underlying list_tasks() function) do not execute code when scanning for tasks (rather they parse it). This means that if you want to use a task attribute in a filtering expression it needs to be a constant (rather than the result of function call). For example:

# this is valid for filtering expressions
@task(light=True)
def jeopardy():
  ...

# this is NOT valid for filtering expressions
@task(light=light_enabled("ctf"))
def jeopardy():
  ...