Errors and Limits

Overview

When developing more complex evaluations, it's not uncommon to encounter error conditions. These might occur due to a bug in a solver or scorer, an unreliable or overloaded API, or a failure to communicate with a sandbox environment. It's also possible to end up with evals that don't terminate properly because models continue running in a tool calling loop even though they are "stuck" and very unlikely to make additional progress.

This article covers various techniques for dealing with unexpected errors and setting limits on evaluation tasks and samples. Topics covered include:

  1. Retrying failed evaluations (while preserving the samples completed during the initial failed run).
  2. Establishing a threshold (count or percentage) of sample errors to tolerate before failing an evaluation.
  3. Setting a maximum number of messages in a sample before forcing the model to give up.

Errors and Retries

When an evaluation task fails due to an error or is otherwise interrupted (e.g. by a Ctrl+C), an evaluation log is still written. In many cases errors are transient (e.g. due to network connectivity or a rate limit) and can be subsequently retried.
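
If you want to check programmatically whether a log ended in an error before retrying, you can read it back in. The following is a minimal sketch assuming the read_eval_log() function from inspect_ai.log and the status and error fields on the resulting log:

from inspect_ai.log import read_eval_log

# read the log written for the failed or interrupted task
log = read_eval_log("logs/2024-05-29T12-38-43_math_Gprr29Mv.json")

# status is e.g. "success", "error", or "cancelled"
if log.status == "error":
    print(log.error)  # details of the failure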

For these cases, Inspect includes an eval-retry command and eval_retry() function that you can use to resume tasks interrupted by errors (including preserving samples already completed within the original task). For example, if you had a failing task with log file logs/2024-05-29T12-38-43_math_Gprr29Mv.json, you could retry it from the shell with:

$ inspect eval-retry logs/2024-05-29T12-38-43_math_Gprr29Mv.json

Or from Python with:

eval_retry("logs/2024-05-29T12-38-43_math_43_math_Gprr29Mv.json")

Note that retry only works for tasks that are created from @task decorated functions (if a Task is created dynamically outside of an @task function, Inspect does not know how to reconstruct it for the retry).
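
For example, a task defined as in the first sketch below can be retried, whereas a Task constructed dynamically at the top level of a script cannot. This is an illustrative sketch (it assumes the built-in security_guide example dataset and the match() scorer):

from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# retry-friendly: the Task is created inside an @task function,
# so Inspect can reconstruct it from the log file
@task
def security_guide():
    return Task(
        dataset=example_dataset("security_guide"),
        plan=[generate()],
        scorer=match(),
    )

# not retry-friendly: constructed dynamically outside of @task,
# so eval_retry() cannot reconstruct it
dynamic_task = Task(
    dataset=example_dataset("security_guide"),
    plan=[generate()],
    scorer=match(),
)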

Note also that eval_retry() does not overwrite the previous log file, but rather creates a new one (preserving the task_id from the original file).

Here’s an example of retrying a failed eval with a lower number of max_connections (the theory being that too many concurrent connections may have caused a rate limit error):

from inspect_ai import eval, eval_retry

# retry with fewer connections if the original eval failed
log = eval(my_task)[0]
if log.status != "success":
    eval_retry(log, max_connections=3)
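
The same retry can also be done from the shell (assuming the eval-retry command accepts a --max-connections option mirroring the eval_retry() argument):

$ inspect eval-retry --max-connections 3 logs/2024-05-29T12-38-43_math_Gprr29Mv.json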

Failure Threshold

In some cases you might wish to tolerate some number of errors without failing the evaluation. This might be during development when errors are more commonplace, or could be to deal with a particularly unreliable API used in the evaluation. Add the fail_on_error option to your Task definition to establish this threshold. For example, here we indicate that we’ll tolerate errors in up to 10% of the total sample count before failing:

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        fail_on_error=0.1,
        scorer=includes(),
        sandbox="docker",
    )

When this occurs, failed samples are not scored, and a warning indicating that some samples failed is both printed in the terminal and shown in Inspect View.
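
You can also see which samples errored by reading the log back in. This sketch assumes that each sample in the log records an error field when it fails (the log file path is hypothetical):

from inspect_ai.log import read_eval_log

log = read_eval_log("logs/2024-06-02T10-15-00_intercode-ctf_aBcDeF12.json")  # hypothetical path

# samples that recorded an error (these are not scored)
errored = [sample for sample in (log.samples or []) if sample.error]
print(f"{len(errored)} of {len(log.samples or [])} samples failed")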

You can specify fail_on_error as a boolean (turning the behaviour on and off entirely), as a number between 0 and 1 (indicating a proportion of failures to tolerate), or as a number greater than 1 (indicating a count of failures to tolerate):

Value                  Behaviour
fail_on_error=True     Fail eval immediately on sample errors (default).
fail_on_error=False    Never fail eval on sample errors.
fail_on_error=0.1      Fail eval if more than 10% of total samples have errors.
fail_on_error=5        Fail eval if more than 5 samples have errors.

While fail_on_error is typically specified at the Task level, you can also override the task setting when calling eval() or inspect eval from the CLI. For example:

eval("intercode_ctf.py", fail_on_error=False)

You might choose to do this if you want to tolerate a certain proportion of errors during development but want to ensure there are never errors when running in production.

Messages Limit

In open-ended model conversations (for example, an agent evaluation with tool usage) it's possible that a model will get "stuck" attempting to perform a task with no realistic prospect of completing it. Sometimes models will "give up" but sometimes they won't! For this type of evaluation it's normally a good idea to set a limit on total messages. For example, here's an evaluation task that sets a limit of 30 messages:

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
            system_message("system.txt"),
            use_tools([bash(timeout=120)]),
            generate(),
        ],
        max_messages=30,
        scorer=includes(),
        sandbox="docker",
    )

The max_messages=30 argument sets a limit of 30 total messages in the conversation before the model is forced to give up. At that point, whatever output happens to be in the TaskState will be scored (presumably leading to a score of 0, or 'incorrect').
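
As with fail_on_error, the limit can also be set (or overridden) when running the eval. The following sketch assumes that eval() accepts a max_messages argument and that the CLI has a matching --max-messages option:

eval("intercode_ctf.py", max_messages=50)

Or from the CLI:

$ inspect eval intercode_ctf.py --max-messages 50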