Tasks
Overview
This article documents both basic and advanced use of Inspect tasks, which are the fundamental unit of integration for datasets, solvers, and scorers. The following topics are explored:
- Task Basics describes the core components and options of tasks.
- Parameters covers adding parameters to tasks to make them flexible and adaptable.
- Solvers describes how to create tasks that can be used with many different solvers.
- Task Reuse documents how to flexibly derive new tasks from existing task definitions.
- Exploratory provides guidance on doing exploratory task and solver development.
Task Basics
Tasks provide a recipe for an evaluation, consisting minimally of a dataset, a solver, and a scorer (and possibly other options). Tasks are returned from a function decorated with @task. For example:
```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import chain_of_thought, generate

@task
def security_guide():
    return Task(
        dataset=json_dataset("security_guide.json"),
        solver=[chain_of_thought(), generate()],
        scorer=model_graded_fact()
    )
```
For convenience, tasks always define a default solver. That said, it is often desirable to design tasks that can work with any solver so that you can experiment with different strategies. The Solvers section below goes into depth on how to create tasks that can be flexibly used with any solver.
Task Options
While many tasks can be defined with only a dataset, solver, and scorer, there are lots of other useful Task
options. We won’t describe these options in depth here, but rather provide a list along with links to other sections of the documentation that cover their usage:
| Option | Description | Docs |
|---|---|---|
| `epochs` | Epochs to run for each dataset sample. | Epochs |
| `config` | Config for model generation. | Generate Config |
| `setup` | Setup solver(s) to run prior to the main solver. | Sample Setup |
| `sandbox` | Sandbox configuration for untrusted code execution. | Sandboxing |
| `approval` | Approval policy for tool calls. | Tool Approval |
| `metrics` | Metrics to use in place of scorer metrics. | Scoring Metrics |
| `fail_on_error` | Failure tolerance for samples. | Sample Failure |
| `message_limit`, `token_limit`, `time_limit` | Limits to apply to sample execution. | Sample Limits |
| `name`, `version`, `metadata` | Eval log attributes for task. | Eval Logs |
By and large, you don’t need to worry about these options until you want to use the features they link to.
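To give a flavour of how these fit together, here is a minimal sketch of a task that sets a few of them (the task name, dataset file, and limit values are purely illustrative):

```python
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import generate

@task
def options_demo():
    return Task(
        dataset=json_dataset("challenges.json"),  # hypothetical dataset file
        solver=generate(),
        scorer=includes(),
        epochs=3,              # run each sample 3 times
        sandbox="docker",      # run untrusted code in a Docker sandbox
        message_limit=30,      # cap messages per sample
        fail_on_error=0.1,     # tolerate up to 10% of samples failing
    )
```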
Parameters
Task parameters make it easy to run variants of your task without changing its source code. Task parameters are simply the arguments to your @task
decorated function. For example, here we provide parameters (and default values) for system and grader prompts, as well as the grader model:
security.py

```python
from inspect_ai import Task, task
from inspect_ai.dataset import example_dataset
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message

@task
def security_guide(
    system="devops.txt",
    grader="expert.txt",
    grader_model="openai/gpt-4o"
):
    return Task(
        dataset=example_dataset("security_guide"),
        solver=[system_message(system), generate()],
        scorer=model_graded_fact(
            template=grader, model=grader_model
        )
    )
```
Let’s say we had an alternate system prompt in a file named "researcher.txt"
. We could run the task with this prompt as follows:
```bash
inspect eval security.py -T system="researcher.txt"
```
The -T
CLI flag is used to specify parameter values. You can include multiple -T
flags. For example:
```bash
inspect eval security.py \
  -T system="researcher.txt" -T grader="hacker.txt"
```
If you have several task parameters you want to specify together, you can put them in a YAML or JSON file and use the --task-config CLI option. For example:
config.yaml

```yaml
system: "researcher.txt"
grader: "hacker.txt"
```
Reference this file from the CLI with:
```bash
inspect eval security.py --task-config=config.yaml
```
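The same parameters can also be set when driving Inspect from Python: call the task function with the values you want and pass the resulting task to eval(). A minimal sketch (the model choice is illustrative):

```python
from inspect_ai import eval

from security import security_guide  # the task module shown above

# equivalent to: inspect eval security.py -T system="researcher.txt" -T grader="hacker.txt"
eval(
    security_guide(system="researcher.txt", grader="hacker.txt"),
    model="openai/gpt-4o",  # illustrative model choice
)
```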
Solvers
While tasks always include a default solver, you can also vary the solver to explore other strategies and elicitation techniques. This section covers best practices for creating solver-independent tasks.
Solver Parameter
If you want to make your task work with a variety of solvers, the first thing to do is add a solver
parameter to your task function. For example, let’s start with a CTF challenge task where the solver
is hard-coded:
```python
from inspect_ai import Task, task
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python
from inspect_ai.scorer import includes

@task
def ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([
                bash(timeout=180),
                python(timeout=180)
            ]),
            generate()
        ],
        sandbox="docker",
        scorer=includes()
    )
```
This task uses the most naive solver possible (a simple tool use loop with no additional elicitation). That might be okay for initial task development, but we’ll likely want to try lots of different strategies. We start by breaking the solver into its own function and adding an alternative solver that uses the basic_agent():
```python
from inspect_ai import Task, task
from inspect_ai.solver import (
    Solver, basic_agent, chain, generate, solver, use_tools
)
from inspect_ai.tool import bash, python
from inspect_ai.scorer import includes

@solver
def ctf_tool_loop():
    return chain([
        use_tools([
            bash(timeout=180),
            python(timeout=180)
        ]),
        generate()
    ])

@solver
def ctf_agent(max_attempts: int = 3):
    return basic_agent(
        tools=[
            bash(timeout=180),
            python(timeout=180)
        ],
        max_attempts=max_attempts,
    )

@task
def ctf(solver: Solver | None = None):
    # use default tool loop solver if no solver specified
    if solver is None:
        solver = ctf_tool_loop()

    # return task
    return Task(
        dataset=read_dataset(),
        solver=solver,
        sandbox="docker",
        scorer=includes()
    )
```
Note that we use the chain()
function to combine multiple solvers into a composite one.
You can now switch between solvers when running the evaluation:
```bash
# run with the default solver (ctf_tool_loop)
inspect eval ctf.py

# run with the ctf agent solver
inspect eval ctf.py --solver=ctf_agent

# run with a different max_attempts
inspect eval ctf.py --solver=ctf_agent -S max_attempts=5
```
Note the use of the -S
CLI option to pass an alternate value for max_attempts
to the ctf_agent()
solver.
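The equivalent from Python is to construct the alternate solver yourself and pass it to the task’s solver parameter (a sketch; the module name and model are illustrative):

```python
from inspect_ai import eval

from ctf import ctf, ctf_agent  # assuming the task module above is named ctf.py

# equivalent to: inspect eval ctf.py --solver=ctf_agent -S max_attempts=5
eval(ctf(solver=ctf_agent(max_attempts=5)), model="openai/gpt-4o")
```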
Setup Parameter
In some cases, there will be important steps in the setup of a task that should not be substituted when another solver is used with the task. For example, you might have a step that does dynamic prompt engineering based on values in the sample metadata
or you might have a step that initialises resources in a sample’s sandbox.
In these scenarios you can define a setup
solver that is always run even when another solver
is substituted. For example, here we adapt our initial example to include a setup
step:
```python
# prompt solver which should always be run
@solver
def ctf_prompt():
    async def solve(state, generate):
        # TODO: dynamic prompt engineering
        return state

    return solve

@task
def ctf(solver: Solver | None = None):
    # use default tool loop solver if no solver specified
    if solver is None:
        solver = ctf_tool_loop()

    # return task
    return Task(
        dataset=read_dataset(),
        setup=ctf_prompt(),
        solver=solver,
        sandbox="docker",
        scorer=includes()
    )
```
Task Reuse
The basic mechanism for task re-use is to create flexible and adaptable base @task
functions (which often have many parameters) and then derive new higher-level tasks from them by creating additional @task
functions that call the base function.
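For example, a higher-level task can be just another @task function that calls the base task with the parameter values it wants to fix (a sketch reusing the security_guide() task defined earlier):

```python
from inspect_ai import task

from security import security_guide  # parameterised base task shown above

@task
def security_guide_researcher():
    # derive a new task by fixing the base task's parameters
    return security_guide(system="researcher.txt", grader="expert.txt")
```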
In some cases though you might not have full control over the base @task
function (e.g. it’s published in a Python package you aren’t the maintainer of) but you nevertheless want to flexibly create derivative tasks from it. To do this, you can use the task_with()
function, which clones an existing task and enables you to override any of the task fields that you need to.
For example, imagine you are dealing with a Task
that hard-codes its sandbox
to a particular Dockerfile included with the task, and further does not make a solver
parameter available to swap in other solvers:
```python
from inspect_ai import Task, task
from inspect_ai.solver import basic_agent
from inspect_ai.tool import bash
from inspect_ai.scorer import includes

@task
def hard_coded():
    return Task(
        dataset=read_dataset(),
        solver=basic_agent(tools=[bash()]),
        sandbox=("docker", "compose.yaml"),
        scorer=includes()
    )
```
Using task_with()
, you can adapt this task to use a different solver
and sandbox
entirely. For example, here we import the original hard_coded()
task from a hypothetical ctf_tasks
package and provide it with a different solver
and sandbox
, as well as give it a message_limit
(which we in turn also expose as a parameter of the adapted task):
```python
from inspect_ai import task, task_with
from inspect_ai.solver import solver

from ctf_tasks import hard_coded

@solver
def my_custom_agent():
    ## custom agent implementation
    ...

@task
def adapted(message_limit: int = 20):
    return task_with(
        # original task definition
        hard_coded(),
        solver=my_custom_agent(),
        sandbox=("docker", "custom-compose.yaml"),
        message_limit=message_limit
    )
```
Tasks are recipes for an evaluation and represent the convergence of many considerations (datasets, solvers, sandbox environments, limits, and scoring). Task variations often lie at the intersection of these, and the task_with()
function is intended to help you produce exactly the variation you need for a given evaluation.
Exploratory
When developing tasks and solvers, you often want to explore how changing prompts, generation options, solvers, and models affect performance on a task. You can do this by creating multiple tasks with varying parameters and passing them all to the eval_set()
function.
Returning to the example from above, the system
and grader
parameters point to files we are using as system message and grader model templates. At the outset we might want to explore every possible combination of these parameters, along with different models. We can use the itertools.product
function to do this:
```python
from itertools import product

from inspect_ai import eval_set

# 'grid' will be a permutation of all parameters
params = {
    "system": ["devops.txt", "researcher.txt"],
    "grader": ["hacker.txt", "expert.txt"],
    "grader_model": ["openai/gpt-4o", "google/gemini-1.5-pro"],
}
grid = list(product(*(params[name] for name in params)))

# run the evals and capture the logs
success, logs = eval_set(
    [
        security_guide(system, grader, grader_model)
        for system, grader, grader_model in grid
    ],
    model=["google/gemini-1.5-flash", "mistral/mistral-large-latest"],
    log_dir="security-tasks"
)

# analyze the logs...
plot_results(logs)
```
Note that we also pass a list of models via the model argument to try the task out on multiple models. In total this eval set will produce 16 tasks, accounting for both the parameter and model variation.
See the article on Eval Sets to learn more about using eval sets. See the article on Eval Logs for additional details on working with evaluation logs.