Agents API

Overview

This article describes advanced Inspect APIs available for creating evaluations with agents. You can also build agent evals using Inspect’s built-in Basic Agent or by bridging to an external agent library (see the main Agents article for further details). Topics covered in this article include:

  1. Sharing per-sample state across solvers and tools
  2. Creating a custom tool use loop
  3. Dynamically customising tool descriptions
  4. Observability with sample transcripts
  5. Delegating work to sub-tasks
  6. Sandboxing arbitrary code execution

We’ll assume that you have already covered the basics of Solvers, Tools, and Agents (please review those articles as required before proceeding).

Sample Store

Sequences of solvers executing against a sample often need to store and manipulate shared state. Further, tools may often want their own persistent state (or groups of tools may want to share state). This can be accomplished in Inspect using the Store, which provides a sample-scoped scratchpad for arbitrary values.

The core of the Store interface is:

from inspect_ai.util import Store

class Store:
    def get(self, key: str, default: VT) -> VT
    def set(self, key: str, value: Any) -> None
    def delete(self, key: str) -> None

Note that the core Store interface is a property bag without strong typing. See the section below on typed store access for details on how to interact with the store in a typesafe fashion.

Basic views on the store’s collection (e.g. items(), keys(), values()) are also provided. Note that the get() method will automatically add the default to the store if it doesn’t exist.

The Store can be accessed via TaskState as follows:

history = state.store.get("history", [])
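
For example, a solver might read a shared list, append to it, and write it back (a minimal sketch; the "history" key and the recorded value are purely illustrative):

# inside a solver, after a call to generate(): record the completion
# in a shared "history" list (key name is illustrative)
history = state.store.get("history", [])
history.append(state.output.completion)
state.store.set("history", history)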

It is also possible to access the Store for the current sample using the store() function. This is the mechanism for tools to read and write the Store. For example:

from inspect_ai.tool import tool
from inspect_ai.util import store

@tool
def web_browser_back():
    async def execute() -> str:
        """Go back one page in the browser history."""
        history = store().get("web_browser:history", [])
        return history.pop()

    return execute

While there is no formal namespacing mechanism for the Store, this can be informally achieved using key prefixes as demonstrated above.
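
For example, a hypothetical companion tool could share state with web_browser_back() by writing to the same prefixed key (a sketch reusing the imports from the example above):

@tool
def web_browser_visit():
    async def execute(url: str) -> str:
        """Visit a URL and record it in the shared history.

        Args:
            url (str): URL to visit

        Returns:
            Confirmation message
        """
        history = store().get("web_browser:history", [])
        history.append(url)
        store().set("web_browser:history", history)
        return f"visited {url}"

    return execute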

You should generally try to use JSON serialisable Python types in the Store (e.g. objects should be dataclasses or Pydantic BaseModel) so that they can be recorded in the Transcript.

While the default Store for a sample is shared globally between solvers and tools, a more narrowly scoped Store is created automatically for Subtasks.

Store Typing

If you prefer a typesafe interface to the sample store, you can define a Pydantic model which reads and writes values into the store. There are several benefits to using Pydantic models for store access:

  1. You can provide type annotations and validation rules for all fields.
  2. Default values for all fields are declared using standard Pydantic syntax.
  3. Store names are automatically namespaced (to prevent conflicts between multiple store accessors).

Definition

First, derive a class from StoreModel (which in turn derives from Pydantic BaseModel):

from pydantic import Field
from inspect_ai.util import StoreModel

class Activity(StoreModel):
    active: bool = Field(default=False)
    tries: int = Field(default=0)
    actions: list[str] = Field(default_factory=list)

Note that we define defaults for all fields. This is generally required so that you can initialise your Pydantic model from an empty store. For collections (list and dict) you should use default_factory so that each instance gets its own default.

Usage

Use the store_as() function to get a typesafe interface to the store based on your model:

# typed interface to store from state
activity = state.store_as(Activity)
activity.active = True
activity.tries += 1

# global store_as() function (e.g. for use from tools)
from inspect_ai.util import store_as
activity = store_as(Activity)

Note that all instances of Activity created within a running sample share the same sample Store, so they can see each other’s changes. For example, you can call state.store_as() in multiple solvers and/or scorers and it will resolve to the same sample-scoped instance.

The names used in the underlying Store are namespaced to prevent collisions with other Store accessors. For example, the active field in the Activity class is written to the store with the name Activity:active.
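
For example, the same underlying value could be read through the untyped interface as follows:

# equivalent raw access to the namespaced key
active = state.store.get("Activity:active", False)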

Explicit Store

The store_as() function automatically binds to the current sample Store. You can alternatively create an explicit Store and pass it directly to the model (e.g. for testing purposes):

from inspect_ai.util import Store
store = Store()
activity = Activity(store=store)
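
For example, here is a minimal sketch of a unit test that exercises the model’s defaults against a fresh Store:

def test_activity_defaults():
    activity = Activity(store=Store())
    assert activity.active is False
    assert activity.tries == 0
    assert activity.actions == []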

Tool Use

Custom Loop

The higher level generate() function passed to solvers includes a built-in tool use loop—when the model calls a tool, Inspect calls the underlying Python function and reports the result to the model, proceeding until the model stops calling tools. However, for more advanced agents you may want to intervene in the tool use loop in a variety of ways:

  1. Redirect the model to another trajectory if it’s not on a productive course.
  2. Exercise more fine-grained control over which, when, and how many tool calls are made, and how tool calling errors are handled.
  3. Have multiple generate() passes each with a distinct set of tools.

To do this, create a solver that emulates the default tool use loop and provides additional customisation as required. For example, here is a complete solver agent that has essentially the same implementation as the default generate() function:

from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import Generate, TaskState, solver

@solver
def agent_loop(message_limit: int = 50):
    async def solve(state: TaskState, generate: Generate):

        # establish messages limit so we have a termination condition
        state.message_limit = message_limit

        # call the model in a loop
        while not state.completed:
            # call model
            output = await get_model().generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(
                    await call_tools(output.message, state.tools)
                )
            else:
                break

        return state

    return solve

The state.completed flag is automatically set to True if the message_limit or token_limit for the task is exceeded, so we check it at the top of the loop.

You can imagine several ways you might want to customise this loop:

  1. Adding another termination condition for the output satisfying some criteria.
  2. Urging the model to keep going after it decides to stop calling tools (see the sketch after this list).
  3. Examining and possibly filtering the tool calls before invoking call_tools().
  4. Adding a critique / reflection step between tool calling and generate.
  5. Forking the TaskState and exploring several trajectories.
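
As an illustration of (2), here is a minimal sketch of a loop variant that nudges the model to continue the first few times it stops calling tools. The max_nudges parameter and the nudge message are assumptions for the sketch, not part of Inspect:

from inspect_ai.model import ChatMessageUser, call_tools, get_model
from inspect_ai.solver import Generate, TaskState, solver

@solver
def persistent_agent_loop(message_limit: int = 50, max_nudges: int = 2):
    async def solve(state: TaskState, generate: Generate):
        state.message_limit = message_limit
        nudges = 0
        while not state.completed:
            output = await get_model().generate(state.messages, state.tools)
            state.output = output
            state.messages.append(output.message)
            if output.message.tool_calls:
                state.messages.extend(
                    await call_tools(output.message, state.tools)
                )
            elif nudges < max_nudges:
                # hypothetical nudge: urge the model to keep working
                nudges += 1
                state.messages.append(
                    ChatMessageUser(content="Please continue with the task.")
                )
            else:
                break
        return state

    return solve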

Stop Reasons

One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason field of ModelOutput. For example:

output = await model.generate(state.messages, state.tools)
if output.stop_reason == "model_length":
    ...  # do something to recover from context window overflow
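
For example, one possible (though lossy) recovery from a context window overflow is to drop the oldest non-system messages and retry the generation. This is a sketch only; real agents may prefer summarisation or other strategies:

if output.stop_reason == "model_length":
    # drop the oldest half of the non-system messages and retry
    system = [m for m in state.messages if m.role == "system"]
    rest = [m for m in state.messages if m.role != "system"]
    state.messages = system + rest[len(rest) // 2:]
    output = await model.generate(state.messages, state.tools)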

Here are the possible values for StopReason:

  Stop Reason      Description
  stop             The model hit a natural stop point or a provided stop sequence.
  max_tokens       The maximum number of tokens specified in the request was reached.
  model_length     The model’s context length was exceeded.
  tool_calls       The model called a tool.
  content_filter   Content was omitted due to a content filter.
  unknown          Unknown (e.g. unexpected runtime error).

Error Handling

By default, expected errors (e.g. file not found, insufficient permission, timeouts, output limit exceeded, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the list of tool messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None in the case of no error) and proceed accordingly.
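
For example, inside the loop above you might inspect tool errors before deciding whether to forward them (a sketch; the handling itself is up to you):

tool_messages = await call_tools(output.message, state.tools)
for message in tool_messages:
    if message.error is not None:
        # custom handling, e.g. rewrite the error, retry the call,
        # or terminate the loop entirely
        ...
state.messages.extend(tool_messages)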

Tool Filtering

Note that you don’t necessarily need to structure the agent as a single loop. For instance, you might have an inner function implementing the loop, while an outer function dynamically swaps out which tools are available. If the loop above were implemented in a function named tool_use_loop(), the outer function might look like this:

# first pass w/ core tools
state.tools = [decompile(), disassemble(), bash()]
state = await tool_use_loop(state)

# second pass w/ python tool only
state.tools = [python()]
state = await tool_use_loop(state)

Taken together these APIs enable you to build a custom version of generate() with whatever structure and logic you need.

Tool Descriptions

In some cases you may want to change the default descriptions created by a tool author. For example, you might want to provide better disambiguation between multiple similar tools that are used together. You might also need to do this during development of tools (to explore which descriptions are most useful to models).

The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:

from inspect_ai.tool import tool_with

my_add = tool_with(
  tool=add(), 
  name="my_add",
  description="a tool to add numbers", 
  parameters={
    "x": "the x argument",
    "y": "the y argument"
  })

You need not provide all of the parameters shown above. For example, here we modify just the main tool description, or only a single parameter description:

my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})

Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).

Transcripts

Transcripts provide a rich per-sample sequential view of everything that occurs during plan execution and scoring, including:

  • Model interactions (including the raw API call made to the provider).
  • Tool calls (including a sub-transcript of activity within the tool).
  • Changes (in JSON Patch format) to the TaskState for the Sample.
  • Scoring (including a sub-transcript of interactions within the scorer).
  • Custom info() messages inserted explicitly into the transcript.
  • Python logger calls (info level or designated custom log-level).

This information is provided within the Inspect log viewer in the Transcript tab (which sits alongside the Messages, Scoring, and Metadata tabs in the per-sample display).

Custom Info

You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:

from inspect_ai.log import transcript

transcript().info("here is some custom info")

Strings passed to info() will be rendered as markdown. In addition to strings you can also pass arbitrary JSON serialisable objects to info().
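
For example, you might record a small JSON serialisable summary of the current state (the field names here are purely illustrative):

transcript().info({"phase": "planning", "tool_calls": 3})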

Grouping with Steps

You can create arbitrary groupings of transcript activity using the Transcript step() context manager. For example:

with transcript().step("reasoning"):
    ...
    state.store.set("next-action", next_action)

There are two reasons that you might want to create steps:

  1. Any changes to the store which occur during a step will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
  2. The Inspect log viewer will create a visual delineation for the step, which will make it easier to see the flow of activity within the transcript.

Subtasks

Subtasks provide a mechanism for creating isolated, re-usable units of execution. You might implement a complex tool using a subtask, or use subtasks in a multi-agent evaluation. The main characteristics of subtasks are:

  1. They run in their own async coroutine.
  2. They have their own isolated Store (no access to the sample Store).
  3. They have their own isolated Transcript.

To create a subtask, declare an async function with the @subtask decorator. The function can take any arguments and return a value of any type. For example:

from inspect_ai.util import store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # get links for these keywords
    links = await search_links(keywords)

    # add links to the store so they end up in the transcript
    store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)

Note that we add links to the store not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.

Call the subtask as you would any async function:

summary = await web_search(keywords="solar power")

A few things will occur automatically when you run a subtask:

  • New isolated Store and Transcript objects will be created for the subtask (accessible via the store() and transcript() functions). Changes to the Store that occur during execution will be recorded in a StoreEvent.

  • A SubtaskEvent will be added to the current transcript. The event will include the name of the subtask, its input and results, and a transcript of all events that occur within the subtask.

You can also include one or more steps within a subtask.
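
For example, the web_search() subtask above could group its link gathering into a step (a sketch reusing the hypothetical helpers and imports from the examples above):

@subtask
async def web_search(keywords: str) -> str:
    # group link gathering into its own step
    with transcript().step("gather_links"):
        links = await search_links(keywords)
        store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)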

Parallel Execution

You can execute subtasks in parallel using asyncio.gather(). For example, to run 3 web_search() subtasks in parallel:

import asyncio

searches = [
  web_search(keywords="solar power"),
  web_search(keywords="wind power"),
  web_search(keywords="hydro power"),
]

results = await asyncio.gather(*searches)

Note that we don’t await the subtasks when building up our list of searches. Rather, we let asyncio.gather() await all of them, returning only when all of the results are available.

Forking

Inspect’s fork() function provides a convenient wrapper around a very common use of subtasks: running a TaskState against a set of solvers in parallel to explore different trajectories.

For example, let’s say you have a solver named explore() that takes temperature as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:

from inspect_ai.solver import fork

results = await fork(state, [
    explore(temperature = 0.5),
    explore(temperature = 0.75),
    explore(temperature = 1.0)
])

The state will be deep copied so that each explore() solver instance gets its own copy of the state to work on. The results are a list of TaskState objects, one returned from each of the solvers.
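
For example, you might continue with whichever trajectory scores best according to your own criterion (score_trajectory() below is a hypothetical helper, not part of Inspect):

# pick the most promising trajectory and carry it forward
best = max(results, key=score_trajectory)
state.messages = best.messages
state.output = best.output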

Sandboxing

Many agents provide models with the ability to execute arbitrary code. It’s important that this code be sandboxed so that it executes in an isolated context. Inspect supports this through the SandboxEnvironment (which in turn may be implemented using Docker or various other schemes). Enable sandboxing for a task with the sandbox parameter. For example:

from inspect_ai import Task, task
from inspect_ai.scorer import includes
from inspect_ai.solver import generate, use_tools

@task
def file_probe():
    return Task(
        dataset=dataset,
        solver=[
            use_tools([list_files()]),
            generate()
        ],
        sandbox="docker",
        scorer=includes(),
    )

Use the SandboxEnvironment within a tool via the sandbox() function. For example, here’s an implementation of the list_files() tool referenced above:

from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute

See the section on Sandboxing for further details on using sandboxes with Inspect.