Agents API
Overview
This article describes advanced Inspect APIs available for creating evaluations with agents. You can also build agent evals using Inspect’s built-in Basic Agent or by bridging to an external agent library (see the main Agents article for further details). Topics covered in this article include:
- Sharing state across solvers and tools
- Creating a custom tool use loop
- Dynamically customising tool descriptions
- Observability with sample transcripts
- Delegating work to sub-tasks
- Sandboxing arbitrary code execution
We’ll assume that you have already covered the basics of Solvers, Tools, and Agents (please review those articles as required before proceeding).
Use of metadata
Before proceeding, it’s important to point out that some of the features described below were previously approximated by using the metadata field of TaskState. Specifically, metadata was often used as a catch-all storage location for:
- Sharing state between solvers.
- Providing a place to log additional structured data.
- Recording calls to “helper” models used for elicitation or scoring.
The metadata field no longer needs to be used for these scenarios (and in fact should now be treated as a read-only part of the TaskState). Below we’ll describe how the Store can be used for state, how structured data can be logged to the sample Transcript, and how all model calls are now automatically recorded and included in the transcript.
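For example, here is a minimal sketch of sharing state between two solvers via the Store. The record_hint and use_hint solvers and the "hint" key are purely illustrative, not part of the Inspect API:

from inspect_ai.solver import Generate, TaskState, solver
from inspect_ai.util import store

@solver
def record_hint():
    async def solve(state: TaskState, generate: Generate):
        # write shared state to the sample Store
        store().set("hint", "check the /tmp directory")
        return state
    return solve

@solver
def use_hint():
    async def solve(state: TaskState, generate: Generate):
        # read shared state written by an earlier solver
        hint = store().get("hint", "")
        state.user_prompt.text = f"{state.user_prompt.text}\n\nHint: {hint}"
        return await generate(state)
    return solve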
Tool Use
Custom Loop
The higher-level generate() function passed to solvers includes a built-in tool use loop: when the model calls a tool, Inspect calls the underlying Python function and reports the result to the model, proceeding until the model stops calling tools. However, for more advanced agents you may want to intervene in the tool use loop in a variety of ways:
- Redirect the model to another trajectory if it’s not on a productive course.
- Exercise more fine-grained control over which tool calls are made (and when and how many), and over how tool-calling errors are handled.
- Have multiple generate() passes, each with a distinct set of tools.
To do this, create a solver that emulates the default tool use loop and provides additional customisation as required. For example, here is a complete solver agent that has essentially the same implementation as the default generate() function:
from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import Generate, TaskState, solver

@solver
def agent_loop(message_limit: int = 50):
    async def solve(state: TaskState, generate: Generate):
        # establish message limit so we have a termination condition
        state.message_limit = message_limit

        # call the model in a loop
        while not state.completed:
            # call model
            output = await get_model().generate(state.messages, state.tools)

            # update state
            state.output = output
            state.messages.append(output.message)

            # make tool calls or terminate if there are none
            if output.message.tool_calls:
                state.messages.extend(
                    await call_tools(output.message, state.tools)
                )
            else:
                break

        return state

    return solve
The state.completed flag is automatically set to True if the message_limit or token_limit for the task is exceeded, so we check it at the top of the loop.
You can imagine several ways you might want to customise this loop:
- Adding another termination condition (e.g. the output satisfying some criteria).
- Urging the model to keep going after it decides to stop calling tools (see the sketch after this list).
- Examining and possibly filtering the tool calls before invoking call_tools().
- Adding a critique / reflection step between tool calling and generation.
- Forking the TaskState and exploring several trajectories.
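For example, here is a sketch of the "keep going" customisation mentioned above: nudging the model to continue when it stops calling tools before producing a final answer. The ANSWER: convention and the nudge message are illustrative assumptions, not part of the Inspect API:

from inspect_ai.model import ChatMessageUser

# replacement for the if/else at the bottom of the loop shown earlier
if output.message.tool_calls:
    state.messages.extend(
        await call_tools(output.message, state.tools)
    )
elif "ANSWER:" not in output.completion:
    # hypothetical convention: the system prompt asks the model to finish
    # with a line beginning "ANSWER:"; if it hasn't, urge it to continue
    state.messages.append(
        ChatMessageUser(content="You haven't provided a final answer yet. "
                                "Please continue, and finish with 'ANSWER: ...'")
    )
else:
    break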
Stop Reasons
One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason field of ModelOutput. For example:
output = await model.generate(state.messages, state.tools)
if output.stop_reason == "model_length":
    # do something to recover from context window overflow
    ...
Here are the possible values for StopReason:
| Stop Reason | Description |
|---|---|
| stop | The model hit a natural stop point or a provided stop sequence. |
| max_tokens | The maximum number of tokens specified in the request was reached. |
| model_length | The model’s context length was exceeded. |
| tool_calls | The model called a tool. |
| content_filter | Content was omitted due to a content filter. |
| unknown | Unknown (e.g. an unexpected runtime error). |
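For example, here is one possible way to recover from a context window overflow. The strategy of dropping the middle of the conversation, and the number of messages retained, are illustrative assumptions rather than a recommended policy:

from inspect_ai.model import get_model

output = await get_model().generate(state.messages, state.tools)
if output.stop_reason == "model_length":
    # illustrative recovery: keep the first two messages (e.g. system +
    # initial user prompt) plus the most recent ten, then retry
    state.messages = state.messages[:2] + state.messages[-10:]
    output = await get_model().generate(state.messages, state.tools)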
Error Handling
By default, expected errors (e.g. file not found, insufficient permission, timeouts, output limit exceeded, etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling, then rather than immediately appending the messages returned from call_tools() to state.messages (as shown above), check the error property of these messages (which will be None in the case of no error) and proceed accordingly.
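For example, here is a sketch of such an intervention inside the custom loop: it inspects the error property of the returned tool messages and gives up after several consecutive all-error turns. The threshold and the consecutive_errors counter are illustrative assumptions:

# inside the custom loop (consecutive_errors is initialised to 0 before it)
tool_messages = await call_tools(output.message, state.tools)

# illustrative policy: stop after 3 consecutive turns where every tool call failed
if tool_messages and all(message.error is not None for message in tool_messages):
    consecutive_errors += 1
    if consecutive_errors >= 3:
        break
else:
    consecutive_errors = 0

# otherwise forward the tool results (including any errors) to the model as usual
state.messages.extend(tool_messages)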
Note that you don’t necessarily even need to structure the agent using a loop. You might instead have an inner function implementing the loop, while an outer function dynamically swaps out which tools are available. For example, if the loop above were implemented in a function named tool_use_loop(), you might have an outer function like this:
# first pass w/ core tools
state.tools = [decompile(), disassemble(), bash()]
state = await tool_use_loop(state)

# second pass w/ prompt and python tool only
state.tools = [python()]
state = await tool_use_loop(state)
Taken together, these APIs enable you to build a custom version of generate() with whatever structure and logic you need.
Tool Descriptions
In some cases you may want to change the default descriptions created by a tool author, for example to provide better disambiguation between multiple similar tools that are used together. You might also need to do this during tool development (to explore which descriptions are most useful to models).
The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:
from inspect_ai.tool import tool_with

my_add = tool_with(
    tool=add(),
    name="my_add",
    description="a tool to add numbers",
    parameters={
        "x": "the x argument",
        "y": "the y argument"
    })
You need not provide all of the parameters shown above. For example, here are some cases where we modify just the main tool description, or only a single parameter description:
my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})
Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).
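Since each call to tool_with() yields an independent copy, you can derive several differently-described variants from one underlying tool, for instance to disambiguate two uses of the same add() tool. The variant names and descriptions below are purely illustrative:

add_scores = tool_with(add(), name="add_scores",
                       description="add two partial scores together")
add_counts = tool_with(add(), name="add_counts",
                       description="add two item counts together")

# the original add() tool is unchanged; both variants can be offered together
state.tools = [add_scores, add_counts]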
Transcripts
Transcripts provide a rich per-sample sequential view of everything that occurs during plan execution and scoring, including:
- Model interactions (including the raw API call made to the provider).
- Tool calls (including a sub-transcript of activity within the tool).
- Changes (in JSON Patch format) to the TaskState for the Sample.
- Scoring (including a sub-transcript of interactions within the scorer).
- Custom info() messages inserted explicitly into the transcript.
- Python logger calls (info level or designated custom log-level).
This information is provided within the Inspect log viewer in the Transcript tab (which sits alongside the Messages, Scoring, and Metadata tabs in the per-sample display).
Custom Info
You can insert custom entries into the transcript via the Transcript info() method (which creates an InfoEvent). Access the transcript for the current sample using the transcript() function, for example:
from inspect_ai.log import transcript

transcript().info("here is some custom info")
Strings passed to info() will be rendered as markdown. In addition to strings, you can also pass arbitrary JSON-serialisable objects to info().
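For example, passing a dict logs the object as structured data in the InfoEvent (the fields shown here are arbitrary):

transcript().info({
    "phase": "planning",        # arbitrary illustrative fields
    "candidate_actions": 3,
    "selected": "web_search"
})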
Grouping with Steps
You can create arbitrary groupings of transcript activity using the Transcript step() context manager. For example:
with transcript().step("reasoning"):
    ...
    state.store.set("next-action", next_action)
There are two reasons that you might want to create steps:
- Any changes to the store which occur during a step will be collected into a StoreEvent that records the changes (in JSON Patch format) that occurred.
- The Inspect log viewer will create a visual delineation for the step, which will make it easier to see the flow of activity within the transcript.
Subtasks
Subtasks provide a mechanism for creating isolated, re-usable units of execution. You might implement a complex tool using a subtask, or use subtasks in a multi-agent evaluation. The main characteristics of subtasks are:
- They run in their own async coroutine.
- They have their own isolated Store (no access to the sample Store).
- They have their own isolated Transcript.
To create a subtask, declare an async function with the @subtask decorator. The function can take any arguments and return a value of any type. For example:
from inspect_ai.util import store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # get links for these keywords
    links = await search_links(keywords)

    # add links to the store so they end up in the transcript
    store().set("links", links)

    # summarise the links
    return await fetch_and_summarise(links)
Note that we add links to the store not because we strictly need to for our implementation, but because we want the links to be recorded as part of the transcript.
Call the subtask as you would any async function:
summary = await web_search(keywords="solar power")
A few things will occur automatically when you run a subtask:
- New isolated Store and Transcript objects will be created for the subtask (accessible via the store() and transcript() functions). Changes to the Store that occur during execution will be recorded in a StoreEvent.
- A SubtaskEvent will be added to the current transcript. The event will include the name of the subtask, its input and results, and a transcript of all events that occur within the subtask.
You can also include one or more steps within a subtask.
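For example, here is a sketch of the earlier web_search() subtask with its link gathering grouped into a step. The step name is arbitrary, and search_links() and fetch_and_summarise() remain hypothetical helpers:

from inspect_ai.log import transcript
from inspect_ai.util import store, subtask

@subtask
async def web_search(keywords: str) -> str:
    # group link gathering (and its store changes) into a named step
    with transcript().step("gather-links"):
        links = await search_links(keywords)
        store().set("links", links)

    # summarise the links outside of the step
    return await fetch_and_summarise(links)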
Parallel Execution
You can execute subtasks in parallel using asyncio.gather(). For example, to run 3 web_search() subtasks in parallel:
import asyncio

searches = [
    web_search(keywords="solar power"),
    web_search(keywords="wind power"),
    web_search(keywords="hydro power"),
]
results = await asyncio.gather(*searches)
Note that we don’t await the subtasks when building up our list of searches. Rather, we let asyncio.gather() await all of them, returning only when all of the results are available.
Forking
Inspect’s fork() function provides a convenient wrapper around a very common use of subtasks: running a TaskState against a set of solvers in parallel to explore different trajectories.

For example, let’s say you have a solver named explore() that takes temperature as a parameter. You might want to try the solver out with multiple temperature values and then continue on with the best result:
from inspect_ai.solver import fork

results = await fork(state, [
    explore(temperature=0.5),
    explore(temperature=0.75),
    explore(temperature=1.0)
])
The state will be deep copied so that each explore() solver instance gets its own copy of the state to work on. The results contain a list of TaskState objects with the value returned from each of the solvers.
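You might then continue with whichever trajectory looks most promising. For example, here is a deliberately simplistic selection criterion (longest completion), shown purely for illustration:

# pick one of the forked states to continue with (illustrative criterion)
state = max(results, key=lambda s: len(s.output.completion))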
Sandboxing
Many agents provide models with the ability to execute arbitrary code. It’s important that this code be sandboxed so that it executes in an isolated context. Inspect supports this through the SandboxEnvironment (which in turn may be implemented using Docker or various other schemes). Enable sandboxing for a task with the sandbox parameter. For example:
@task
def file_probe():
    return Task(
        dataset=dataset,
        solver=[
            use_tools([list_files()]),
            generate()
        ],
        sandbox="docker",
        scorer=includes(),
    )
Use the SandboxEnvironment within a tool via the sandbox() function. For example, here’s an implementation of the list_files() tool referenced above:
from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute
See the section on Sandbox Environments for further details on using sandboxes with Inspect.