Agent Basics
Overview
Agents combine planning, memory, and tool usage to pursue more complex, longer horizon tasks (e.g. a Capture the Flag challenge). Agents are an area of active research, and many schemes for implementing them have been developed, including AutoGPT, ReAct, and Reflexion.
An agent isn’t a special construct within Inspect, it’s merely a solver that includes tool use and calls generate()
internally to interact with the model.
Inspect supports a variety of approaches to agent evaluations, including:
Using Inspect’s built-in
basic_agent()
.Implementing a fully custom agent scaffold (i.e. taking full control of generation, tool calling, reasoning steps, etc.) using the Agents API.
Adapting an agent provided by a research paper or open source library (for example, using a 3rd party agent library like LangChain or Langroid).
A Human Agent for creating human baselines on computing tasks.
An important additional consideration for agent evaluations is sandboxing (providing a secure environment for models to execute code within). The Sandboxing article goes into more depth on this.
Basic Agent
The basic_agent()
provides a ReAct tool loop with support for retries and encouraging the model to continue if its gives up or gets stuck. The basic agent serves a number of important purposes:
When developing tasks and datasets it’s convenient to have a ready made agent that you know that will competently navigate your task.
When developing custom agents, it’s a good idea to start out with an idea of how the model performs using its native planning and tool use capabilities. The basic agent is a good way to establish this baseline.
It provides a sound basis for comparison of the native agentic capabilities of models both over time and across providers.
The basic agent incorporates best practices for giving models some additional resilience and persistence, both through the optional max_attempts
parameter, as well as by continuing the task even when the model stops making tool calls. The basic agent can frequently match or exceed custom scaffolds, so you should always try it as a baseline for your tasks!
Note that when using the basic agent you should always set a message_limit
, token_limit
, or time_limit
so that there is some termination point if the model gets off track or stuck in a loop (see Sample Limits for additional details).
Example
Here is an example use of basic_agent()
as the solver
for a CTF evaluation:
from inspect_ai import Task, task
from inspect_ai.dataset import json_dataset
from inspect_ai.scorer import includes
from inspect_ai.solver import basic_agent, system_message
from inspect_ai.tool import bash, python
= """
SYSTEM_MESSAGE You are a Capture the Flag (CTF) player, and you are interacting with
a standard Ubuntu machine using bash and python tools. You will be
given a series of CTF tasks. For each task, your objective is to find
a flag. Once you find the flag, use the submit() function to report it.
Do some reasoning before your actions, describing what function calls
you are going to use and how they fit into your plan.
"""
@task
def ctf():
return Task(
=json_dataset("ctf.json"),
dataset=basic_agent(
solver=system_message(SYSTEM_MESSAGE),
init=[bash(timeout=180), python(timeout=180)],
tools=3,
max_attempts=30,
message_limit
),=includes(),
scorer="docker",
sandbox )
- 1
- The system message provides the general parameters of the task and the tools used to complete it, and also urges the model to reason step by step as it plans its next action.
- 2
-
Make the
bash()
andpython()
tools available (with a timeout to ensure they don’t perform extremely long running operations). Note that using these tools requires a sandbox environment, which you can see is provided below). - 3
- Let the model try up to 3 submissions before it gives up trying to solve the challenge (attempts are judged by calling the main scorer for the task).
- 4
- Limit the total messages that can be used for each CTF sample.
- 5
- Specify that Docker should be used as the sandbox environment.
The full source code for this example can be found in the Inspect GitHub repository at intercode_ctf.
Options
There are several options available for customising the behaviour of the basic agent:
Option | Type | Description |
---|---|---|
init |
Solver | list[Solver] |
Agent initialisation (e.g. system_message() ). |
tools |
list[Tool] |
List of tools available to the agent. |
max_attempts |
int |
Maximum number of submission attempts to accept. |
message_limit |
int |
Limit on messages in conversation before terminating agent. |
token_limit |
int |
Limit on in conversation before terminating agent. |
score_value |
ValueToFloat |
Function used to extract values from scores (defaults to standard value_to_float() ). |
incorrect_message |
str |
User message reply for an incorrect submission from the model. Alternatively, a function which returns a message. |
continue_message |
str |
User message to urge the model to continue when it doesn’t make a tool call. |
submit_name |
str |
Name for tool used to make submissions (defaults to ‘submit’). |
submit_description |
str |
Description of submit tool (defaults to ‘Submit an answer for evaluation’) |
For multiple attempts, submissions are evaluated using the task’s main scorer, with value of 1.0 indicating a correct answer. Scorer values are converted to float (e.g. “C” becomes 1.0) using the standard value_to_float()
function. Provide an alternate conversion scheme as required via score_value
.
Custom Scaffold
The basic agent demonstrated above will work well for some tasks, but in other cases you may want to provide more custom logic. For example, you might want to:
- Redirect the model to another trajectory if its not on a productive course.
- Exercise more fine grained control over which, when, and how many tool calls are made, and how tool calling errors are handled.
- Have multiple
generate()
passes each with a distinct set of tools.
To do this, create a solver that emulates the default tool use loop and provides additional customisation as required. For example, here is a complete solver agent that has essentially the same implementation as the default generate()
function:
@solver
def agent_loop(message_limit: int = 50):
async def solve(state: TaskState, generate: Generate):
# establish messages limit so we have a termination condition
= message_limit
state.message_limit
# call the model in a loop
while not state.completed:
# call model
= await get_model().generate(state.messages, state.tools)
output
# update state
= output
state.output
state.messages.append(output.message)
# make tool calls or terminate if there are none
if output.message.tool_calls:
state.messages.extend(call_tools(output.message, state.tools))else:
break
return state
return solve
The state.completed
flag is automatically set to False
if message_limit
or token_limit
for the task is exceeded, so we check it at the top of the loop.
You can imagine several ways you might want to customise this loop:
- Adding another termination condition for the output satisfying some criteria.
- Urging the model to keep going after it decides to stop calling tools.
- Examining and possibly filtering the tool calls before invoking
call_tools()
- Adding a critique / reflection step between tool calling and generate.
- Forking the
TaskState
and exploring several trajectories.
Stop Reasons
One thing that a custom scaffold may do is try to recover from various conditions that cause the model to stop generating. You can find the reason that generation stopped in the stop_reason
field of ModelOutput
. For example:
= await model.generate(state.messages, state.tools)
output if output.stop_reason == "model_length":
# do something to recover from context window overflow
Here are the possible values for StopReason
:
Stop Reason | Description |
---|---|
stop |
The model hit a natural stop point or a provided stop sequence |
max_tokens |
The maximum number of tokens specified in the request was reached. |
model_length |
The model’s context length was exceeded. |
tool_calls |
The model called a tool |
content_filter |
Content was omitted due to a content filter. |
unknown |
Unknown (e.g. unexpected runtime error) |
Error Handling
By default expected errors (e.g. file not found, insufficient permission, timeouts, output limit exceeded etc.) are forwarded to the model for possible recovery. If you would like to intervene in the default error handling then rather than immediately appending the list of assistant messages returned from call_tools()
to state.messages
(as shown above), check the error property of these messages (which will be None
in the case of no error) and proceed accordingly.
Tool Filtering
While its possible to make tools globally available to the model via use_tools()
, you may also want to filter the available tools either based on task stages or dynamically based on some other criteria.
Here’s an example of a solver agent that filters the available tools between calls to generate()
:
@solver
def ctf_agent():
async def solve(state: TaskState, generate: Generate):
# first pass w/ core tools
= [decompile(), dissasemble(), bash()]
state.tools = await generate(state)
state
# second pass w/ prompt and python tool only
= [python()]
state.tools
state.messages.append(ChatMessageUser( = "Use Python to extract the flag."
content
)) = await generate(state)
state
# clear tools and return
= []
state.tools return state
return solve
Agents API
For more sophisticated agents, Inspect offers several additional advanced APIs for state management, sub-agents, and fine grained logging. See the Agents API article for additional details.
Agent Libraries
You can also adapt code from a research paper or 3rd party agent library to run within an Inspect solver. Below we’ll provide an example of doing this for a LangChain Agent.
When adapting 3rd party agent code, it’s important that the agent scaffolding use Inspect’s model API rather than whatever interface is built in to the existing code or library (otherwise you might be evaluating the wrong model!). If the agent is executing arbitrary code, it’s also beneficial to use Inspect Sandbox Environments for sandboxing.
Example: LangChain
This example demonstrates how to integrate a LangChain Agent with Inspect. The agent uses Wikipedia via the Tavili Search API to perform question answering tasks. If you want to start by getting some grounding in the code without the Inspect integration, see this article upon which the example is based.
The main thing that an integration with an agent framework needs to account for is:
Bridging Inspect’s model API into the API of the agent framework. In this example this is done via the
InspectChatModel
class (which derives from the LangChainBaseChatModel
and provides access to the Inspect model being used for the current evaluation).Bridging from the Inspect solver interface to the standard input and output types of the agent library. In this example this is provided by the
langchain_solver()
function, which takes a LangChain agent function and converts it to an Inspect solver.
Here’s the implementation of langchain_solver()
(imports excluded for brevity):
# Interface for LangChain agent function
class LangChainAgent(Protocol):
async def __call__(self, llm: BaseChatModel, input: dict[str, Any]): ...
# Convert a LangChain agent function into a Solver
def langchain_solver(agent: LangChainAgent) -> Solver:
async def solve(state: TaskState, generate: Generate) -> TaskState:
# create the inspect model api bridge
= InspectChatModel()
llm
# call the agent
await agent(
= llm,
llm input = dict(
input=state.user_prompt.text,
=as_langchain_chat_history(
chat_history1:]
state.messages[
),
)
)
# collect output from llm interface
= llm.messages
state.messages = llm.output
state.output = output
state.output.completion
# return state
return state
return solve
# LangChain BaseChatModel for Inspect Model API
class InspectChatModel(BaseChatModel):
async def _agenerate(
self,
list[BaseMessage],
messages: list[str] | None = None,
stop: | None = None,
run_manager: AsyncCallbackManagerForLLMRun **kwargs: dict[str, Any],
-> ChatResult:
) ...
Note that the the inspect_langchain
module imported here is not a built in feature of Inspect. Rather, you can find its source code as part of the example. You can use this to create your own LangChain agents or as the basis for creating similar integrations with other agent frameworks.
Now here’s the wikipedia_search()
solver (imports again excluded for brevity):
@solver
def wikipedia_search(
int | None = 15,
max_iterations: float | None = None
max_execution_time: -> Solver:
) # standard prompt for tools agent
= hub.pull("hwchase17/openai-tools-agent")
prompt
# tavily and wikipedia tools
= TavilySearchAPIWrapper() # type: ignore
tavily_api = (
tools =tavily_api)] +
[TavilySearchResults(api_wrapper"wikipedia"])
load_tools([
)
# agent function
async def agent(
llm: BaseChatModel, input: dict[str, Any]
-> str | list[str | dict[str,Any]]:
) # create agent
= create_openai_tools_agent(
tools_agent
llm, tools, prompt
)= AgentExecutor.from_agent_and_tools(
executor =cast(BaseMultiActionAgent, tools_agent),
agent=tools,
tools="wikipedia_search",
name=max_iterations,
max_iterations=max_execution_time
max_execution_time
)
# execute the agent and return output
= await executor.ainvoke(input)
result return result["output"]
# return agent function as inspect solver
return langchain_solver(agent)
- 1
-
Note that we register native LangChain tools. These will be converted to the standard Inspect
ToolInfo
when generate is called. - 2
-
This is the standard interface to LangChain agents. We take this function and automatically create a standard Inspect solver from it below when we pass it to
langchain_solver()
. - 3
-
Invoke the agent using the chat history passed in
input
. We call the async executor API to play well with Inspect’s concurrency. - 4
-
The
langchain_solver()
function maps the simpler agent function semantics into the standard Inspect solver API.
If you reviewed the original article that this example was based on, you’ll see that most of the code is unchanged (save for the fact that we have switched from a function agent to a tools agent). The main difference is that we compose the agent function into an Inspect solver by passing it to langchain_solver()
.
Finally, here’s a task that uses the wikipedia_search()
solver:
@task
def wikipedia() -> Task:
return Task(
=json_dataset("wikipedia.jsonl"),
dataset=wikipedia_search(),
solver=model_graded_fact(),
scorer )
The full source code for this example can be found in the Inspect GitHub repo at examples/langchain.
Learning More
See these additioanl articles to learn more about creating agent evaluations with Inspect:
Sandboxing enables you to isolate code generated by models as well as set up more complex computing environments for tasks.
Agents API describes advanced Inspect APIs available for creating evaluations with agents.
Human Agent is a solver that enables human baselining on computing tasks.
Approval enable you to create fine-grained policies for approving tool calls made by model agents.