Tools
Overview
Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks.
Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools (bash, python, and web_search).
One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. This is covered in more depth in the Agents section.
Built-In Tools
Inspect has several built-in tools, including:
Bash and Python for executing arbitrary shell and Python code.
Web Browser, which provides the model with a headless Chromium web browser that supports navigation, history, and mouse/keyboard interactions.
Web Search, which uses the Google Search API to execute and summarise web searches.
If you are only interested in using the built-in tools, check out their respective documentation links above. To learn more about creating your own tools read on immediately below.
Tool Basics
To demonstrate the use of tools, we’ll define a simple tool that adds two numbers, using the @tool decorator to register it with the system.
@tool
def add():
    async def execute(x: int, y: int):
        """
        Add two numbers.

        Args:
            x: First number to add.
            y: Second number to add.

        Returns:
            The sum of the two numbers.
        """
        return x + y

    return execute
Annotations
Note that we provide type annotations for both arguments:
async def execute(x: int, y: int)
Further, we provide descriptions for each parameter in the documentation comment:
Args:
    x: First number to add.
    y: Second number to add.
Type annotations and descriptions are required for tool declarations so that the model can be informed which types to pass back to the tool function and what the purpose of each parameter is.
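To make the role of annotations and doc comments more concrete, here is a minimal sketch of how parameter types and descriptions could be read from a function using only the standard library. This is an illustration, not Inspect's actual implementation: the `describe_parameters` helper and the shape of the dictionary it returns are hypothetical.

```python
import inspect
from typing import get_type_hints


def describe_parameters(func) -> dict:
    """Collect each parameter's type name and docstring description."""
    hints = get_type_hints(func)
    descriptions = {}
    in_args = False
    for line in (inspect.getdoc(func) or "").splitlines():
        stripped = line.strip()
        if stripped == "Args:":
            in_args = True
        elif stripped.endswith(":") and " " not in stripped:
            in_args = False  # another section such as "Returns:" begins
        elif in_args and ":" in stripped:
            name, text = stripped.split(":", 1)
            descriptions[name.strip()] = text.strip()

    schema = {}
    for name in inspect.signature(func).parameters:
        # Both pieces of information are needed to describe the parameter.
        if name not in hints or name not in descriptions:
            raise ValueError(f"Parameter '{name}' needs a type and a description.")
        type_name = getattr(hints[name], "__name__", str(hints[name]))
        schema[name] = {"type": type_name, "description": descriptions[name]}
    return schema


async def execute(x: int, y: int):
    """
    Add two numbers.

    Args:
        x: First number to add.
        y: Second number to add.

    Returns:
        The sum of the two numbers.
    """
    return x + y


schema = describe_parameters(execute)
```

If either the annotation or the description is missing, the helper raises an error, which mirrors why both are required for tool declarations.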
Note that while you are required to provide default descriptions for tools and their parameters within doc comments, you can also make these dynamically customisable by users of your tool (see the section below on Tool Descriptions for details on how to do this).
Using Tools
We can use this tool in an evaluation by passing it to the use_tools() Solver:
@task
def addition_problem():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target=["2"])],
        solver=[
            use_tools(add()),
            generate()
        ],
        scorer=match(numeric=True),
    )
Note that this tool doesn’t make network requests or do heavy computation, so it is fine to run as inline Python code. If your tool does do more elaborate things, you’ll want to make sure it plays well with Inspect’s concurrency scheme. For network requests, this amounts to using async HTTP calls with httpx. For heavier computation, tools should use subprocesses as described in the next section.
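As a rough illustration of moving heavier work out of the event loop, the sketch below runs a computation in a child process using the standard library's `asyncio.create_subprocess_exec`. It does not use Inspect's own subprocess helper; the `run_in_subprocess` and `heavy_sum` names and their parameters are hypothetical.

```python
import asyncio
import sys


async def run_in_subprocess(args: list, timeout: float = 30.0) -> str:
    """Run a command in a child process and return its decoded stdout."""
    proc = await asyncio.create_subprocess_exec(
        *args,
        stdout=asyncio.subprocess.PIPE,
        stderr=asyncio.subprocess.PIPE,
    )
    try:
        stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout)
    except asyncio.TimeoutError:
        # Stop the child so it does not outlive the timed-out call.
        proc.kill()
        await proc.wait()
        raise TimeoutError(f"Command timed out after {timeout} seconds.")
    if proc.returncode != 0:
        raise RuntimeError(stderr.decode())
    return stdout.decode()


def heavy_sum(n: int) -> int:
    """Offload a CPU-bound sum to a separate Python interpreter."""
    code = f"print(sum(range({n})))"
    output = asyncio.run(run_in_subprocess([sys.executable, "-c", code]))
    return int(output.strip())
```

Because the work happens in another process, the event loop stays free to service other tool calls while waiting for the result.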
Note that when using tools with models, the models do not call the Python function directly. Rather, the model generates a structured request which includes function parameters, and then Inspect calls the function and returns the result to the model.
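The request and response flow described here can be sketched with plain Python. This is a simplified model of the idea rather than Inspect's internals: the `TOOLS` registry, the `dispatch` function, and the JSON layout of the request are all assumptions made for illustration.

```python
import asyncio
import json


async def add(x: int, y: int) -> int:
    """Stand-in for a tool's execute function."""
    return x + y


# Hypothetical registry mapping tool names to their async functions.
TOOLS = {"add": add}


async def dispatch(tool_call_json: str) -> str:
    """Parse a structured tool request, run the tool, and return the result as text."""
    call = json.loads(tool_call_json)
    func = TOOLS[call["function"]]
    result = await func(**call["arguments"])
    return str(result)


# The model emits a request like this rather than calling Python itself.
request = json.dumps({"function": "add", "arguments": {"x": 1, "y": 1}})
reply = asyncio.run(dispatch(request))
```

The text in `reply` is what would be handed back to the model as the tool's output.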
Tool Errors
Various errors can occur during tool execution, especially when interacting with the file system or network or when using Sandbox Environments to execute code in a container sandbox. As a tool writer you need to decide how you’d like to handle error conditions. A number of approaches are possible:
Notify the model that an error occurred to see whether it can recover.
Catch and handle the error internally (trying another code path, etc.).
Allow the error to propagate, resulting in the current Sample failing with an error state.
There are no universally correct approaches as tool usage and semantics can vary widely—some rough guidelines are provided below.
Default Handling
If you do not explicitly handle errors, then Inspect provides some default error handling behaviour. Specifically, if any of the following errors are raised they will be handled and reported to the model:
TimeoutError: Occurs when a call to subprocess() or sandbox().exec() times out.
PermissionError: Occurs when there are inadequate permissions to read or write a file.
UnicodeDecodeError: Occurs when the output from executing a process or reading a file is binary rather than text.
OutputLimitExceededError: Occurs when one or both of the output streams from sandbox().exec() exceed 10 MiB or when attempting to read a file over 100 MiB in size.
ToolError: Special error thrown by tools to indicate they’d like to report an error to the model.
These are all errors that are expected (in fact the SandboxEnvironment interface documents them as such) and possibly recoverable by the model (try a different command, read a different file, etc.). Unexpected errors (e.g. a network error communicating with a remote service or container runtime) on the other hand are not automatically handled and result in the Sample failing with an error.
Many tools can simply rely on the default handling to provide reasonable behaviour around both expected and unexpected errors.
When we say that the errors are reported directly to the model, this refers to the behaviour when using the default generate(). If, on the other hand, you have created custom scaffolding for an agent, you can intercept tool errors and apply additional filtering and logic.
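The split between expected and unexpected errors can be sketched as follows. This is a simplified model of the behaviour described above, not Inspect's code: the `ToolError` class here is a local stand-in, `call_tool` is a hypothetical helper, and the Inspect-specific OutputLimitExceededError is omitted to keep the example self-contained.

```python
import asyncio


class ToolError(Exception):
    """Local stand-in for a tool-reported error (not imported from Inspect)."""


# Errors treated as expected and reported back to the model in this sketch.
EXPECTED_ERRORS = (TimeoutError, PermissionError, UnicodeDecodeError, ToolError)


async def call_tool(func, **arguments) -> dict:
    """Run a tool and convert expected errors into a message for the model."""
    try:
        return {"result": await func(**arguments), "error": None}
    except EXPECTED_ERRORS as ex:
        return {"result": None, "error": f"{type(ex).__name__}: {ex}"}
    # Any other exception propagates and would fail the sample.


async def read_secret(path: str) -> str:
    """Simulated tool that hits an expected, recoverable error."""
    raise PermissionError(f"cannot read {path}")


async def broken_service() -> str:
    """Simulated tool that hits an unexpected infrastructure error."""
    raise ConnectionError("container runtime unreachable")


reported = asyncio.run(call_tool(read_secret, path="/etc/shadow"))
```

Calling `call_tool(broken_service)` instead raises ConnectionError to the caller, which corresponds to the sample failing rather than the model being asked to recover.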
Explicit Handling
In some cases a tool can implement a recovery strategy for error conditions. For example, an HTTP request might fail due to transient network issues, and retrying the request (perhaps after a delay) may resolve the problem. Explicit error handling strategies are generally applied when there are expected errors that are not already handled by Inspect’s Default Handling.
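A retry strategy of the kind mentioned above might look like the following sketch. It uses only the standard library, and the `with_retries` helper, its parameters, and the simulated `flaky_fetch` request are hypothetical names chosen for illustration.

```python
import asyncio


async def with_retries(operation, attempts: int = 3, delay: float = 0.01):
    """Await operation(), retrying on ConnectionError with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await operation()
        except ConnectionError:
            if attempt == attempts:
                raise  # out of attempts: let the error propagate
            await asyncio.sleep(delay * 2 ** (attempt - 1))


calls = {"count": 0}


async def flaky_fetch() -> str:
    """Simulated request that fails twice before succeeding."""
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("transient network failure")
    return "ok"


result = asyncio.run(with_retries(flaky_fetch))
```

Once the attempts are exhausted the original error is re-raised, so a persistent failure still surfaces as an unexpected error.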
Another type of explicit handling is re-raising an error to bypass Inspect’s default handling. For example, here we catch and re-raise TimeoutError so that it fails the Sample:
try:
    result = await sandbox().exec(
        cmd=["decode", file],
        timeout=timeout
    )
except TimeoutError:
    raise RuntimeError("Decode operation timed out.")
Sandboxing
Tools may need to interact with a sandboxed environment (e.g. to provide models with the ability to execute arbitrary bash or python commands). The active sandbox environment can be obtained via the sandbox() function. For example:
from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute
The following instance methods are available to tools that need to interact with a SandboxEnvironment:
class SandboxEnvironment:

    async def exec(
        self,
        cmd: list[str],
        input: str | bytes | None = None,
        cwd: str | None = None,
        env: dict[str, str] = {},
        user: str | None = None,
        timeout: int | None = None,
    ) -> ExecResult[str]:
        """
        Raises:
          TimeoutError: If the specified `timeout` expires.
          UnicodeDecodeError: If an error occurs while
            decoding the command output.
          PermissionError: If the user does not have
            permission to execute the command.
          OutputLimitExceededError: If an output stream
            exceeds the 10 MiB limit.
        """
        ...

    async def write_file(
        self, file: str, contents: str | bytes
    ) -> None:
        """
        Raises:
          PermissionError: If the user does not have
            permission to write to the specified path.
          IsADirectoryError: If the file exists already and
            is a directory.
        """
        ...

    async def read_file(
        self, file: str, text: bool = True
    ) -> Union[str | bytes]:
        """
        Raises:
          FileNotFoundError: If the file does not exist.
          UnicodeDecodeError: If an encoding error occurs
            while reading the file.
            (only applicable when `text = True`)
          PermissionError: If the user does not have
            permission to read from the specified path.
          IsADirectoryError: If the file is a directory.
          OutputLimitExceededError: If the file size
            exceeds the 100 MiB limit.
        """
        ...

    async def connection(self) -> SandboxConnection:
        """
        Raises:
          NotImplementedError: For sandboxes that don't provide connections
          ConnectionError: If sandbox is not currently running.
        """
The read_file() method should preserve newline constructs (e.g. CRLF should be preserved, not converted to LF). This is equivalent to specifying newline="" in a call to the Python open() function. Note that write_file() automatically creates parent directories as required if they don’t exist.
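The effect of `newline=""` can be checked with the standard library alone. The short example below writes a file with CRLF line endings to a temporary directory and reads it back both ways; the file name is arbitrary.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "crlf.txt")
    with open(path, "wb") as f:
        f.write(b"first\r\nsecond\r\n")

    # Default text mode translates "\r\n" to "\n" on read.
    with open(path, "r") as f:
        translated = f.read()

    # newline="" disables translation, so the original line endings survive.
    with open(path, "r", newline="") as f:
        preserved = f.read()
```

Here `translated` contains LF line endings while `preserved` keeps the CRLF sequences, which is the behaviour expected of read_file().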
The connection() method is optional, and provides commands that can be used to log in to the sandbox container from a terminal or IDE.
For each method there is a documented set of errors that are raised: these are expected errors and can either be caught by tools or allowed to propagate, in which case they will be reported to the model for potential recovery. In addition, unexpected errors may occur (e.g. a networking error connecting to a remote container): these errors are not reported to the model and fail the Sample with an error state.
See the documentation on Sandbox Environments for additional details.
Tool Choice
By default, models will use a tool if they think it’s appropriate for the given task. You can override this behaviour using the tool_choice parameter of the use_tools() Solver. For example:
# let the model decide whether to use the tool
use_tools(addition(), tool_choice="auto")

# force the use of a tool
use_tools(addition(), tool_choice=ToolFunction(name="addition"))

# prevent use of tools
use_tools(addition(), tool_choice="none")
The last form (tool_choice="none") would typically be used to turn off tool usage after an initial generation where the tool is used. For example:
solver = [
    use_tools(addition(), tool_choice=ToolFunction(name="addition")),
    generate(),
    follow_up_prompt(),
    use_tools(tool_choice="none"),
    generate()
]
Tool Descriptions
Well crafted tools should include descriptions that provide models with the context required to use them correctly and productively. If you will be developing custom tools it’s worth taking some time to learn how to provide good tool definitions. Here are some resources you may find helpful:
In some cases you may want to change the default descriptions created by a tool author. For example, you might want to provide better disambiguation between multiple similar tools that are used together. You might also need to do this while developing tools (to explore which descriptions are most useful to models).
The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:
from inspect_ai.tool import tool_with

my_add = tool_with(
    tool=add(),
    name="my_add",
    description="a tool to add numbers",
    parameters={
        "x": "the x argument",
        "y": "the y argument"
    })
You need not provide all of the parameters shown above. For example, here we modify just the main tool description or only a single parameter:
my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})
Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).
Dynamic Tools
As described above, normally tools are defined using @tool decorators and documentation comments. It’s also possible to create a tool dynamically from any function by creating a ToolDef. For example:
from inspect_ai.solver import use_tools
from inspect_ai.tool import ToolDef

async def addition(x: int, y: int):
    return x + y

add = ToolDef(
    tool=addition,
    name="add",
    description="A tool to add numbers",
    parameters={
        "x": "the x argument",
        "y": "the y argument"
    }
)

use_tools([add])
This is effectively what happens under the hood when you use the @tool decorator. There is one critical requirement for functions that are bound to tools using ToolDef: type annotations must be provided in the function signature (e.g. x: int, y: int).
For Inspect APIs, ToolDef can generally be used anywhere that Tool can be used (use_tools(), setting state.tools, etc.). If you are using a 3rd party API that does not take Tool in its interface, use the ToolDef.as_tool() method to adapt it. For example:
from inspect_agents import my_agent

agent = my_agent(tools=[add.as_tool()])
If, on the other hand, you want to get the ToolDef for an existing tool (e.g. to discover its name, description, and parameters), you can just pass the Tool to the ToolDef constructor (including whatever overrides for name, etc. you want):
from inspect_ai.tool import ToolDef, bash

bash_def = ToolDef(bash())
Parallel Tool Calls
Models will often provide multiple tool calls to evaluate. By default, Inspect executes these tool calls in parallel. While this can provide a performance improvement, it might not be compatible with semantics of some tools (for example, if they manage some global state between calls).
You can opt out of parallel tool calling by adding parallel=False to the @tool decorator. For example, the built-in web browsing tools do this as follows:
@tool(parallel=False)
def web_browser_go() -> Tool:
    ...
Specifying parallel=False results in two behaviours:
Models that support turning off parallel tool calling (currently OpenAI and Grok) will have it disabled when tools with parallel=False are passed to generate().
Inspect will execute tool calls serially (so that even for models that don’t let you disable parallel tool calling, you can still be assured they will not ever run in parallel).
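The difference between the two execution modes can be sketched with asyncio. This is an illustration of the general idea rather than Inspect's scheduler: `run_tool_calls` and `make_call` are hypothetical helpers, and the fake tool calls simply record when they start and finish.

```python
import asyncio


async def run_tool_calls(calls, parallel: bool = True) -> list:
    """Run zero-argument async callables concurrently or one at a time."""
    if parallel:
        return list(await asyncio.gather(*(call() for call in calls)))
    results = []
    for call in calls:
        results.append(await call())
    return results


def make_call(name: str, log: list):
    """Build a fake tool call that records when it starts and finishes."""
    async def call() -> str:
        log.append(f"start {name}")
        await asyncio.sleep(0.01)
        log.append(f"end {name}")
        return name
    return call


parallel_log: list = []
serial_log: list = []
asyncio.run(run_tool_calls(
    [make_call("a", parallel_log), make_call("b", parallel_log)]
))
asyncio.run(run_tool_calls(
    [make_call("a", serial_log), make_call("b", serial_log)], parallel=False
))
```

In the parallel log both calls start before either finishes, whereas in the serial log each call completes before the next begins. The interleaving in the first case is what can break tools that share global state.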
Bash and Python
The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Sandbox Environment for the execution of untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:
from inspect_ai.tool import bash, python

CMD_TIMEOUT = 180

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        solver=[
            system_message("system.txt"),
            use_tools([
                bash(CMD_TIMEOUT),
                python(CMD_TIMEOUT)
            ]),
            generate(),
        ],
        scorer=includes(),
        message_limit=30,
        sandbox="docker",
    )
We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don’t perform extremely long running operations.
See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.
Web Browser
The web browser tools provide models with the ability to browse the web using a headless Chromium browser. Navigation, history, and mouse/keyboard interactions are all supported.
Configuration
Under the hood, the web browser is an instance of Chromium orchestrated by Playwright, and runs in its own dedicated Docker container. Therefore, to use the web_browser tool you should reference the aisiuk/inspect-web-browser-tool Docker image in your compose.yaml. For example, here we use it as our default image:
compose.yaml
services:
  default:
    image: aisiuk/inspect-web-browser-tool
    init: true
Here, we add a dedicated web_browser service:
compose.yaml
services:
  default:
    image: "python:3.12-bookworm"
    init: true
    command: "tail -f /dev/null"
  web_browser:
    image: aisiuk/inspect-web-browser-tool
    init: true
Rather than using the aisiuk/inspect-web-browser-tool image, you can also just include the web browser service components in a custom image (see Custom Images below for details).
Task Setup
A task configured to use the web browser tools might look like this:
from inspect_ai import Task, task
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import bash, python, web_browser

@task
def browser_task():
    return Task(
        dataset=read_dataset(),
        solver=[
            use_tools([bash(), python()] + web_browser()),
            generate(),
        ],
        scorer=match(),
        sandbox=("docker", "compose.yaml"),
    )
Note that unlike some other tool functions like bash(), the web_browser() function returns a list of tools. Therefore, we concatenate it with a list of the other tools we are using in the call to use_tools().
Browsing
If you review the transcripts of a sample with access to the web browser tool, you’ll notice that there are several distinct tools made available for control of the web browser. These tools include:
Tool | Description |
---|---|
web_browser_go(url) | Navigate the web browser to a URL. |
web_browser_click(element_id) | Click an element on the page currently displayed by the web browser. |
web_browser_type(element_id, text) | Type text into an input on a web browser page. |
web_browser_type_submit(element_id, text) | Type text into a form input on a web browser page and press ENTER to submit the form. |
web_browser_scroll(direction) | Scroll the web browser up or down by one page. |
web_browser_forward() | Navigate the web browser forward in the browser history. |
web_browser_back() | Navigate the web browser back in the browser history. |
web_browser_refresh() | Refresh the current page of the web browser. |
The return value of each of these tools is a web accessibility tree for the page, which provides a clean view of the content, links, and form fields available on the page (you can look at the accessibility tree for any web page using Chrome Developer Tools).
Disabling Interactions
You can use the web browser tools with page interactions disabled by specifying interactive=False, for example:
use_tools(web_browser(interactive=False))
In this mode, the interactive tools (web_browser_click(), web_browser_type(), and web_browser_type_submit()) are not made available to the model.
Custom Images
Above we demonstrated how to use the pre-configured Inspect web browser container. If you prefer to incorporate the headless web browser and its dependencies into another container, that is also supported.
To do this, reference the Dockerfile used in the built-in web browser container and ensure that the dependencies, application files, and server run command it uses are also in your container definition:
# Install playwright
RUN pip install playwright
RUN playwright install
RUN playwright install-deps
# Install other dependencies
RUN pip install dm-env-rpc pillow bs4 lxml
# Copy Python files alongside the Dockerfile
COPY *.py ./
# Run the server
CMD ["python3", "/app/web_browser/web_server.py"]
Note that all of the Python files in the _resources directory alongside the Dockerfile need to be available for copying when building the container.
Web Search
The web_search() tool provides models the ability to enhance their context window by performing a search. By default, a web search retrieves 10 results from a provider, uses a model to determine whether their contents are relevant, and then returns the top 3 relevant search results to the main model. Here is the definition of the web_search() function:
def web_search(
    provider: Literal["google"] = "google",
    num_results: int = 3,
    max_provider_calls: int = 3,
    max_connections: int = 10,
    model: str | Model | None = None,
) -> Tool:
    ...
You can use the web_search() tool like this:
from inspect_ai.tool import web_search

solver=[
    use_tools(web_search()),
    generate()
],
Web search options include:
provider: Web search provider (currently only Google is supported, see below for instructions on setup and configuration for Google).
num_results: How many search results to return to the main model (defaults to 3).
max_provider_calls: Number of times to retrieve more links from the search provider in case previous ones were irrelevant (defaults to 3).
max_connections: Maximum number of concurrent connections to the search API provider (defaults to 10).
model: Model to use to determine if search results are relevant (defaults to the model currently being evaluated).
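The way num_results and max_provider_calls interact can be sketched as a simple loop. This is only a model of the behaviour described above, and the real tool may differ in its details (for example in concurrency and summarisation). The `search_relevant` function, the fake provider, and the relevance check are all hypothetical.

```python
from typing import Callable


def search_relevant(
    fetch_page: Callable[[int], list],
    is_relevant: Callable[[str], bool],
    num_results: int = 3,
    max_provider_calls: int = 3,
) -> list:
    """Collect up to num_results relevant items, fetching further pages as needed."""
    relevant = []
    for page in range(max_provider_calls):
        for item in fetch_page(page):
            if is_relevant(item):
                relevant.append(item)
                if len(relevant) == num_results:
                    return relevant
    return relevant


def fake_provider(page: int) -> list:
    """Hypothetical provider: each page holds four fake result titles."""
    return [f"result {page * 4 + i}" for i in range(4)]


def even_only(title: str) -> bool:
    """Hypothetical relevance check standing in for the model's judgement."""
    return int(title.split()[-1]) % 2 == 0


top = search_relevant(fake_provider, even_only)
single_call = search_relevant(fake_provider, even_only, max_provider_calls=1)
```

With the defaults the first page yields only two relevant results, so a second provider call is made to find the third. Limiting the search to one provider call returns just the two results found on the first page.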
Google Provider
The web_search() tool uses Google Programmable Search Engine. To use it you will therefore need to set up your own Google Programmable Search Engine and also enable the Programmable Search Element Paid API. Then, ensure that the following environment variables are defined:
GOOGLE_CSE_ID: Google Custom Search Engine ID
GOOGLE_CSE_API_KEY: Google API key used to enable the Search API
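Before running an evaluation it can be useful to confirm that both variables are set. The check below is a small convenience sketch and not part of Inspect; the `missing_google_vars` helper is a hypothetical name.

```python
import os

REQUIRED_VARS = ("GOOGLE_CSE_ID", "GOOGLE_CSE_API_KEY")


def missing_google_vars(environ=os.environ) -> list:
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not environ.get(name)]


missing = missing_google_vars()
if missing:
    print("Missing environment variables: " + ", ".join(missing))
```

Passing a plain dictionary instead of `os.environ` makes the check easy to exercise in tests.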