Tools

Overview

Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks.

Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools (bash, python, and web_search).

Tools and Agents

One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. This is covered in more depth in the Agents section.

Tool Basics

To demonstrate the use of tools, we’ll define a simple tool that adds two numbers, using the @tool decorator to register it with the system.

from inspect_ai.tool import tool

@tool
def add():
    async def execute(x: int, y: int):
        """
        Add two numbers.

        Args:
            x: First number to add.
            y: Second number to add.

        Returns:
            The sum of the two numbers.
        """
        return x + y

    return execute

Annotations

Note that we provide type annotations for both arguments:

async def execute(x: int, y: int)

Further, we provide descriptions for each parameter in the documentation comment:

Args:
    x: First number to add.
    y: Second number to add.

Type annotations and descriptions are required for tool declarations so that the model can be informed which types to pass back to the tool function and what the purpose of each parameter is.
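
Together, the annotations and descriptions are compiled into a schema that is passed to the model along with the tool name. Conceptually, the model sees something like the following for the add tool (this is an illustrative sketch; the exact wire format varies by provider):

# illustrative only: the exact schema format varies by model provider
add_schema = {
    "name": "add",
    "description": "Add two numbers.",
    "parameters": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "First number to add."},
            "y": {"type": "integer", "description": "Second number to add."}
        },
        "required": ["x", "y"]
    }
}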

Note that while you are required to provide default descriptions for tools and their parameters within doc comments, you can also make these dynamically customisable by users of your tool (see the section below on Tool Descriptions for details on how to do this).

Using Tools

We can use this tool in an evaluation by passing it to the use_tools() Solver:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools

@task
def addition_problem():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target=["2"])],
        plan=[
            use_tools(add()), 
            generate()
        ],
        scorer=match(numeric=True),
    )
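
You can then run the task as usual, either with the inspect eval CLI command or from Python. For example (the model name here is illustrative):

from inspect_ai import eval

eval(addition_problem(), model="openai/gpt-4")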

Note that this tool doesn't make network requests or do heavy computation, so it is fine to run as inline Python code. If your tool does more elaborate things, you'll want to make sure it plays well with Inspect's concurrency scheme. For network requests, this amounts to using async HTTP calls with httpx. For heavier computation, tools should use subprocesses as described in the next section.
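
For example, here is a sketch of a hypothetical tool that makes a network request without blocking Inspect's event loop:

import httpx

from inspect_ai.tool import tool

@tool
def fetch_url():
    async def execute(url: str):
        """Fetch the text contents of a URL.

        Args:
            url: URL to fetch.

        Returns:
            The response body as text.
        """
        # an async client yields to other samples while waiting on I/O
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()
            return response.text

    return execute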

Note that when using tools with models, the models do not call the Python function directly. Rather, the model generates a structured request which includes function parameters, and then Inspect calls the function and returns the result to the model.
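
For example, when using the add tool the model emits a request conceptually like the one below; Inspect then calls add(1, 1) and returns the result to the model as a new message (the field names are illustrative rather than Inspect's exact internal types):

tool_call = {
    "id": "call_0",                  # provider-assigned call identifier
    "function": "add",               # name of the registered tool to invoke
    "arguments": {"x": 1, "y": 1}    # arguments parsed from the model's request
}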

Tool Errors

Various errors can occur during tool execution, especially when interacting with the file system or network or when using Sandbox Environments to execute code in a container sandbox. As a tool writer you need to decide how you’d like to handle error conditions. A number of approaches are possible:

  1. Notify the model that an error occurred to see whether it can recover.

  2. Catch and handle the error internally (trying another code path, etc.).

  3. Allow the error to propagate, resulting in the current Sample failing with an error state.

There are no universally correct approaches as tool usage and semantics can vary widely—some rough guidelines are provided below.

Default Handling

If you do not explicitly handle errors, then Inspect provides some default error handling behaviour. Specifically, if any of the following errors are raised they will be handled and reported to the model:

  • TimeoutError — Occurs when a call to subprocess() or sandbox().exec() times out.

  • PermissionError — Occurs when there are inadequate permissions to read or write a file.

  • UnicodeDecodeError — Occurs when the output from executing a process or reading a file is binary rather than text.

  • ToolError — Special error thrown by tools to indicate they’d like to report an error to the model.

These are all errors that are expected (in fact the SandboxEnvironment interface documents them as such) and possibly recoverable by the model (try a different command, read a different file, etc.). Unexpected errors (e.g. a network error communicating with a remote service or container runtime), on the other hand, are not automatically handled and result in the Sample failing with an error.

Many tools can simply rely on the default handling to provide reasonable behaviour around both expected and unexpected errors.

When we say that the errors are reported directly to the model, this refers to the behaviour when using the default generate(). If, on the other hand, you have created custom scaffolding for an agent, you can intercept tool errors and apply additional filtering and logic.
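
For example, here is a rough sketch of such a scaffold (it assumes the call_tools() helper from inspect_ai.model and the error field on tool messages; check both against your version of Inspect):

from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import solver

@solver
def filtered_agent():
    async def solve(state, generate):
        model = get_model()
        while True:
            # generate using the tools registered on the task state
            output = await model.generate(state.messages, state.tools)
            state.messages.append(output.message)
            if not output.message.tool_calls:
                break
            # execute the requested tool calls
            tool_messages = await call_tools(output.message, state.tools)
            for message in tool_messages:
                if message.error is not None:
                    # apply additional filtering/logic to the error here
                    ...
            state.messages.extend(tool_messages)
        return state

    return solve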

Explicit Handling

In some cases a tool can implement a recovery strategy for error conditions. For example, an HTTP request might fail due to transient network issues, and retrying the request (perhaps after a delay) may resolve the problem. Explicit error handling strategies are generally applied when there are expected errors that are not already handled by Inspect’s Default Handling.
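
For example, here is a minimal sketch of such a retry strategy using httpx (the backoff policy is illustrative only):

import asyncio

import httpx

async def get_with_retry(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(url)
                response.raise_for_status()
                return response.text
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == retries - 1:
                raise  # out of retries: let the error propagate
            # back off briefly before trying again
            await asyncio.sleep(2 ** attempt)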

Another type of explicit handling is re-raising an error to bypass Inspect's default handling. For example, here we catch TimeoutError and re-raise it as a RuntimeError so that it fails the Sample:

try:
  result = await sandbox().exec(
    cmd=["decode", file], 
    timeout=timeout
  )
except TimeoutError:
  raise RuntimeError("Decode operation timed out.")
  

Sandboxing

Tools may need to interact with a sandboxed environment (e.g. to provide models with the ability to execute arbitrary bash or python commands). The active sandbox environment can be obtained via the sandbox() function. For example:

from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute

The following instance methods are available to tools that need to interact with a SandboxEnvironment:

class SandboxEnvironment:
   
    async def exec(
        self,
        cmd: list[str],
        input: str | bytes | None = None,
        cwd: str | None = None,
        env: dict[str, str] = {},
        user: str | None = None,
        timeout: int | None = None,
    ) -> ExecResult[str]:
        """
        Raises:
          TimeoutError: If the specified `timeout` expires.
          UnicodeDecodeError: If an error occurs while
            decoding the command output.
          PermissionError: If the user does not have
            permission to execute the command.
        """
        ...

    async def write_file(
        self, file: str, contents: str | bytes
    ) -> None:
        """
        Raises:
          PermissionError: If the user does not have
            permission to write to the specified path.
          IsADirectoryError: If the file exists already and 
            is a directory.
        """
        ...

    async def read_file(
        self, file: str, text: bool = True
    ) -> str | bytes:
        """
        Raises:
          FileNotFoundError: If the file does not exist.
          UnicodeDecodeError: If an encoding error occurs 
            while reading the file.
            (only applicable when `text = True`)
          PermissionError: If the user does not have
            permission to read from the specified path.
          IsADirectoryError: If the file is a directory.
        """
        ...

Note that write_file() automatically creates parent directories as required if they don’t exist.

For each method there is a documented set of errors that are raised: these are expected errors and can either be caught by tools or allowed to propagate in which case they will be reported to the model for potential recovery. In addition, unexpected errors may occur (e.g. a networking error connecting to a remote container): these errors are not reported to the model and fail the Sample with an error state.
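
For example, here is a sketch of a hypothetical tool that treats FileNotFoundError as an expected condition rather than letting it propagate (the notes.txt file name is purely illustrative):

from inspect_ai.tool import tool
from inspect_ai.util import sandbox

@tool
def save_notes():
    async def execute(notes: str):
        """Append notes to a scratch file in the sandbox.

        Args:
            notes: Text to append to the notes file.

        Returns:
            The accumulated contents of the notes file.
        """
        try:
            existing = await sandbox().read_file("notes.txt")
        except FileNotFoundError:
            existing = ""  # expected on first use: no notes yet
        contents = existing + notes + "\n"
        await sandbox().write_file("notes.txt", contents)
        return contents

    return execute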

See the documentation on Sandbox Environments for additional details.

Tool Choice

By default models will use a tool if they think it’s appropriate for the given task. You can override this behaviour using the tool_choice parameter of the use_tools() Solver. For example:

# let the model decide whether to use the tool
use_tools(add(), tool_choice="auto")

# force the use of a tool
use_tools(add(), tool_choice=ToolFunction(name="add"))

# prevent use of tools
use_tools(add(), tool_choice="none")

The last form (tool_choice="none") would typically be used to turn off tool usage after an initial generation where the tool was used. For example:

plan = [
  use_tools(add(), tool_choice=ToolFunction(name="add")),
  generate(),
  follow_up_prompt(),
  use_tools(tool_choice="none"),
  generate()
]

Tool Descriptions

Well crafted tools should include descriptions that provide models with the context required to use them correctly and productively. If you are developing custom tools, it's worth taking some time to learn how to provide good tool definitions.

In some cases you may want to change the default descriptions created by a tool author (for example, to provide better disambiguation between multiple similar tools that are used together). You might also need to do this during development of tools (to explore which descriptions are most useful to models).

The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:

from inspect_ai.tool import tool_with

my_add = tool_with(
  tool=add(), 
  name="my_add",
  description="a tool to add numbers", 
  parameters={
    "x": "the x argument",
    "y": "the y argument"
  })

You need not provide all of the parameters shown above. For example, here we modify just the main tool description, or only a single parameter:

my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})

Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).
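
This means you can safely create several differently-described copies of the same tool. For example (the names and descriptions here are illustrative):

add_ints = tool_with(add(), name="add_integers",
                     description="Add two whole numbers.")
add_prices = tool_with(add(), name="add_prices",
                       description="Add two prices in dollars.")

# both adapted copies can be offered to the model together
use_tools([add_ints, add_prices])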

Built-In Tools

Inspect has several built-in tools, including:

  • web_search(), which uses the Google Search API to execute and summarise web searches.

  • bash() and python(), for executing arbitrary shell and Python code.
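
For example, here is how you might make web search available to a model (the web_search() tool requires Google Search API credentials to be configured in your environment):

from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import web_search

plan = [
  use_tools(web_search()),
  generate()
]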

Bash and Python

The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Sandbox Environment for the execution of untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:

from inspect_ai.tool import bash, python

CMD_TIMEOUT = 180

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
            system_message("system.txt"),
            use_tools([
                bash(CMD_TIMEOUT), 
                python(CMD_TIMEOUT)
            ]),
            generate(),
        ],
        scorer=includes(),
        max_messages=30,
        sandbox="docker",
    )

We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don't perform extremely long-running operations.

See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.