Tools

Overview

Many models now have the ability to interact with client-side Python functions in order to expand their capabilities. This enables you to equip models with your own set of custom tools so they can perform a wider variety of tasks.

Inspect natively supports registering Python functions as tools and providing these tools to models that support them (currently OpenAI, Claude 3, Google Gemini, and Mistral). Inspect also includes several built-in tools (bash, python, and web_search).

Tools and Agents

One application of tools is to run them within an agent scaffold that pursues an objective over multiple interactions with a model. The scaffold uses the model to help make decisions about which tools to use and when, and orchestrates calls to the model to use the tools. This is covered in more depth in the Agents section.

Tool Basics

To demonstrate the use of tools, we’ll define a simple tool that adds two numbers, using the @tool decorator to register it with the system.

from inspect_ai.tool import tool

@tool
def add():
    async def execute(x: int, y: int):
        """
        Add two numbers.

        Args:
            x: First number to add.
            y: Second number to add.

        Returns:
            The sum of the two numbers.
        """
        return x + y

    return execute

Annotations

Note that we provide type annotations for both arguments:

async def execute(x: int, y: int)

Further, we provide descriptions for each parameter in the documentation comment:

Args:
    x: First number to add.
    y: Second number to add.

Type annotations and descriptions are required for tool declarations so that the model can be informed which types to pass back to the tool function and what the purpose of each parameter is.
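
Together, the annotations and descriptions are compiled into a schema that is passed to the model along with the tool name. Conceptually, the model sees something like the following for the add tool (this is an illustrative sketch; the exact wire format varies by provider):

# illustrative only: the exact schema format varies by model provider
add_schema = {
    "name": "add",
    "description": "Add two numbers.",
    "parameters": {
        "type": "object",
        "properties": {
            "x": {"type": "integer", "description": "First number to add."},
            "y": {"type": "integer", "description": "Second number to add."}
        },
        "required": ["x", "y"]
    }
}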

Note that while you are required to provide default descriptions for tools and their parameters within doc comments, you can also make these dynamically customisable by users of your tool (see the section below on Tool Descriptions for details on how to do this).

Using Tools

We can use this tool in an evaluation by passing it to the use_tools() Solver:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate, use_tools

@task
def addition_problem():
    return Task(
        dataset=[Sample(input="What is 1 + 1?", target=["2"])],
        plan=[
            use_tools(add()), 
            generate()
        ],
        scorer=match(numeric=True),
    )
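
You can then run the task as usual, either with the inspect eval CLI command or from Python. For example (the model name here is illustrative):

from inspect_ai import eval

eval(addition_problem(), model="openai/gpt-4")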

Note that this tool doesn't make network requests or do heavy computation, so it is fine to run as inline Python code. If your tool does more elaborate things, you'll want to make sure it plays well with Inspect's concurrency scheme. For network requests, this amounts to using async HTTP calls with httpx. For heavier computation, tools should use subprocesses as described in the next section.
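
For example, here is a sketch of a hypothetical tool that makes a network request without blocking Inspect's event loop:

import httpx

from inspect_ai.tool import tool

@tool
def fetch_url():
    async def execute(url: str):
        """Fetch the text contents of a URL.

        Args:
            url: URL to fetch.

        Returns:
            The response body as text.
        """
        # an async client yields to other samples while waiting on I/O
        async with httpx.AsyncClient() as client:
            response = await client.get(url)
            response.raise_for_status()
            return response.text

    return execute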

Note that when using tools with models, the models do not call the Python function directly. Rather, the model generates a structured request which includes function parameters, and then Inspect calls the function and returns the result to the model.
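
For example, when using the add tool the model emits a request conceptually like the one below; Inspect then calls add(1, 1) and returns the result to the model as a new message (the field names are illustrative rather than Inspect's exact internal types):

tool_call = {
    "id": "call_0",                  # provider-assigned call identifier
    "function": "add",               # name of the registered tool to invoke
    "arguments": {"x": 1, "y": 1}    # arguments parsed from the model's request
}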

Tool Errors

Various errors can occur during tool execution, especially when interacting with the file system or network or when using Sandbox Environments to execute code in a container sandbox. As a tool writer you need to decide how you’d like to handle error conditions. A number of approaches are possible:

  1. Notify the model that an error occurred to see whether it can recover.

  2. Catch and handle the error internally (trying another code path, etc.).

  3. Allow the error to propagate, resulting in the current Sample failing with an error state.

There are no universally correct approaches as tool usage and semantics can vary widely—some rough guidelines are provided below.

Default Handling

If you do not explicitly handle errors, then Inspect provides some default error handling behaviour. Specifically, if any of the following errors are raised they will be handled and reported to the model:

  • TimeoutError — Occurs when a call to subprocess() or sandbox().exec() times out.

  • PermissionError — Occurs when there are inadequate permissions to read or write a file.

  • UnicodeDecodeError — Occurs when the output from executing a process or reading a file is binary rather than text.

  • ToolError — Special error thrown by tools to indicate they’d like to report an error to the model.

These are all errors that are expected (in fact the SandboxEnvironment interface documents them as such) and possibly recoverable by the model (try a different command, read a different file, etc.). Unexpected errors (e.g. a network error communicating with a remote service or container runtime), on the other hand, are not automatically handled and result in the Sample failing with an error.

Many tools can simply rely on the default handling to provide reasonable behaviour around both expected and unexpected errors.

When we say that the errors are reported directly to the model, this refers to the behaviour when using the default generate(). If, on the other hand, you have created custom scaffolding for an agent, you can intercept tool errors and apply additional filtering and logic.
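
For example, here is a rough sketch of such a scaffold (it assumes the call_tools() helper from inspect_ai.model and the error field on tool messages; check both against your version of Inspect):

from inspect_ai.model import call_tools, get_model
from inspect_ai.solver import solver

@solver
def filtered_agent():
    async def solve(state, generate):
        model = get_model()
        while True:
            # generate using the tools registered on the task state
            output = await model.generate(state.messages, state.tools)
            state.messages.append(output.message)
            if not output.message.tool_calls:
                break
            # execute the requested tool calls
            tool_messages = await call_tools(output.message, state.tools)
            for message in tool_messages:
                if message.error is not None:
                    # apply additional filtering/logic to the error here
                    ...
            state.messages.extend(tool_messages)
        return state

    return solve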

Explicit Handling

In some cases a tool can implement a recovery strategy for error conditions. For example, an HTTP request might fail due to transient network issues, and retrying the request (perhaps after a delay) may resolve the problem. Explicit error handling strategies are generally applied when there are expected errors that are not already handled by Inspect’s Default Handling.
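
For example, here is a minimal sketch of such a retry strategy using httpx (the backoff policy is illustrative only):

import asyncio

import httpx

async def get_with_retry(url: str, retries: int = 3) -> str:
    for attempt in range(retries):
        try:
            async with httpx.AsyncClient() as client:
                response = await client.get(url)
                response.raise_for_status()
                return response.text
        except (httpx.TransportError, httpx.HTTPStatusError):
            if attempt == retries - 1:
                raise  # out of retries: let the error propagate
            # back off briefly before trying again
            await asyncio.sleep(2 ** attempt)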

Another type of explicit handling is re-raising an error to bypass Inspect's default handling. For example, here we catch TimeoutError and re-raise it as a RuntimeError so that it fails the Sample:

try:
  result = await sandbox().exec(
    cmd=["decode", file], 
    timeout=timeout
  )
except TimeoutError:
  raise RuntimeError("Decode operation timed out.")
  

Sandboxing

Tools may need to interact with a sandboxed environment (e.g. to provide models with the ability to execute arbitrary bash or python commands). The active sandbox environment can be obtained via the sandbox() function. For example:

from inspect_ai.tool import ToolError, tool
from inspect_ai.util import sandbox

@tool
def list_files():
    async def execute(dir: str):
        """List the files in a directory.

        Args:
            dir (str): Directory

        Returns:
            File listing of the directory
        """
        result = await sandbox().exec(["ls", dir])
        if result.success:
            return result.stdout
        else:
            raise ToolError(result.stderr)

    return execute

The following instance methods are available to tools that need to interact with a SandboxEnvironment:

class SandboxEnvironment:
   
    async def exec(
        self,
        cmd: list[str],
        input: str | bytes | None = None,
        cwd: str | None = None,
        env: dict[str, str] = {},
        user: str | None = None,
        timeout: int | None = None,
    ) -> ExecResult[str]:
        """
        Raises:
          TimeoutError: If the specified `timeout` expires.
          UnicodeDecodeError: If an error occurs while
            decoding the command output.
          PermissionError: If the user does not have
            permission to execute the command.
        """
        ...

    async def write_file(
        self, file: str, contents: str | bytes
    ) -> None:
        """
        Raises:
          PermissionError: If the user does not have
            permission to write to the specified path.
          IsADirectoryError: If the file exists already and 
            is a directory.
        """
        ...

    async def read_file(
        self, file: str, text: bool = True
    ) -> str | bytes:
        """
        Raises:
          FileNotFoundError: If the file does not exist.
          UnicodeDecodeError: If an encoding error occurs 
            while reading the file.
            (only applicable when `text = True`)
          PermissionError: If the user does not have
            permission to read from the specified path.
          IsADirectoryError: If the file is a directory.
        """
        ...

Note that write_file() automatically creates parent directories as required if they don’t exist.

For each method there is a documented set of errors that are raised: these are expected errors and can either be caught by tools or allowed to propagate in which case they will be reported to the model for potential recovery. In addition, unexpected errors may occur (e.g. a networking error connecting to a remote container): these errors are not reported to the model and fail the Sample with an error state.
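
For example, here is a sketch of a hypothetical tool that treats FileNotFoundError as an expected condition rather than letting it propagate (the notes.txt file name is purely illustrative):

from inspect_ai.tool import tool
from inspect_ai.util import sandbox

@tool
def save_notes():
    async def execute(notes: str):
        """Append notes to a scratch file in the sandbox.

        Args:
            notes: Text to append to the notes file.

        Returns:
            The accumulated contents of the notes file.
        """
        try:
            existing = await sandbox().read_file("notes.txt")
        except FileNotFoundError:
            existing = ""  # expected on first use: no notes yet
        contents = existing + notes + "\n"
        await sandbox().write_file("notes.txt", contents)
        return contents

    return execute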

See the documentation on Sandbox Environments for additional details.

Tool Choice

By default models will use a tool if they think it’s appropriate for the given task. You can override this behaviour using the tool_choice parameter of the use_tools() Solver. For example:

# let the model decide whether to use the tool
use_tools(add(), tool_choice="auto")

# force the use of a tool
use_tools(add(), tool_choice=ToolFunction(name="add"))

# prevent use of tools
use_tools(add(), tool_choice="none")

The last form (tool_choice="none") would typically be used to turn off tool usage after an initial generation where the tool was used. For example:

plan = [
  use_tools(add(), tool_choice=ToolFunction(name="add")),
  generate(),
  follow_up_prompt(),
  use_tools(tool_choice="none"),
  generate()
]

Tool Descriptions

Well crafted tools should include descriptions that provide models with the context required to use them correctly and productively. If you are developing custom tools, it's worth taking some time to learn how to provide good tool definitions.

In some cases you may want to change the default descriptions created by a tool author (for example, to provide better disambiguation between multiple similar tools that are used together). You might also need to do this during development of tools (to explore which descriptions are most useful to models).

The tool_with() function enables you to take any tool and adapt its name and/or descriptions. For example:

from inspect_ai.tool import tool_with

my_add = tool_with(
  tool=add(), 
  name="my_add",
  description="a tool to add numbers", 
  parameters={
    "x": "the x argument",
    "y": "the y argument"
  })

You need not provide all of the parameters shown above. For example, here we modify just the main tool description, or only a single parameter:

my_add = tool_with(add(), description="a tool to add numbers")
my_add = tool_with(add(), parameters={"x": "the x argument"})

Note that the tool_with() function returns a copy of the passed tool with modified descriptions (the passed tool retains its original descriptions).
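
This means you can safely create several differently-described copies of the same tool. For example (the names and descriptions here are illustrative):

add_ints = tool_with(add(), name="add_integers",
                     description="Add two whole numbers.")
add_prices = tool_with(add(), name="add_prices",
                       description="Add two prices in dollars.")

# both adapted copies can be offered to the model together
use_tools([add_ints, add_prices])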

Built-In Tools

Inspect has several built-in tools, including:

  • web_search(), which uses the Google Search API to execute and summarise web searches.

  • bash() and python(), for executing arbitrary shell and Python code.
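
For example, here is how you might make web search available to a model (the web_search() tool requires Google Search API credentials to be configured in your environment):

from inspect_ai.solver import generate, use_tools
from inspect_ai.tool import web_search

plan = [
  use_tools(web_search()),
  generate()
]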

Bash and Python

The bash() and python() tools enable execution of arbitrary shell commands and Python code, respectively. These tools require the use of a Sandbox Environment for the execution of untrusted code. For example, here is how you might use them in an evaluation where the model is asked to write code in order to solve capture the flag (CTF) challenges:

from inspect_ai.tool import bash, python

CMD_TIMEOUT = 180

@task
def intercode_ctf():
    return Task(
        dataset=read_dataset(),
        plan=[
            system_message("system.txt"),
            use_tools([
                bash(CMD_TIMEOUT), 
                python(CMD_TIMEOUT)
            ]),
            generate(),
        ],
        scorer=includes(),
        max_messages=30,
        sandbox="docker",
    )

We specify a 3-minute timeout for execution of the bash and python tools to ensure that they don't perform extremely long-running operations.

See the Agents section for more details on how to build evaluations that allow models to take arbitrary actions over a longer time horizon.