Solvers

Overview

Solvers are the heart of Inspect evaluations and can serve a wide variety of purposes, including:

  1. Providing system prompts
  2. Prompt engineering (e.g. chain of thought)
  3. Model generation
  4. Self critique
  5. Multi-turn dialog
  6. Running an agent scaffold

Tasks have a single top-level solver that defines an execution plan. This solver could be implemented with arbitrary Python code (calling the model as required) or could consist of a set of other solvers composed together. Solvers can therefore play two differnet roles:

  1. Composite specifications for task execution; and

  2. Components that can be chained together.

Example

Here’s an example task definition that composes a few standard solver components:

@task
def theory_of_mind():
    return Task(
        dataset=json_dataset("theory_of_mind.jsonl"),
        solver=[
            system_message("system.txt"),
            prompt_template("prompt.txt"),
            generate(),
            self_critique()
        ],
        scorer=model_graded_fact(),
    )

In this example we pass a list of solver components directly to the Task. More often, though we’ll wrap our solvers in an @solver decorated function to create a composite solver:

@solver
def critique(
    system_prompt = "system.txt",
    user_prompt = "prompt.txt",
):
    return chain(
        system_message(system_prompt),
        prompt_template(user_prompt),
        generate(),
        self_critique()
    )

@task
def theory_of_mind():
    return Task(
        dataset=json_dataset("theory_of_mind.jsonl"),
        solver=critique(),
        scorer=model_graded_fact(),
    )

Composite solvers by no means need to be implemented using chains. While chains are frequently used in more straightforward knowledge and reasoning evaluations, fully custom solver functions are often used for multi-turn dialog and agent evaluations.

This section covers mostly solvers as components (both built in and creating your own). The Agents section describes fully custom solvers in more depth.

Task States

Before we get into the specifics of how solvers work, we should describe TaskState, which is the fundamental data structure they act upon. A TaskState consists principally of chat history (derived from input and then extended by model interactions) and model output:

class TaskState:
    messages: list[ChatMessage],
    output: ModelOutput

Note that the TaskState definition above is simplified: there are other fields in a TaskState but we’re excluding them here for clarity.

A prompt engineering solver will modify the content of messages. A model generation solver will call the model, append an assistant message, and set the output (a multi-turn dialog solver might do this in a loop).

Solver Function

We’ve covered the role of solvers in the system, but what exactly are solvers technically? A solver is a Python function that takes a TaskState and generate function, and then transforms and returns the TaskState (the generate function may or may not be called depending on the solver).

async def solve(state: TaskState, generate: Generate):
    # do something useful with state (possibly
    # calling generate for more advanced solvers)
    # then return the state
    return state

The generate function passed to solvers is a convenience function that takes a TaskState, calls the model with it, appends the assistant message, and sets the model output. This is never used by prompt engineering solvers and often used by more complex solvers that want to have multiple model interactions.

Here are what some of the built-in solvers do with the TaskState:

  1. The system_message() solver inserts a system message into the chat history.

  2. The chain_of_thought() solver takes the original user prompt and re-writes it to ask the model to use chain of thought reasoning to come up with its answer.

  3. The generate() solver just calls the generate function on the state. In fact, this is the full source code for the generate() solver:

    async def solve(state: TaskState, generate: Generate):
        return await generate(state)
  4. The self_critique() solver takes the ModelOutput and then sends it to another model for critique. It then replays this critique back within the messages stream and re-calls generate to get a refined answer.

You can also imagine solvers that call other models to help come up with a better prompt, or solvers that implement a multi-turn dialog. Anything you can imagine is possible.

Built-In Solvers

Inspect has a number of built-in solvers, each of which can be customised in some fashion. Built in solvers can be imported from the inspect_ai.solver module. Below is a summary of these solvers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

  • system_message()

    Prepend role=“system” message to the list of messages (will follow any other system messages it finds in the message stream). Also automatically substitutes any variables defined in sample metadata as well as any other custom named paramters passed in params.

  • prompt_template()

    Modify the user prompt by substituting the current prompt into the {prompt} placeholder within the specified template. Also automatically substitutes any variables defined in sample metadata as well as any other custom named paramters passed in params.

  • chain_of_thought()

    Standard chain of thought template with {prompt} substitution variable. Asks the model to provide the final answer on a line by itself at the end for easier scoring.

  • use_tools()

    Define the set tools available for use by the model during generate().

  • generate()

    As illustrated above, just a simple call to generate(state). This is the default solver if no solver is specified.

  • self_critique()

    Prompts the model to critique the results of a previous call to generate() (note that this need not be the same model as they one you are evaluating—use the model parameter to choose another model). Makes use of {question} and {completion} template variables. Also automatically substitutes any variables defined in sample metadata

  • multiple_choice()

    A solver which presents A,B,C,D style choices from input samples and calls generate() to yield model output. This solver should nearly always paired with the choices() scorer. Learn more about Multiple Choice in the section below.

Multiple Choice

Here is the declaration for the multiple_choice() solver:

@solver
def multiple_choice(
    *,
    template: str | None = None,
    cot: bool = False,
    shuffle: bool | Random = False,
    multiple_correct: bool = False,
    
) -> Solver:

We’ll present an example and then discuss the various options below (in most cases you won’t need to customise these). First though there are some special considerations to be aware of when using the multiple_choice() solver:

  1. The Sample must include the available choices. Choices should not include letters (as they are automatically included when presenting the choices to the model).
  2. The Sample target should be a capital letter (e.g. A, B, C, D, etc.)
  3. You should always pair it with the choice() scorer in your task definition.
  4. It calls generate() internally, so you do need to separately include the generate() solver.

Example

Below is a full example of reading a dataset for use with multiple choice() and using it in an evaluation task. The underlying data in mmlu.csv has the following form:

Question A B C D Answer
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. 0 4 2 6 B
Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. 8 2 24 120 C

Here is the task definition:

@task
def mmlu():
    # read the dataset
    dataset = csv_dataset(
        "mmlu.csv", 
        sample_fields=record_to_sample
    )

    # task with multiple choice() and choice() scorer
    return Task(
        dataset=task_dataset,
        solver=multiple_choice(),
        scorer=choice(),
    )

def record_to_sample(record):
    return Sample(
        input=record["Question"],
        choices=[
            str(record["A"]),
            str(record["B"]),
            str(record["C"]),
            str(record["D"]),
        ],
        target=record["Answer"],
    )

We use the record_to_sample() function to read the choices along with the target (which should always be a letter ,e.g. A, B, C, or D). Note that you should not include letter prefixes in the choices, as they will be included automatically when presenting the question to the model.

Options

The following options are available for further customisation of the multiple choice solver:

Option Description
template Use template to provide an alternate prompt template (note that if you do this your template should handle prompting for multiple_correct directly if required). You can access the built in templates using the MultipleChoiceTemplate enum.
cot Whether the solver should perform chain-of-thought reasoning before answering (defaults to False). NOTE: this has no effect if you provide a custom template.
multiple_correct By default, multiple choice questions have a single correct answer. Set multiple_correct=True if your target has defined multiple correct answers (for example, a target of ["B", "C"]). In this case the model is prompted to provide one or more answers, and the sample is scored correct only if each of these answers are provided. NOTE: this has no effect if you provide a custom template.
shuffle If you specify shuffle=True, then the order of the answers presented to the model will be randomised (this may or may not affect results, depending on the nature of the questions and the model being evaluated).

Self Critique

Here is the declaration for the self_critique() solver:

def self_critique(
    critique_template: str | None = None,
    completion_template: str | None = None,
    model: str | Model | None = None,
) -> Solver:

There are two templates which correspond to the one used to solicit critique and the one used to play that critique back for a refined answer (default templates are provided for both).

You will likely want to experiment with using a distinct model for generating critiques (by default the model being evaluated is used).

Custom Solvers

In this section we’ll take a look at the source code for a couple of the built in solvers as a jumping off point for implementing your own solvers. A solver is an implementation of the Solver protocol (a function that transforms a TaskState):

async def solve(state: TaskState, generate: Generate) -> TaskState:
    # do something useful with state, possibly calling generate()
    # for more advanced solvers
    return state

Typically solvers can be customised with parameters (e.g. template for prompt engineering solvers). This means that a Solver is actually a function which returns the solve() function referenced above (this will become more clear in the examples below).

Task States

Before presenting the examples we’ll take a more in-depth look at the TaskState class. Task states consist of both lower level data members (e.g. messages, output) as well as a number of convenience properties. The core members of TaskState that are modified by solvers are messages / user_prompt and output:

Member Type Description
messages list[ChatMessage] Chat conversation history for sample. It is automatically appended to by the generate() solver, and is often manipulated by other solvers (e.g. for prompt engineering or elicitation).
user_prompt ChatMessageUser Convenience property for accessing the first user message in the message history (commonly used for prompt engineering).
output ModelOutput The ‘final’ model output once we’ve completed all solving. This field is automatically updated with the last “assistant” message by the generate() solver.

Note that the generate() solver automatically updates both the messages and output fields. For very simple evaluations modifying the user_prompt and then calling generate() encompasses all of the required interaction with TaskState.

Sometimes its important to have access to the original prompt input for the task (as other solvers may have re-written or even removed it entirely). This is available using the input and input_text properties:

Member Type Description
input str | list[ChatMessage] Original Sample input.
input_text str Convenience function for accessing the initial input from the Sample as a string.

There are several other fields used to provide contextual data from either the task sample or evaluation:

Member Type Description
sample_id int | str Unique ID for sample.
epoch int Epoch for sample.
metadata dict Original metadata from Sample
choices list[str] | None Choices from sample (used only in multiple-choice evals).
model ModelName Name of model currently being evaluated.

Task states also include available tools as well as guidance for the model on which tools to use (if you haven’t yet encountered the concept of tool use in language models, don’t worry about understanding these fields, the Tools article provides a more in-depth treatment):

Member Type Description
tools list[Tool] Tools available to the model
tool_choice ToolChoice Tool choice directive.

These fields are typically modified via the use_tools() solver, but they can also be modified directly for more advanced use cases.

Example: Prompt Template

Here’s the code for the prompt_template() solver:

@solver
def prompt_template(template: str, **params: dict[str, Any]):

    # determine the prompt template
    prompt_template = resource(template)

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        prompt = state.user_prompt
        kwargs = state.metadata | params
        prompt.text = prompt_template.format(prompt=prompt.text, **kwargs)
        return state

    return solve

A few things to note about this implementation:

  1. The function applies the @solver decorator—this registers the Solver with Inspect, making it possible to capture its name and parameters for logging, as well as make it callable from a configuration file (e.g. a YAML specification of an eval).

  2. The solve() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this solver doesn’t call generate() but others will).

  3. The resource() function is used to read the specified template. This function accepts a string, file, or URL as its argument, and then returns a string with the contents of the resource.

  4. We make use of the user_prompt property on the TaskState. This is a convenience property for locating the first role="user" message (otherwise you might need to skip over system messages, etc). Since this is a string templating solver, we use the state.user_prompt.text property (so we are dealing with prompt as a string, recall that it can also be a list of messages).

  5. We make sample metadata available to the template as well as any params passed to the function.

Example: Self Critique

Here’s the code for the self_critique() solver:

DEFAULT_CRITIQUE_TEMPLATE = r"""
Given the following question and answer, please critique the answer.
A good answer comprehensively answers the question and NEVER refuses
to answer. If the answer is already correct do not provide critique
- simply respond 'The original answer is fully correct'.

[BEGIN DATA]
***
[Question]: {question}
***
[Answer]: {completion}
***
[END DATA]

Critique: """

DEFAULT_CRITIQUE_COMPLETION_TEMPLATE = r"""
Given the following question, initial answer and critique please
generate an improved answer to the question:

[BEGIN DATA]
***
[Question]: {question}
***
[Answer]: {completion}
***
[Critique]: {critique}
***
[END DATA]

If the original answer is already correct, just repeat the
original answer exactly. You should just provide your answer to
the question in exactly this format:

Answer: <your answer> """

@solver
def self_critique(
    critique_template: str | None = None,
    completion_template: str | None = None,
    model: str | Model | None = None,
) -> Solver:
    # resolve templates
    critique_template = resource(
        critique_template or DEFAULT_CRITIQUE_TEMPLATE
    )
    completion_template = resource(
        completion_template or DEFAULT_CRITIQUE_COMPLETION_TEMPLATE
    )

    # resolve critique model
    model = get_model(model)

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # run critique
        critique = await model.generate(
            critique_template.format(
                question=state.input_text,
                completion=state.output.completion,
            )
        )

        # add the critique as a user message
        state.messages.append(
            ChatMessageUser(
                content=completion_template.format(
                    question=state.input_text,
                    completion=state.output.completion,
                    critique=critique.completion,
                ),
            )
        )

        # regenerate
        return await generate(state)

    return solve

Note that calls to generate() (for both the critique model and the model being evaluated) are called with await—this is critical to ensure that the solver participates correctly in the scheduling of generation work.

Scoring in Solvers

The solver-based scoring feature described below is currently available only in the development version of Inspect. To install the development version from GitHub:

pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

Typically, solvers don’t score samples but rather leave that to externally specified scorers. However, in some cases it is more convenient to have solvers also do scoring (e.g. when there is high coupling between the solver and scoring). The following two task state fields can be used for scoring:

Here is a trivial example of the code that might be used to yield scores from a solver:
Member Type Description
target Target Scoring target from Sample
scores dict[str, Score] Optional scores.
async def solve(state: TaskState, generate: Generate):
    # ...perform solver work
    
    # score
    correct = state.output.completion == state.target.text
    state.scores = { "correct": Score(value=correct) }
    return state

Note that scores yielded by a Solver are combined with scores from the normal scoring provided by the scorer(s) defined for a Task.

Intermediate Scoring

In some cases it is useful for a solver to score a task directly to generate an intermediate score or assist in deciding whether or how to continue. You can do this using the score() function:

from inspect_ai.scorer import score

def solver_that_scores() -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        
        # use score(s) to determine next step
        scores = await score(state)
        
        return state
    
    return solver

Note that the score() function returns a list of Score (as its possible that a task could have multiple scorers).

Concurrency

When creating custom solvers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your solver is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism for a more in depth discussion.

Early Termination

In some cases a solver has the context available to request an early termination of the sample (i.e. don’t call the rest of the solvers). In this case, setting the TaskState.completed field will result in forgoing remaining solvers. For example, here’s a simple solver that terminates the sample early:

@solver
def complete_task():
    async def solve(state: TaskState, generate: Generate):
        state.completed = True
        return state

    return solve

Early termination might also occur if you specify the message_limit option and the conversation exceeds that limit:

# could terminate early
eval(my_task, message_limit = 10)