Solvers
Overview
Solvers are the heart of Inspect evaluations and can serve a wide variety of purposes, including:
- Providing system prompts
- Prompt engineering (e.g. chain of thought)
- Model generation
- Self critique
- Multi-turn dialog
- Running an agent scaffold
Tasks have a single top-level solver that defines an execution plan. This solver could be implemented with arbitrary Python code (calling the model as required) or could consist of a set of other solvers composed together. Solvers can therefore play two different roles:
Composite specifications for task execution; and
Components that can be chained together.
Example
Here’s an example task definition that composes a few standard solver components:
@task
def theory_of_mind():
    return Task(
        dataset=json_dataset("theory_of_mind.jsonl"),
        solver=[
            system_message("system.txt"),
            prompt_template("prompt.txt"),
            generate(),
            self_critique()
        ],
        scorer=model_graded_fact(),
    )
In this example we pass a list of solver components directly to the Task. More often, though, we’ll wrap our solvers in an @solver decorated function to create a composite solver:
@solver
def critique(
    system_prompt = "system.txt",
    user_prompt = "prompt.txt",
):
    return chain(
        system_message(system_prompt),
        prompt_template(user_prompt),
        generate(),
        self_critique()
    )

@task
def theory_of_mind():
    return Task(
        dataset=json_dataset("theory_of_mind.jsonl"),
        solver=critique(),
        scorer=model_graded_fact(),
    )
Composite solvers by no means need to be implemented using chains. While chains are frequently used in more straightforward knowledge and reasoning evaluations, fully custom solver functions are often used for multi-turn dialog and agent evaluations.
This section mostly covers solvers as components (both the built-in ones and how to create your own). The Agents section describes fully custom solvers in more depth.
Task States
Before we get into the specifics of how solvers work, we should describe TaskState, which is the fundamental data structure they act upon. A TaskState consists principally of chat history (derived from input and then extended by model interactions) and model output:
class TaskState:
    messages: list[ChatMessage]
    output: ModelOutput
Note that the TaskState definition above is simplified: there are other fields in a TaskState but we’re excluding them here for clarity.
A prompt engineering solver will modify the content of messages. A model generation solver will call the model, append an assistant message, and set the output (a multi-turn dialog solver might do this in a loop).
Solver Function
We’ve covered the role of solvers in the system, but what exactly are solvers technically? A solver is a Python function that takes a TaskState and a generate function, and then transforms and returns the TaskState (the generate function may or may not be called depending on the solver).
async def solve(state: TaskState, generate: Generate):
    # do something useful with state (possibly
    # calling generate for more advanced solvers)
    # then return the state
    return state
The generate function passed to solvers is a convenience function that takes a TaskState, calls the model with it, appends the assistant message, and sets the model output. Prompt engineering solvers never call it, while more complex solvers that need multiple model interactions often do.
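As a rough illustration of this contract, the sketch below uses simplified stand-in classes rather than the real `inspect_ai` types, and a fake `generate` function that simply echoes the last message. The names `FakeTaskState`, `fake_generate`, `add_instruction`, and `generate_solver` are hypothetical and exist only for this example.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str
    content: str


@dataclass
class FakeTaskState:
    # stand-in for TaskState: chat history plus the latest model output
    messages: list = field(default_factory=list)
    output: str = ""


async def fake_generate(state: FakeTaskState) -> FakeTaskState:
    # stand-in for generate: "call the model", append the assistant
    # message, and set the output
    reply = f"echo: {state.messages[-1].content}"
    state.messages.append(Message("assistant", reply))
    state.output = reply
    return state


async def add_instruction(state, generate):
    # prompt engineering solver: edits messages and never calls generate
    state.messages[-1].content += " Think step by step."
    return state


async def generate_solver(state, generate):
    # model generation solver: delegates entirely to generate
    return await generate(state)


async def main() -> FakeTaskState:
    state = FakeTaskState(messages=[Message("user", "What is 2 + 2?")])
    state = await add_instruction(state, fake_generate)
    state = await generate_solver(state, fake_generate)
    return state


final_state = asyncio.run(main())
print(final_state.output)
```

Both solvers share the same signature, which is what allows them to be applied one after another to the same state.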
Here is what some of the built-in solvers do with the TaskState:
- The system_message() solver inserts a system message into the chat history.
- The chain_of_thought() solver takes the original user prompt and rewrites it to ask the model to use chain-of-thought reasoning to come up with its answer.
- The generate() solver just calls the generate function on the state. In fact, this is the full source code for the generate() solver:
      async def solve(state: TaskState, generate: Generate):
          return await generate(state)
- The self_critique() solver takes the ModelOutput and sends it to another model for critique. It then replays this critique back within the messages stream and re-calls generate to get a refined answer.
You can also imagine solvers that call other models to help come up with a better prompt, or solvers that implement a multi-turn dialog. Anything you can imagine is possible.
Built-In Solvers
Inspect has a number of built-in solvers, each of which can be customised in some fashion. Built-in solvers can be imported from the inspect_ai.solver module. Below is a summary of these solvers. There is not (yet) reference documentation on these functions, so the best way to learn how they can be customised is to use the Go to Definition command in your source editor.
- system_message(): Prepend a role="system" message to the list of messages (it will follow any other system messages it finds in the message stream). Also automatically substitutes any variables defined in sample metadata as well as any other custom named parameters passed in params.
- prompt_template(): Modify the user prompt by substituting the current prompt into the {prompt} placeholder within the specified template. Also automatically substitutes any variables defined in sample metadata as well as any other custom named parameters passed in params.
- chain_of_thought(): Standard chain of thought template with a {prompt} substitution variable. Asks the model to provide the final answer on a line by itself at the end for easier scoring.
- use_tools(): Define the set of tools available for use by the model during generate().
- generate(): As illustrated above, just a simple call to generate(state). This is the default solver if no solver is specified.
- self_critique(): Prompts the model to critique the results of a previous call to generate() (note that this need not be the same model as the one you are evaluating; use the model parameter to choose another model). Makes use of {question} and {completion} template variables. Also automatically substitutes any variables defined in sample metadata.
- multiple_choice(): A solver which presents A, B, C, D style choices from input samples and calls generate() to yield model output. This solver should nearly always be paired with the choice() scorer. Learn more about Multiple Choice in the section below.
Multiple Choice
Here is the declaration for the multiple_choice() solver:
@solver
def multiple_choice(
    *,
    template: str | None = None,
    cot: bool = False,
    shuffle: bool | Random = False,
    multiple_correct: bool = False,
) -> Solver:
We’ll present an example and then discuss the various options below (in most cases you won’t need to customise these). First, though, there are some special considerations to be aware of when using the multiple_choice() solver:
- The Sample must include the available choices. Choices should not include letters (as they are automatically included when presenting the choices to the model).
- The Sample target should be a capital letter (e.g. A, B, C, D, etc.)
- You should always pair it with the choice() scorer in your task definition.
- It calls generate() internally, so you do not need to separately include the generate() solver.
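To see why choices should not carry their own letters, here is a rough sketch of how letter prefixes might be added when choices are presented to the model. The `format_choices` helper and the `A) ...` layout are hypothetical; they are not Inspect's actual implementation or prompt format.

```python
import string


def format_choices(choices: list) -> str:
    # pair each choice with a capital letter: A, B, C, ...
    letters = string.ascii_uppercase
    return "\n".join(
        f"{letter}) {choice}" for letter, choice in zip(letters, choices)
    )


# choices supplied without letters are labelled cleanly
clean = format_choices(["0", "4", "2", "6"])

# choices that already include letters end up with a doubled prefix
doubled = format_choices(["A) 0", "B) 4"])

print(clean)
print(doubled)
```

With pre-lettered choices, the first line of the second result reads `A) A) 0`, which is the kind of duplication the guidance above avoids.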
Example
Below is a full example of reading a dataset for use with multiple_choice() and using it in an evaluation task. The underlying data in mmlu.csv has the following form:
Question | A | B | C | D | Answer |
---|---|---|---|---|---|
Find the degree for the given field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q. | 0 | 4 | 2 | 6 | B |
Let p = (1, 2, 5, 4)(2, 3) in S_5 . Find the index of <p> in S_5. | 8 | 2 | 24 | 120 | C |
Here is the task definition:
@task
def mmlu():
    # read the dataset
    task_dataset = csv_dataset(
        "mmlu.csv",
        sample_fields=record_to_sample
    )

    # task with multiple_choice() and choice() scorer
    return Task(
        dataset=task_dataset,
        solver=multiple_choice(),
        scorer=choice(),
    )

def record_to_sample(record):
    return Sample(
        input=record["Question"],
        choices=[
            str(record["A"]),
            str(record["B"]),
            str(record["C"]),
            str(record["D"]),
        ],
        target=record["Answer"],
    )
We use the record_to_sample() function to read the choices along with the target (which should always be a letter, e.g. A, B, C, or D). Note that you should not include letter prefixes in the choices, as they will be included automatically when presenting the question to the model.
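As a quick check of this mapping, the sketch below parses the first sample row with the standard `csv` module and applies the same field logic. It assumes the file is comma-delimited with a header row, and it builds a plain dictionary instead of a real `Sample` so that it runs without Inspect installed. The `record_to_fields` name is hypothetical.

```python
import csv
import io

# one header row plus the first data row from the table above
CSV_TEXT = (
    "Question,A,B,C,D,Answer\n"
    '"Find the degree for the given field extension '
    'Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",0,4,2,6,B\n'
)


def record_to_fields(record: dict) -> dict:
    # mirrors record_to_sample(), but returns a dict instead of a Sample
    return {
        "input": record["Question"],
        "choices": [str(record[key]) for key in ("A", "B", "C", "D")],
        "target": record["Answer"],
    }


rows = list(csv.DictReader(io.StringIO(CSV_TEXT)))
fields = record_to_fields(rows[0])
print(fields["choices"], fields["target"])
```

The choices come through as bare strings without letter prefixes, and the target is the single letter from the Answer column.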
Options
The following options are available for further customisation of the multiple choice solver:
Option | Description |
---|---|
template | Use template to provide an alternate prompt template (note that if you do this your template should handle prompting for multiple_correct directly if required). You can access the built-in templates using the MultipleChoiceTemplate enum. |
cot | Whether the solver should perform chain-of-thought reasoning before answering (defaults to False). NOTE: this has no effect if you provide a custom template. |
multiple_correct | By default, multiple choice questions have a single correct answer. Set multiple_correct=True if your target has defined multiple correct answers (for example, a target of ["B", "C"]). In this case the model is prompted to provide one or more answers, and the sample is scored correct only if each of these answers is provided. NOTE: this has no effect if you provide a custom template. |
shuffle | If you specify shuffle=True, then the order of the answers presented to the model will be randomised (this may or may not affect results, depending on the nature of the questions and the model being evaluated). |
Self Critique
Here is the declaration for the self_critique() solver:
def self_critique(
    critique_template: str | None = None,
    completion_template: str | None = None,
    model: str | Model | None = None,
) -> Solver:
There are two templates which correspond to the one used to solicit critique and the one used to play that critique back for a refined answer (default templates are provided for both).
You will likely want to experiment with using a distinct model for generating critiques (by default the model being evaluated is used).
Custom Solvers
In this section we’ll take a look at the source code for a couple of the built-in solvers as a jumping-off point for implementing your own solvers. A solver is an implementation of the Solver protocol (a function that transforms a TaskState):
async def solve(state: TaskState, generate: Generate) -> TaskState:
    # do something useful with state, possibly calling generate()
    # for more advanced solvers
    return state
Typically solvers can be customised with parameters (e.g. template for prompt engineering solvers). This means that a Solver is actually a function which returns the solve() function referenced above (this will become clearer in the examples below).
Task States
Before presenting the examples we’ll take a more in-depth look at the TaskState class. Task states consist of both lower level data members (e.g. messages, output) as well as a number of convenience properties. The core members of TaskState that are modified by solvers are messages / user_prompt and output:
Member | Type | Description |
---|---|---|
messages | list[ChatMessage] | Chat conversation history for sample. It is automatically appended to by the generate() solver, and is often manipulated by other solvers (e.g. for prompt engineering or elicitation). |
user_prompt | ChatMessageUser | Convenience property for accessing the first user message in the message history (commonly used for prompt engineering). |
output | ModelOutput | The ‘final’ model output once we’ve completed all solving. This field is automatically updated with the last “assistant” message by the generate() solver. |
Note that the generate() solver automatically updates both the messages and output fields. For very simple evaluations, modifying the user_prompt and then calling generate() encompasses all of the required interaction with TaskState.
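To make the role of user_prompt more concrete, here is a minimal sketch of how a convenience property of this kind can locate the first user message and allow it to be edited in place. `FakeTaskState` and `Message` are simplified, hypothetical stand-ins and not the real Inspect classes.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    role: str
    text: str


@dataclass
class FakeTaskState:
    # simplified stand-in for TaskState (hypothetical)
    messages: list = field(default_factory=list)

    @property
    def user_prompt(self) -> Message:
        # return the first message with role "user", skipping system messages
        return next(m for m in self.messages if m.role == "user")


state = FakeTaskState(
    messages=[
        Message("system", "You are a careful assistant."),
        Message("user", "What is the capital of France?"),
    ]
)

# editing the returned message changes the chat history in place
state.user_prompt.text = "Answer briefly: " + state.user_prompt.text
print(state.messages[1].text)
```

Because the property returns the message object itself rather than a copy, the edit is visible to any solver that later reads the message history.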
Sometimes it’s important to have access to the original prompt input for the task (as other solvers may have rewritten or even removed it entirely). This is available using the input and input_text properties:
Member | Type | Description |
---|---|---|
input | str \| list[ChatMessage] | Original Sample input. |
input_text | str | Convenience property for accessing the initial input from the Sample as a string. |
There are several other fields used to provide contextual data from either the task sample or evaluation:
Member | Type | Description |
---|---|---|
sample_id | int \| str | Unique ID for sample. |
epoch | int | Epoch for sample. |
metadata | dict | Original metadata from Sample. |
choices | list[str] \| None | Choices from sample (used only in multiple-choice evals). |
model | ModelName | Name of model currently being evaluated. |
Task states also include available tools as well as guidance for the model on which tools to use (if you haven’t yet encountered the concept of tool use in language models, don’t worry about understanding these fields, the Tools article provides a more in-depth treatment):
Member | Type | Description |
---|---|---|
tools | list[Tool] | Tools available to the model. |
tool_choice | ToolChoice | Tool choice directive. |
These fields are typically modified via the use_tools() solver, but they can also be modified directly for more advanced use cases.
Example: Prompt Template
Here’s the code for the prompt_template() solver:
@solver
def prompt_template(template: str, **params: dict[str, Any]):

    # determine the prompt template
    prompt_template = resource(template)

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        prompt = state.user_prompt
        kwargs = state.metadata | params
        prompt.text = prompt_template.format(prompt=prompt.text, **kwargs)
        return state

    return solve
A few things to note about this implementation:
- The function applies the @solver decorator. This registers the Solver with Inspect, making it possible to capture its name and parameters for logging, as well as make it callable from a configuration file (e.g. a YAML specification of an eval).
- The solve() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this solver doesn’t call generate() but others will).
- The resource() function is used to read the specified template. This function accepts a string, file, or URL as its argument, and then returns a string with the contents of the resource.
- We make use of the user_prompt property on the TaskState. This is a convenience property for locating the first role="user" message (otherwise you might need to skip over system messages, etc.). Since this is a string templating solver, we use the state.user_prompt.text property (so we are dealing with the prompt as a string; recall that it can also be a list of messages).
- We make sample metadata available to the template as well as any params passed to the function.
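The substitution step relies on two plain Python behaviours: dictionary union with the `|` operator and `str.format`. The short sketch below shows them in isolation; the template text, metadata, and params values are made up for illustration.

```python
template = "{instructions}\n\nQuestion ({subject}): {prompt}"

# hypothetical sample metadata and solver params
metadata = {"subject": "algebra", "instructions": "From sample metadata."}
params = {"instructions": "Answer with a single letter."}

# the dict union operator keeps the right-hand value on key conflicts,
# so params override metadata entries with the same name
kwargs = metadata | params

rendered = template.format(prompt="What is 2 + 2?", **kwargs)
print(rendered)
```

In this plain-Python sketch, a metadata or params key named `prompt` would clash with the explicit `prompt=` argument and raise a `TypeError`, so that name is best left to the template's own placeholder.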
Example: Self Critique
Here’s the code for the self_critique() solver:
DEFAULT_CRITIQUE_TEMPLATE = r"""
Given the following question and answer, please critique the answer.
A good answer comprehensively answers the question and NEVER refuses
to answer. If the answer is already correct do not provide critique
- simply respond 'The original answer is fully correct'.
[BEGIN DATA]
***
[Question]: {question}
***
[Answer]: {completion}
***
[END DATA]
Critique: """

DEFAULT_CRITIQUE_COMPLETION_TEMPLATE = r"""
Given the following question, initial answer and critique please
generate an improved answer to the question:
[BEGIN DATA]
***
[Question]: {question}
***
[Answer]: {completion}
***
[Critique]: {critique}
***
[END DATA]
If the original answer is already correct, just repeat the
original answer exactly. You should just provide your answer to
the question in exactly this format:
Answer: <your answer> """
@solver
def self_critique(
    critique_template: str | None = None,
    completion_template: str | None = None,
    model: str | Model | None = None,
) -> Solver:
    # resolve templates
    critique_template = resource(
        critique_template or DEFAULT_CRITIQUE_TEMPLATE
    )
    completion_template = resource(
        completion_template or DEFAULT_CRITIQUE_COMPLETION_TEMPLATE
    )

    # resolve critique model
    model = get_model(model)

    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # run critique
        critique = await model.generate(
            critique_template.format(
                question=state.input_text,
                completion=state.output.completion,
            )
        )

        # add the critique as a user message
        state.messages.append(
            ChatMessageUser(
                content=completion_template.format(
                    question=state.input_text,
                    completion=state.output.completion,
                    critique=critique.completion,
                ),
            )
        )

        # regenerate
        return await generate(state)

    return solve
Note that calls to generate() (for both the critique model and the model being evaluated) are made with await. This is critical to ensure that the solver participates correctly in the scheduling of generation work.
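The effect of awaiting can be illustrated with plain `asyncio`. In the sketch below, a hypothetical `fake_generate` coroutine simulates a slow model call with `asyncio.sleep`; because each solver awaits it, the event loop is free to overlap the waits for several samples. This is a general illustration of cooperative scheduling, not a description of Inspect's own scheduler.

```python
import asyncio
import time


async def fake_generate(prompt: str) -> str:
    # simulate a slow model call; awaiting yields control to the event loop
    await asyncio.sleep(0.2)
    return f"answer to {prompt}"


async def solve(prompt: str) -> str:
    # the solver awaits the call instead of blocking the thread
    return await fake_generate(prompt)


async def run_all(prompts: list) -> list:
    # run one solver per sample concurrently, preserving input order
    return await asyncio.gather(*(solve(p) for p in prompts))


start = time.perf_counter()
results = asyncio.run(run_all(["q1", "q2", "q3"]))
elapsed = time.perf_counter() - start
print(results, f"{elapsed:.2f}s")
```

On an otherwise idle machine the three simulated calls should finish in roughly the time of one, since the waits overlap rather than running back to back.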
Scoring in Solvers
The solver-based scoring feature described below is currently available only in the development version of Inspect. To install the development version from GitHub:
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
Typically, solvers don’t score samples but rather leave that to externally specified scorers. However, in some cases it is more convenient to have solvers also do scoring (e.g. when there is high coupling between the solver and scoring). The following two task state fields can be used for scoring:
Member | Type | Description |
---|---|---|
target | Target | Scoring target from Sample. |
scores | dict[str, Score] | Optional scores. |
async def solve(state: TaskState, generate: Generate):
    # ...perform solver work

    # score
    correct = state.output.completion == state.target.text
    state.scores = { "correct": Score(value=correct) }
    return state
Note that scores yielded by a Solver are combined with scores from the normal scoring provided by the scorer(s) defined for a Task.
Intermediate Scoring
In some cases it is useful for a solver to score a task directly to generate an intermediate score or assist in deciding whether or how to continue. You can do this using the score() function:
from inspect_ai.scorer import score

def solver_that_scores() -> Solver:
    async def solve(state: TaskState, generate: Generate) -> TaskState:
        # use score(s) to determine next step
        scores = await score(state)

        return state

    return solve
Note that the score() function returns a list of Score (as it’s possible that a task could have multiple scorers).
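One way intermediate scores might be used is to retry generation until the answer is judged correct. The sketch below illustrates that control flow with stand-ins only: `FakeTaskState`, `fake_generate`, `fake_score`, and `retry_until_correct` are hypothetical, and `fake_score` returns plain booleans rather than real `Score` objects.

```python
import asyncio
from dataclasses import dataclass


@dataclass
class FakeTaskState:
    # simplified stand-in for TaskState (hypothetical)
    target: str
    output: str = ""
    attempts: int = 0


async def fake_generate(state: FakeTaskState) -> FakeTaskState:
    # pretend the model only gets the answer right on its third attempt
    state.attempts += 1
    state.output = state.target if state.attempts >= 3 else "wrong"
    return state


async def fake_score(state: FakeTaskState) -> list:
    # stand-in for score(): one result per scorer (a single scorer here)
    return [state.output == state.target]


def retry_until_correct(max_attempts: int = 5):
    async def solve(state, generate):
        for _ in range(max_attempts):
            state = await generate(state)
            scores = await fake_score(state)
            if all(scores):
                break  # stop once every scorer is satisfied
        return state

    return solve


solver = retry_until_correct()
state = asyncio.run(solver(FakeTaskState(target="4"), fake_generate))
print(state.attempts, state.output)
```

The loop is bounded by `max_attempts`, so a sample that never scores correctly still terminates.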
Concurrency
When creating custom solvers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your solver is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism for a more in depth discussion.
Early Termination
In some cases a solver has the context available to request an early termination of the sample (i.e. don’t call the rest of the solvers). In this case, setting the TaskState.completed field will result in forgoing remaining solvers. For example, here’s a simple solver that terminates the sample early:
@solver
def complete_task():
    async def solve(state: TaskState, generate: Generate):
        state.completed = True
        return state

    return solve
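To show the effect of the completed field on a sequence of solvers, here is a minimal sketch of a runner that stops once the flag is set. The `run_solvers` loop, the `record` solver factory, and `FakeTaskState` are hypothetical stand-ins; Inspect's own execution logic may differ in its details.

```python
import asyncio
from dataclasses import dataclass, field


@dataclass
class FakeTaskState:
    # simplified stand-in for TaskState (hypothetical)
    completed: bool = False
    log: list = field(default_factory=list)


def record(name: str, complete: bool = False):
    # solver factory: records that it ran and optionally marks completion
    async def solve(state, generate):
        state.log.append(name)
        if complete:
            state.completed = True
        return state

    return solve


async def run_solvers(state, solvers):
    # skip the remaining solvers once completed has been set
    for solver in solvers:
        if state.completed:
            break
        state = await solver(state, None)
    return state


solvers = [record("first"), record("stop", complete=True), record("never")]
state = asyncio.run(run_solvers(FakeTaskState(), solvers))
print(state.log)
```

The third solver never runs because the second one set the flag before the runner reached it.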
Early termination might also occur if you specify the message_limit option and the conversation exceeds that limit:
# could terminate early
eval(my_task, message_limit = 10)