Scorers
Overview
Scorers evaluate whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. Scorers generally take one of the following forms:
- Extracting a specific answer out of a model's completion output using a variety of heuristics.
- Applying a text similarity algorithm to see if the model's completion is close to what is set out in the target.
- Using another model to assess whether the model's completion satisfies a description of the ideal answer in the target.
- Using another rubric entirely (e.g. did the model produce a valid version of a file format, etc.)
Scorers also define one or more metrics which are used to aggregate scores (e.g. accuracy() which computes what percentage of scores are correct, or mean() which provides an average for scores that exist on a continuum).
Built-In Scorers
Inspect includes some simple text matching scorers as well as a couple of model graded scorers. Built in scorers can be imported from the inspect_ai.scorer
module. Below is a summary of these scorers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.
- includes(): Determine whether the target from the Sample appears anywhere inside the model output. Can be case sensitive or insensitive (defaults to the latter).
- match(): Determine whether the target from the Sample appears at the beginning or end of model output (defaults to looking at the end). Has options for ignoring case, white-space, and punctuation (all are ignored by default); see the usage sketch after this list.
- pattern(): Extract the answer from model output using a regular expression.
- answer(): Scorer for model output that precedes answers with "ANSWER:". Can extract letters, words, or the remainder of the line.
- exact(): Scorer which normalizes the text of the answer and target(s) and performs an exact matching comparison of the text. This scorer will return CORRECT when the answer is an exact match to one or more targets.
- f1(): Scorer which computes the F1 score for the answer (which balances recall and precision by taking their harmonic mean).
- model_graded_qa(): Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.
- model_graded_fact(): Have another model assess whether the model output contains a fact that is set out in target. This is a narrower assessment than model_graded_qa(), and is used when model output is too complex to be assessed using a simple match() or pattern() scorer.
- choice(): Specialised scorer that is used with the multiple_choice() solver.
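For example, several of the scorers above can be configured when they are used in a task. This is a minimal sketch (the option values shown reflect the descriptions above; use Go to Definition to confirm the exact parameter names):

from inspect_ai.scorer import answer, includes, match

# substring match, case sensitive rather than the default insensitive
scorer = includes(ignore_case=False)

# look for the target at the beginning of the output rather than the end
scorer = match(location="begin")

# extract the letter following "ANSWER:" (e.g. for multiple choice output)
scorer = answer("letter")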
Scorers provide one or more built-in metrics (each of the scorers above provides accuracy
and stderr
as a metric). You can also provide your own custom metrics in Task definitions. For example:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)
The current development version of Inspect replaces the use of the bootstrap_stderr
metric with stderr
for the built in scorers enumerated above.
Since eval scores are means of numbers having finite variance, we can compute standard errors using the Central Limit Theorem rather than bootstrapping. Bootstrapping is generally useful in contexts with more complex structure or non-mean summary statistics (e.g. quantiles). You will notice that the bootstrap numbers will come in quite close to the analytic numbers, since they are estimating the same thing.
A common misunderstanding is that “t-tests require the underlying data to be normally distributed”. This is only true for small-sample problems; for large sample problems (say 30 or more questions), you just need finite variance in the underlying data and the CLT guarantees a normally distributed mean value.
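To illustrate this point, here is a small sketch (plain numpy, not part of Inspect) comparing the analytic standard error of the mean with a bootstrap estimate on simulated 0/1 scores; the two values come out very close:

import numpy as np

rng = np.random.default_rng(0)
scores = rng.integers(0, 2, size=500).astype(float)  # simulated 0/1 scores

# analytic standard error of the mean (via the CLT)
analytic = scores.std(ddof=1) / np.sqrt(len(scores))

# bootstrap estimate of the standard error of the mean
boot_means = [rng.choice(scores, size=len(scores), replace=True).mean()
              for _ in range(1000)]
bootstrap = np.std(boot_means, ddof=1)

print(f"analytic: {analytic:.4f}, bootstrap: {bootstrap:.4f}")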
Model Graded
Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).
Here is the declaration for the model_graded_qa() function:
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    include_history: bool | Callable[[TaskState], str] = False,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
) -> Scorer:
    ...
The default model graded QA scorer is tuned to grade answers to open ended questions. The default template
and instructions
ask the model to produce a grade in the format GRADE: C
or GRADE: I
, and this grade is extracted using the default grade_pattern
regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:
- Provide alternate instructions: the default instructions ask the model to use chain of thought reasoning and provide grades in the format GRADE: C or GRADE: I. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the grade_pattern.
- Specify include_history = True to include the full chat history in the presented question (by default only the original sample input is presented). You may optionally instead pass a function that enables customising the presentation of the chat history.
- Specify partial_credit = True to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default instructions.
- Specify an alternate model to perform the grading (e.g. a more powerful model or a model fine tuned for grading); a sketch combining several of these options follows this list.
- Specify a different template; note that templates are passed these variables: question, criterion, answer, and instructions.
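For example, here is a minimal sketch combining several of these options (the grader model name is just a placeholder):

model_graded_qa(
    include_history=True,   # present the full chat history to the grader
    partial_credit=True,    # allow partial credit (converted to 0.5 by default)
    model="openai/gpt-4"    # use a separate grader model
)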
The model_graded_fact() scorer works identically to model_graded_qa(), and simply provides an alternate template
oriented around judging whether a fact is included in the model output.
If you want to understand how the default templates for model_graded_qa() and model_graded_fact() work, see their source code.
Multiple Models
The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:
model_graded_qa(
    model=[
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)
The implementation of multiple grader models takes advantage of the multi_scorer() and majority_vote()
functions, both of which can be used in your own scorers (as described in the Multiple Scorers section below).
Custom Scorers
Custom scorers are functions that take a TaskState and Target, and yield a Score.
async def score(state: TaskState, target: Target):
    # Compare state / model output with target
    # to yield a score
    return Score(value=...)
First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.
Note that score
above is declared as an async
function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.
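For instance, a scorer that calls an external service should use an async client so that it cooperates with Inspect's scheduler. The sketch below is illustrative only: the grading endpoint and the httpx dependency are assumptions, not part of Inspect.

import httpx

from inspect_ai.scorer import Score, Target, accuracy, scorer, stderr
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def remote_grader(url: str = "https://example.com/grade"):  # hypothetical endpoint
    async def score(state: TaskState, target: Target):
        # hypothetical API: POST the answer and target, receive {"correct": bool}
        async with httpx.AsyncClient() as client:
            response = await client.post(url, json={
                "answer": state.output.completion,
                "target": target.text,
            })
            correct = response.json().get("correct", False)
        return Score(
            value="C" if correct else "I",
            answer=state.output.completion
        )

    return score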
Score
The components of Score include:
| Field | Type | Description |
|---|---|---|
| value | Value | Value assigned to the sample (e.g. "C" or "I", or a raw numeric value). |
| answer | str | Text extracted from model output for comparison (optional). |
| explanation | str | Explanation of score, e.g. full model output or grader model output (optional). |
| metadata | dict[str,Any] | Additional metadata about the score to record in the log file (optional). |
For example, the following are all valid Score objects:
="C")
Score(value="I")
Score(value=0.6)
Score(value
Score(="C" if extracted == target.text else "I",
value=extracted,
answer=state.output.completion
explanation )
If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer
as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.
Value
Value is a union over the main scalar types as well as a list or dict of the same types:
Value = Union[
    str | int | float | bool,
    list[str | int | float | bool],
    dict[str, str | int | float | bool],
]
The vast majority of scorers will use str
(e.g. for correct/incorrect via “C” and “I”) or float
(the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).
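For example, here is a minimal sketch of a scorer that returns a float value and declares metrics that know how to aggregate floats (the scoring rule itself is arbitrary and just for illustration):

from inspect_ai.scorer import Score, Target, mean, scorer, stderr
from inspect_ai.solver import TaskState

@scorer(metrics=[mean(), stderr()])
def length_ratio():
    async def score(state: TaskState, target: Target):
        # arbitrary continuous value: ratio of answer length to target length, capped at 1.0
        answer = state.output.completion
        return Score(
            value=min(len(answer) / max(len(target.text), 1), 1.0),
            answer=answer
        )

    return score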
Next, we’ll take a look at the source code for a couple of the built in scorers as a jumping off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scorer Workflow section below for tips on optimising your development process.
Models in Scorers
You’ll often want to use models in the implementation of scorers. Use the get_model() function to get either the currently evaluated model or another model interface. For example:
# use the model being evaluated for grading
grader_model = get_model()

# use another model for grading
grader_model = get_model("google/gemini-1.5-pro")
Use the config
parameter of get_model() to override default generation options:
grader_model = get_model(
    "google/gemini-1.5-pro",
    config=GenerateConfig(temperature=0.9, max_connections=10)
)
Example: Includes
Here is the source code for the built-in includes() scorer:
@scorer(metrics=[accuracy(), stderr()])   # (1)
def includes(ignore_case: bool = True):
    async def score(state: TaskState, target: Target):   # (2)
        # check for correct
        answer = state.output.completion
        target = target.text   # (3)
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
            value=CORRECT if correct else INCORRECT,   # (4)
            answer=answer   # (5)
        )

    return score
1. The function applies the @scorer decorator and registers two metrics for use with the scorer.
2. The score function is declared as async. This is so that it can participate in Inspect's optimised scheduling for expensive model generation calls (this scorer doesn't call a model but others will).
3. We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
4. We use the special constants CORRECT and INCORRECT for the score value (the accuracy(), stderr(), and bootstrap_stderr() metrics know how to convert these special constants to the float values 1.0 and 0.0 respectively).
5. We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).
Example: Model Grading
Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:

    # resolve grading template and instructions
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:
        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score
Note that the call to grader_model.generate() is done with await; this is critical to ensure that the scorer participates correctly in the scheduling of generation work.
Note also we use the input_text
property of the TaskState to access a string version of the original user input to substitute it into the grading template. Using the input_text
has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages
); and (2) It normalises the input to a string (as it could have been a message list).
Multiple Scorers
There are several ways to use multiple scorers in an evaluation:
- You can provide a list of scorers in a Task definition (this is the best option when scorers are entirely independent)
- You can yield multiple scores from a Scorer (this is the best option when scores share code and/or expensive computations).
- You can use multiple scorers and then aggregate them into a single scorer (e.g. majority voting).
List of Scorers
Task definitions can specify multiple scorers. For example, the below task will use two different models to grade the results, storing two scores with each sample, one for each of the two models:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=[
        model_graded_qa(model="openai/gpt-4"),
        model_graded_qa(model="google/gemini-1.5-pro")
    ],
)
This is useful when there is more than one way to score a result and you would like to preserve the individual score values with each sample (versus reducing the multiple scores to a single value).
Scorer with Multiple Values
You may also create a scorer which yields multiple scores. This is useful when the scores use data that is shared or expensive to compute. For example:
@scorer(
    metrics={   # (1)
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
        return Score(   # (2)
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count()
)
1. The metrics for this scorer are a dictionary; this defines metrics to be applied to scores (by name).
2. The score value itself is a dictionary, the keys corresponding to the keys defined in the metrics on the @scorer decorator.
The above example will produce two scores, a_count
and e_count
, each of which will have metrics for mean
and stderr
.
When working with complex score values and metrics, you may use globs as keys for mapping metrics to scores. For example, a more succinct way to write the previous example:
@scorer(
    metrics={
        "*": [mean(), stderr()],
    }
)
Glob keys will each be resolved and a complete list of matching metrics will be applied to each score key. For example, to compute mean for all score keys and only compute stderr for e_count, you could write:
@scorer(
    metrics={
        "*": [mean()],
        "e_count": [stderr()]
    }
)
Scorer with Complex Metrics
Sometimes it is useful for a scorer to compute multiple values (returning a dictionary as the score value) and to have metrics computed both for each key in the score dictionary and for the dictionary as a whole. For example:
# total_count is defined before letter_count so it can be
# referenced in the @scorer decorator below
@metric
def total_count() -> Metric:
    def metric(scores: list[SampleScore]) -> int | float:
        total = 0.0
        for score in scores:
            # sum the two counts across all samples
            total += (score.score.value["a_count"]   # (3)
                      + score.score.value["e_count"])
        return total
    return metric

@scorer(
    metrics=[{   # (1)
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }, total_count()]
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
        return Score(   # (2)
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count()
)
1. The metrics for this scorer are a list: one element is a dictionary, which defines metrics to be applied to scores (by name); the other element is a Metric which will receive the entire score dictionary.
2. The score value itself is a dictionary, the keys corresponding to the keys defined in the metrics on the @scorer decorator.
3. The total_count metric computes a metric based upon the entire score dictionary (since it isn't being mapped onto the dictionary by key).
Reducing Multiple Scores
It’s possible to use multiple scorers in parallel, then reduce their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:
multi_scorer(
    scorers=[model_graded_qa(model=model) for model in models],
    reducer="mode"
)
Use of multi_scorer() requires both a list of scorers as well as a reducer which determines how a list of scores will be turned into a single score. In this case we use the “mode” reducer which returns the score that appeared most frequently in the answers.
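For example, here is a minimal sketch that uses the combined scorer directly in a task (assuming models is a list of grader model names, and dataset and SYSTEM_MESSAGE are defined as in the earlier examples):

Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=multi_scorer(
        scorers=[model_graded_qa(model=model) for model in models],
        reducer="mode"
    )
)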
Sandbox Access
If your Solver is an Agent with tool use, you might want to inspect the contents of the tool sandbox to score the task.
The contents of the sandbox for the Sample are available to the scorer; simply call await sandbox().read_file()
(or .exec()
).
For example:
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Plan, TaskState, generate, use_tools
from inspect_ai.tool import bash
from inspect_ai.util import sandbox
@scorer(metrics=[accuracy()])
def check_file_exists():
    async def score(state: TaskState, target: Target):
        try:
            _ = await sandbox().read_file(target.text)
            exists = True
        except FileNotFoundError:
            exists = False
        return Score(value=1 if exists else 0)

    return score

@task
def challenge() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Create a file called hello-world.txt",
                target="hello-world.txt",
            )
        ],
        solver=[use_tools([bash()]), generate()],
        sandbox="local",
        scorer=check_file_exists(),
    )
Scoring Metrics
Each scorer provides one or more built-in metrics (typically accuracy and stderr) corresponding to the metrics most commonly useful for that scorer. You can override a scorer's built-in metrics by passing an alternate list of metrics to the Task. For example:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=choice(),
    metrics=[custom_metric()]
)
If you still want to compute the built-in metrics, re-specify them along with the custom metrics:

metrics=[accuracy(), stderr(), custom_metric()]
Built-In Metrics
Inspect includes some simple built in metrics for calculating accuracy, mean, etc. Built in metrics can be imported from the inspect_ai.scorer
module. Below is a summary of these metrics. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.
- accuracy(): Compute proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.
- mean(): Mean of all scores.
- var(): Sample variance over all scores.
- std(): Standard deviation over all scores.
- stderr(): Standard error of the mean (see below for details on computing clustered standard errors).
- bootstrap_stderr(): Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option, as shown in the sketch after this list).
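For example, here is a sketch that overrides a task's metrics with configured versions of these built-in metrics:

Task(
    dataset=dataset,
    solver=[system_message(SYSTEM_MESSAGE), multiple_choice()],
    scorer=choice(),
    metrics=[accuracy(), stderr(), bootstrap_stderr(num_samples=2000)]
)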
Clustered Standard Errors
The stderr() metric supports computing clustered standard errors via the cluster
parameter. Most scorers already include stderr() as a built-in metric, so to compute clustered standard errors you’ll want to specify custom metrics
for your task (which will override the scorer’s built in metrics).
For example, let’s say you wanted to cluster on a “category” variable defined in Sample metadata:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        metrics=[accuracy(), stderr(cluster="category")]
    )
The metrics
passed to the Task override the default metrics of the choice() scorer.
Custom Metrics
You can also add your own metrics with @metric
decorated functions. For example, here is the implementation of the mean metric:
import numpy as np

from inspect_ai.scorer import Metric, SampleScore, metric

@metric
def mean() -> Metric:
    """Compute mean of all scores.

    Returns:
       mean metric
    """

    def metric(scores: list[SampleScore]) -> float:
        return np.mean([score.score.as_float() for score in scores]).item()

    return metric
Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float()
accessor).
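As a further illustration, here is a hedged sketch of a custom metric (not built into Inspect) that uses the same accessor to compute the proportion of scores at or above a threshold:

from inspect_ai.scorer import Metric, SampleScore, metric

@metric
def prop_above(threshold: float = 0.5) -> Metric:
    """Proportion of scores with a float value at or above threshold."""
    def metric(scores: list[SampleScore]) -> float:
        if not scores:
            return 0.0
        above = [s for s in scores if s.score.as_float() >= threshold]
        return len(above) / len(scores)
    return metric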
Reducing Epochs
If a task is run over more than one epoch
, multiple scores will be generated for each sample. These scores are then reduced to a single score representing the score for the sample across all the epochs.
By default, this is done by taking the mean of all sample scores, but you may specify other strategies for reducing the samples by passing an Epochs, which includes both a count and one or more reducers to combine sample scores with. For example:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        epochs=Epochs(5, "mode"),
    )
You may also specify more than one reducer which will compute metrics using each of the reducers. For example:
@task
def gpqa():
    return Task(
        ...,
        epochs=Epochs(5, ["at_least_2", "at_least_5"]),
    )
Built-in Reducers
Inspect includes several built in reducers which are summarised below.
| Reducer | Description |
|---|---|
| mean | Reduce to the average of all scores. |
| median | Reduce to the median of all scores. |
| mode | Reduce to the most common score. |
| max | Reduce to the maximum of all scores. |
| pass_at_{k} | Probability of at least 1 correct sample given k epochs (https://arxiv.org/pdf/2107.03374). |
| at_least_{k} | 1 if at least k samples are correct, else 0. |
The built in reducers will compute a reduced value
for the score and populate the fields answer
and explanation
only if their value is equal across all epochs. The metadata
field will always be reduced to the value of metadata
in the first epoch. If your custom metrics function needs differing behavior for reducing fields, you should also implement your own custom reducer and merge or preserve fields in some way.
Custom Reducers
You can also add your own reducer with @score_reducer
decorated functions. Here’s a somewhat simplified version of the code for the mean
reducer:
import statistics
from inspect_ai.scorer import (
Score, ScoreReducer, score_reducer, value_to_float
)
@score_reducer(name="mean")
def mean_score() -> ScoreReducer:
    to_float = value_to_float()

    def reduce(scores: list[Score]) -> Score:
        """Compute a mean value of all scores."""
        values = [to_float(score.value) for score in scores]
        mean_value = statistics.mean(values)

        return Score(value=mean_value)

    return reduce
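If you need different behaviour for the answer, explanation, or metadata fields, here is a hedged sketch of a reducer (the name is hypothetical) that averages values while carrying forward those fields from the first epoch:

import statistics

from inspect_ai.scorer import Score, ScoreReducer, score_reducer, value_to_float

@score_reducer(name="mean_preserve")  # hypothetical reducer name
def mean_preserve_score() -> ScoreReducer:
    to_float = value_to_float()

    def reduce(scores: list[Score]) -> Score:
        """Mean of all scores, preserving fields from the first epoch."""
        mean_value = statistics.mean([to_float(score.value) for score in scores])
        first = scores[0]
        return Score(
            value=mean_value,
            answer=first.answer,
            explanation=first.explanation,
            metadata=first.metadata
        )

    return reduce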
Workflow
Unscored Evals
By default, model output in evaluations is automatically scored. However, you can defer scoring by using the --no-score
option. For example:
inspect eval popularity.py --model openai/gpt-4 --no-score
This will produce a log with samples that have not yet been scored and with no evaluation metrics.
Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.
Score Command
You can score an evaluation previously run this way using the inspect score
command:
# score an unscored eval
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval
This will use the scorers and metrics that were declared when the evaluation was run, applying them to score each sample and generate metrics for the evaluation.
You may choose to use a different scorer than the task scorer to score a log file. In this case, you can use the --scorer
option to pass the name of a scorer (including one in a package) or the path to a source code file containing a scorer to use. For example:
# use built in match scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match
# use scorer in a package
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer scorertools/custom_scorer
# use scorer in a file
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py
# use a custom scorer named 'classify' in a file with more than one scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorers.py@classify
If you need to pass arguments to the scorer, you can do so using scorer args (-S) like so:
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer match -S location=end
Overwriting Logs
When you use the inspect score
command, you will be prompted whether you'd like to overwrite the existing log file (with the scores added) or create a new scored log file. By default, the command will create a new log file with a -scored
suffix to distinguish it from the original file. You may also control this using the --overwrite
flag as follows:
# overwrite the log with scores from the task defined scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --overwrite
Overwriting Scores
When rescoring a previously scored log file you have two options:
- Append Mode (Default): The new scores will be added alongside the existing scores in the log file, keeping both the old and new results.
- Overwrite Mode: The new scores will replace the existing scores in the log file, removing the old results.
You can choose which mode to use based on whether you want to preserve or discard the previous scoring data. To control this, use the --action
arg:
# append scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action append
# overwrite scores with new scores from custom scorer
inspect score ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.eval --scorer custom_scorer.py --action overwrite
Score Function
You can also use the score() function in your Python code to score evaluation logs. For example, if you are exploring the performance of different scorers, you might find it more useful to call the score() function using varying scorers or scorer options. For example:
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)
You can also use this function to score an existing log file (appending or overwriting results) like so:
# read the log
input_log_path = "./logs/2025-02-11T15-17-00-05-00_popularity_dPiJifoWeEQBrfWsAopzWr.eval"
log = read_eval_log(input_log_path)

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

# perform the scoring using various models
scoring_logs = [score(log, model_graded_qa(model=model), action="append")
                for model in grader_models]

# write log files with the model name as a suffix
for model, scored_log in zip(grader_models, scoring_logs):
    base, ext = os.path.splitext(input_log_path)
    output_file = f"{base}_{model.replace('/', '_')}{ext}"
    write_eval_log(scored_log, output_file)