Scorers
Overview
Scorers evaluate whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. Scorers generally take one of the following forms:
- Extracting a specific answer out of a model’s completion output using a variety of heuristics.
- Applying a text similarity algorithm to see if the model’s completion is close to what is set out in the target.
- Using another model to assess whether the model’s completion satisfies a description of the ideal answer in target.
- Using another rubric entirely (e.g. did the model produce a valid version of a file format, etc.)
Scorers also define one or more metrics which are used to aggregate scores (e.g. accuracy() which computes what percentage of scores are correct, or mean() which provides an average for scores that exist on a continuum).
Built-In Scorers
Inspect includes some simple text matching scorers as well as a couple of model graded scorers. Built in scorers can be imported from the inspect_ai.scorer
module. Below is a summary of these scorers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.
includes()
Determine whether the target from the Sample appears anywhere inside the model output. Can be case sensitive or insensitive (defaults to the latter).
match()
Determine whether the target from the Sample appears at the beginning or end of model output (defaults to looking at the end). Has options for ignoring case, white-space, and punctuation (all are ignored by default).
pattern()
Extract the answer from model output using a regular expression.
answer()
Scorer for model output that precedes answers with “ANSWER:”. Can extract letters, words, or the remainder of the line.
exact()
Scorer which will normalize the text of the answer and target(s) and perform an exact matching comparison of the text. This scorer will return CORRECT when the answer is an exact match to one or more targets.
f1()
Scorer which computes the F1 score for the answer (which balances recall and precision by taking the harmonic mean of the two).
model_graded_qa()
Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.
model_graded_fact()
Have another model assess whether the model output contains a fact that is set out in target. This is a more narrow assessment than model_graded_qa(), and is used when model output is too complex to be assessed using a simple match() or pattern() scorer.
choices()
Specialised scorer that is used with the multiple_choice() solver.
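To make the harmonic mean behind f1() concrete, here is a minimal sketch of a token-level F1 computation. It assumes whitespace tokenisation and lower-casing, which may differ from the normalisation Inspect actually applies; token_f1 is a hypothetical helper, not part of inspect_ai.

```python
from collections import Counter


def token_f1(answer: str, target: str) -> float:
    """Token-level F1 between an answer and a target (illustrative only)."""
    answer_tokens = answer.lower().split()
    target_tokens = target.lower().split()
    if not answer_tokens or not target_tokens:
        # Both empty counts as a match; only one empty counts as a miss.
        return float(answer_tokens == target_tokens)

    # Count tokens shared by both sides (multiset intersection).
    overlap = sum((Counter(answer_tokens) & Counter(target_tokens)).values())
    if overlap == 0:
        return 0.0

    precision = overlap / len(answer_tokens)  # share of answer tokens that are right
    recall = overlap / len(target_tokens)     # share of target tokens that were found
    return 2 * precision * recall / (precision + recall)


print(token_f1("the cat sat", "the cat sat down"))
```

In this example every answer token appears in the target (precision 1.0) but one target token is missing (recall 0.75), so the F1 score falls between the two.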
Scorers provide one or more built-in metrics (each of the scorers above provides accuracy and stderr as a metric). You can also provide your own custom metrics in Task definitions. For example:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)
The current development version of Inspect replaces the use of the bootstrap_std metric with stderr for the built in scorers enumerated above.
Since eval scores are means of numbers having finite variance, we can compute standard errors using the Central Limit Theorem rather than bootstrapping. Bootstrapping is generally useful in contexts with more complex structure or non-mean summary statistics (e.g. quantiles). You will notice that the bootstrap numbers will come in quite close to the analytic numbers, since they are estimating the same thing.
A common misunderstanding is that “t-tests require the underlying data to be normally distributed”. This is only true for small-sample problems; for large sample problems (say 30 or more questions), you just need finite variance in the underlying data, and the CLT ensures that the mean is approximately normally distributed.
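The relationship between the analytic standard error and the bootstrap estimate can be checked with a small stdlib-only sketch. The score list is hypothetical, and the functions below are illustrative stand-ins rather than Inspect's stderr() and bootstrap_std() implementations; in particular, using the sample standard deviation here is an assumption.

```python
import random
import statistics


def analytic_stderr(scores: list[float]) -> float:
    """Standard error of the mean: sample standard deviation / sqrt(n)."""
    return statistics.stdev(scores) / len(scores) ** 0.5


def bootstrap_std(scores: list[float], num_samples: int = 1000, seed: int = 0) -> float:
    """Standard deviation of the means of resampled score lists."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.choices(scores, k=len(scores)))  # resample with replacement
        for _ in range(num_samples)
    ]
    return statistics.stdev(means)


# 200 hypothetical correct/incorrect scores with 70% accuracy.
scores = [1.0] * 140 + [0.0] * 60
print(f"analytic:  {analytic_stderr(scores):.4f}")
print(f"bootstrap: {bootstrap_std(scores):.4f}")
```

The two printed numbers should land close to each other, since both estimate the spread of the sample mean.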
Model Graded
Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).
Here is the declaration for the model_graded_qa() function:
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    include_history: bool | Callable[[TaskState], str] = False,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
) -> Scorer:
    ...
The default model graded QA scorer is tuned to grade answers to open ended questions. The default template and instructions ask the model to produce a grade in the format GRADE: C or GRADE: I, and this grade is extracted using the default grade_pattern regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:
- Provide alternate instructions. The default instructions ask the model to use chain of thought reasoning and provide grades in the format GRADE: C or GRADE: I. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the grade_pattern.
- Specify include_history = True to include the full chat history in the presented question (by default only the original sample input is presented). You may optionally instead pass a function that enables customising the presentation of the chat history.
- Specify partial_credit = True to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default instructions.
- Specify an alternate model to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
- Specify a different template. Note that templates are passed these variables: question, criterion, answer, and instructions.
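The role of grade_pattern is easiest to see in isolation: it is a regular expression whose first capture group holds the grade. The sketch below uses a hypothetical pattern and helper (GRADE_PATTERN and extract_grade are not Inspect names, and the pattern is not necessarily Inspect's default) to show why custom instructions usually need a matching custom pattern.

```python
import re
from typing import Optional

# Hypothetical pattern for illustration; not necessarily Inspect's default.
GRADE_PATTERN = r"GRADE:\s*([CPI])"


def extract_grade(grader_output: str, pattern: str = GRADE_PATTERN) -> Optional[str]:
    """Return the first capture group of the pattern, or None if it is absent."""
    match = re.search(pattern, grader_output)
    return match.group(1) if match else None


# The default-style format is picked up by the default-style pattern.
print(extract_grade("The answer matches the criterion.\nGRADE: C"))

# Instructions that request a different format need a different pattern.
print(extract_grade("Reasoning...\nSCORE: 7", pattern=r"SCORE:\s*(\d+)"))
```

If the grader's output does not match the pattern, the helper returns None, which is the case a scorer would need to handle explicitly.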
The model_graded_fact() scorer works identically to model_graded_qa(), and simply provides an alternate template oriented around judging whether a fact is included in the model output.
If you want to understand how the default templates for model_graded_qa() and model_graded_fact() work, see their source code.
Multiple Models
The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:
model_graded_qa(
    model=[
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)
The implementation of multiple grader models takes advantage of the multi_scorer()
and majority_vote()
functions, both of which can be used in your own scorers (as described in the Multiple Scorers section below).
Custom Scorers
Custom scorers are functions that take a TaskState and Target, and yield a Score.
async def score(state: TaskState, target: Target):
    # Compare state / model output with target
    # to yield a score
    return Score(value=...)
First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.
Note that score() above is declared as an async function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.
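As a rough illustration of why async matters for scorers that wait on external work, the sketch below runs several simulated slow grading calls concurrently with plain asyncio. It is not Inspect's scheduler; slow_grade and score_all are hypothetical stand-ins, and the sleep simulates a model or REST call.

```python
import asyncio


async def slow_grade(sample_id: int) -> str:
    """Stand-in for a scorer that waits on a model or REST call."""
    await asyncio.sleep(0.05)  # simulated network latency
    return "C" if sample_id % 2 == 0 else "I"


async def score_all(sample_ids: list[int]) -> list[str]:
    # The waits overlap, so total time is close to one call rather than the sum.
    return await asyncio.gather(*(slow_grade(i) for i in sample_ids))


grades = asyncio.run(score_all([0, 1, 2, 3]))
print(grades)
```

A scorer that blocked instead of awaiting would hold up every other sample while it waited.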
Score
The components of Score include:
Field | Type | Description |
---|---|---|
value | Value | Value assigned to the sample (e.g. “C” or “I”, or a raw numeric value). |
answer | str | Text extracted from model output for comparison (optional). |
explanation | str | Explanation of score, e.g. full model output or grader model output (optional). |
metadata | dict[str,Any] | Additional metadata about the score to record in the log file (optional). |
For example, the following are all valid Score objects:
Score(value="C")
Score(value="I")
Score(value=0.6)
Score(
    value="C" if extracted == target.text else "I",
    answer=extracted,
    explanation=state.output.completion
)
If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.
Value
Value is a union over the main scalar types as well as a list or dict of the same types:
Value = Union[
    str | int | float | bool,
    list[str | int | float | bool],
    dict[str, str | int | float | bool],
]
The vast majority of scorers will use str
(e.g. for correct/incorrect via “C” and “I”) or float
(the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value
type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).
Next, we’ll take a look at the source code for a couple of the built in scorers as a jumping off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scorer Workflow section below for tips on optimising your development process.
Example: Includes
Here is the source code for the built-in includes() scorer:
@scorer(metrics=[accuracy(), stderr()])  # (1)
def includes(ignore_case: bool = True):

    async def score(state: TaskState, target: Target):  # (2)

        # check for correct
        answer = state.output.completion
        target = target.text  # (3)
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
            value=CORRECT if correct else INCORRECT,  # (4)
            answer=answer  # (5)
        )

    return score
1. The function applies the @scorer decorator and registers two metrics for use with the scorer.
2. The score() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this scorer doesn’t call a model but others will).
3. We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
4. We use the special constants CORRECT and INCORRECT for the score value, as the accuracy(), stderr(), and bootstrap_std() metrics know how to convert these special constants to float values (1.0 and 0.0 respectively).
5. We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).
Example: Model Grading
Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:
@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:

    # resolve grading template and instructions,
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:

        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score
Note that the call to grader_model.generate() is done with await. This is critical to ensure that the scorer participates correctly in the scheduling of generation work.
Note also we use the input_text
property of the TaskState
to access a string version of the original user input to substitute it into the grading template. Using the input_text
has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages
); and (2) It normalises the input to a string (as it could have been a message list).
Multiple Scorers
There are several ways to use multiple scorers in an evaluation:
- You can provide a list of scorers in a Task definition (this is the best option when scorers are entirely independent).
- You can yield multiple scores from a Scorer (this is the best option when scores share code and/or expensive computations).
- You can use multiple scorers and then aggregate them into a single scorer (e.g. majority voting).
List of Scorers
Task definitions can specify multiple scorers. For example, the below task will use two different models to grade the results, storing two scores with each sample, one for each of the two models:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=[
        model_graded_qa(model="openai/gpt-4"),
        model_graded_qa(model="google/gemini-1.5-pro")
    ],
)
This is useful when there is more than one way to score a result and you would like to preserve the individual score values with each sample (versus reducing the multiple scores to a single value).
Scorer with Multiple Values
You may also create a scorer which yields multiple scores. This is useful when the scores use data that is shared or expensive to compute. For example:
@scorer(
    metrics={  # (1)
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
        return Score(  # (2)
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count()
)
1. The metrics for this scorer are a dictionary. This defines metrics to be applied to scores (by name).
2. The score value itself is a dictionary, with keys corresponding to the keys defined in the metrics on the @scorer decorator.
The above example will produce two scores, a_count and e_count, each of which will have metrics for mean and stderr.
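To illustrate what applying a metric "by name" means, here is a minimal stdlib-only sketch that computes a mean independently for each key across several dictionary score values. The values and the mean_by_key helper are hypothetical, and this is not how Inspect implements per-key metrics internally.

```python
import statistics

# Hypothetical score values from three samples, shaped like the
# dictionary returned by letter_count() above.
values = [
    {"a_count": 4, "e_count": 7},
    {"a_count": 2, "e_count": 5},
    {"a_count": 6, "e_count": 9},
]


def mean_by_key(values: list[dict[str, float]]) -> dict[str, float]:
    """Apply a mean independently to each key of the score dictionaries."""
    keys = values[0].keys()
    return {key: statistics.mean(v[key] for v in values) for key in keys}


print(mean_by_key(values))
```

Each key ends up with its own aggregate, which is why the eval reports separate mean and stderr values for a_count and e_count.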
Scorer with Complex Metrics
Sometimes it is useful for a scorer to compute multiple values (returning a dictionary as the score value) and to have metrics computed both for each key in the score dictionary and for the dictionary as a whole. For example:
@scorer(
    metrics=[{  # (1)
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }, total_count()]
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
        return Score(  # (2)
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

@metric
def total_count() -> Metric:
    def metric(scores: list[Score]) -> int | float:
        total = 0.0
        for score in scores:
            # accumulate both counts across all scores
            total += score.value["a_count"] + score.value["e_count"]  # (3)
        return total

    return metric

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count()
)
1. The metrics for this scorer are a list. One element is a dictionary, which defines metrics to be applied to scores (by name); the other element is a Metric which will receive the entire score dictionary.
2. The score value itself is a dictionary, with keys corresponding to the keys defined in the metrics on the @scorer decorator.
3. The total_count metric will compute a metric based upon the entire score dictionary (since it isn’t being mapped onto the dictionary by key).
Reducing Multiple Scores
It’s possible to use multiple scorers in parallel, then reduce their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:
multi_scorer(
    scorers=[model_graded_qa(model=model) for model in models],
    reducer="mode"
)
Use of multi_scorer()
requires both a list of scorers as well as a reducer which determines how a list of scores will be turned into a single score. In this case we use the “mode” reducer which returns the score that appeared most frequently in the answers.
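The idea behind a “mode” reduction can be sketched in a few lines of plain Python. This is an illustration only: mode_reduce is a hypothetical helper, and its tie-breaking rule (first value seen wins) is an assumption that may not match Inspect's reducer.

```python
from collections import Counter


def mode_reduce(values: list[str]) -> str:
    """Return the most common value; ties go to the value seen first."""
    return Counter(values).most_common(1)[0][0]


# Hypothetical grades from three grader models for one sample.
print(mode_reduce(["C", "I", "C"]))
```

With an odd number of graders and two possible grades, a tie cannot occur, which is one reason to use three or five grader models.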
Sandbox Access
If your Solver is an Agent with tool use, you might want to inspect the contents of the tool sandbox to score the task.
The contents of the sandbox for the Sample are available to the scorer; simply call await sandbox().read_file() (or .exec()).
For example:
from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Plan, TaskState, generate, use_tools
from inspect_ai.tool import bash
from inspect_ai.util import sandbox

@scorer(metrics=[accuracy()])
def check_file_exists():
    async def score(state: TaskState, target: Target):
        try:
            # reading succeeds only if the file exists in the sandbox
            _ = await sandbox().read_file(target.text)
            exists = True
        except FileNotFoundError:
            exists = False
        return Score(value=1 if exists else 0)

    return score

@task
def challenge() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Create a file called hello-world.txt",
                target="hello-world.txt",
            )
        ],
        solver=[use_tools([bash()]), generate()],
        sandbox="local",
        scorer=check_file_exists(),
    )
Scoring Metrics
Each scorer provides one or more built-in metrics (typically accuracy and stderr) corresponding to the most typically useful metrics for that scorer.
You can override a scorer’s built-in metrics by passing an alternate list of metrics to the Task. For example:
Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=choice(),
    metrics=[custom_metric()]
)
If you still want to compute the built-in metrics, re-specify them along with the custom metrics:
metrics=[accuracy(), stderr(), custom_metric()]
Built-In Metrics
Inspect includes some simple built in metrics for calculating accuracy, mean, etc. Built in metrics can be imported from the inspect_ai.scorer
module. Below is a summary of these metrics. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.
accuracy()
Compute proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.
mean()
Mean of all scores.
var()
Variance over all scores.
std()
Sample standard deviation of all scores.
stderr()
Standard error of the mean.
bootstrap_std()
Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option).
Custom Metrics
You can also add your own metrics with @metric decorated functions. For example, here is the implementation of the variance metric:
import numpy as np
from inspect_ai.scorer import Metric, Score, metric

@metric
def var() -> Metric:
    """Compute variance over all scores."""

    def metric(scores: list[Score]) -> float:
        return np.var([score.as_float() for score in scores]).item()

    return metric
Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float() accessor).
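As a rough picture of what converting a Value to a float involves, the sketch below maps letter grades and plain scalars to numbers and then averages them. It is a stand-alone illustration, not the implementation of score.as_float(): to_float and accuracy here are hypothetical helpers, and the "P" grade for partial credit is an assumption based on the 0.5 partial-credit value mentioned earlier.

```python
from typing import Union

Scalar = Union[str, int, float, bool]

# Assumed letter grades: correct, partial, incorrect.
LETTER_VALUES = {"C": 1.0, "P": 0.5, "I": 0.0}


def to_float(value: Scalar) -> float:
    """Convert a scalar score value to a float (illustrative only)."""
    if isinstance(value, str):
        if value in LETTER_VALUES:
            return LETTER_VALUES[value]
        return float(value)  # e.g. "0.75"; raises ValueError for other text
    return float(value)  # covers int, float, and bool


def accuracy(values: list[Scalar]) -> float:
    """Mean of the converted values."""
    return sum(to_float(v) for v in values) / len(values)


print(accuracy(["C", "I", "P", "C"]))
```

A metric written against floats like this can aggregate string grades and numeric scores with the same code path.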
Reducing Epochs
If a task is run over more than one epoch, multiple scores will be generated for each sample. These scores are then reduced to a single score representing the score for the sample across all the epochs.
By default, this is done by taking the mean of all sample scores, but you may specify other strategies for reducing the samples by passing an Epochs object, which includes both a count and one or more reducers to combine sample scores with. For example:
@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        epochs=Epochs(5, "mode"),
    )
You may also specify more than one reducer which will compute metrics using each of the reducers. For example:
@task
def gpqa():
    return Task(
        ...
        epochs=Epochs(5, ["at_least_2", "at_least_5"]),
    )
Built-in Reducers
Inspect includes several built in reducers which are summarised below.
Reducer | Description |
---|---|
mean | Reduce to the average of all scores. |
median | Reduce to the median of all scores. |
mode | Reduce to the most common score. |
max | Reduce to the maximum of all scores. |
pass_at_{k} | Probability of at least 1 correct sample given k epochs (https://arxiv.org/pdf/2107.03374) |
at_least_{k} | 1 if at least k samples are correct, else 0. |
The built in reducers will compute a reduced value for the score and populate the fields answer and explanation only if their value is equal across all epochs. The metadata field will always be reduced to the value of metadata in the first epoch. If your custom metrics function needs differing behavior for reducing fields, you should also implement your own custom reducer and merge or preserve fields in some way.
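For intuition about the pass_at_{k} and at_least_{k} reducers, the sketch below computes both from a count of correct epochs. The pass@k formula is the unbiased estimator from the paper linked in the table; whether Inspect's reducer uses exactly this form is an assumption, and the function names and arguments are hypothetical.

```python
from math import comb


def pass_at_k(num_epochs: int, num_correct: int, k: int) -> float:
    """Unbiased pass@k estimate: 1 - C(n - c, k) / C(n, k)."""
    if num_epochs - num_correct < k:
        return 1.0  # every draw of k epochs must contain a correct one
    return 1.0 - comb(num_epochs - num_correct, k) / comb(num_epochs, k)


def at_least_k(num_correct: int, k: int) -> int:
    """1 if at least k epochs were correct, else 0."""
    return 1 if num_correct >= k else 0


# Hypothetical sample: 5 epochs, 2 of them correct.
print(pass_at_k(num_epochs=5, num_correct=2, k=1))
print(at_least_k(num_correct=2, k=2))
```

The first reducer rewards a sample for being solvable at all within k attempts, while the second demands consistency across epochs.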
Custom Reducers
You can also add your own reducer with @score_reducer decorated functions. Here’s a somewhat simplified version of the code for the mean reducer:
import statistics
from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="mean")
def mean_score() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        """Compute a mean value of all scores."""
        values = [float(score.value) for score in scores]
        mean_value = statistics.mean(values)

        return Score(value=mean_value)

    return reduce
Workflow
Score Command
By default, model output in evaluations is automatically scored. However, you can separate generation and scoring by using the --no-score option. For example:
inspect eval popularity.py --model openai/gpt-4 --no-score
You can score an evaluation previously run this way using the inspect score command:
# score last eval
inspect score popularity.py
# score specific log file
inspect score popularity.py ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.json
Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.
Log Overwriting
By default, inspect score overwrites the file it scores. If you don’t want to overwrite target files, pass the --no-overwrite flag:
inspect score popularity.py --no-overwrite
When specifying --no-overwrite, a -scored suffix will be added to the original log file name:
./logs/2024-02-23_task_gpt-4_TUhnCn473c6-scored.json
Note that the --no-overwrite flag does not apply to log files that already have the -scored suffix; those files are always overwritten by inspect score. If you plan on scoring multiple times and you want to save each scoring output, you will want to copy the log to another location before re-scoring.
Python API
If you are exploring the performance of different scorers, you might find it more useful to call the score() function using varying scorers or scorer options. For example:
log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

scoring_logs = [score(log, model_graded_qa(model=model))
                for model in grader_models]

plot_results(scoring_logs)