Scorers

Overview

Scorers evaluate whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. Scorers generally take one of the following forms:

  1. Extracting a specific answer out of a model’s completion output using a variety of heuristics.

  2. Applying a text similarity algorithm to see if the model’s completion is close to what is set out in the target.

  3. Using another model to assess whether the model’s completion satisfies a description of the ideal answer in target.

  4. Using another rubric entirely (e.g. did the model produce a valid version of a file format, etc.)

Scorers also define one or more metrics which are used to aggregate scores (e.g. accuracy() which computes what percentage of scores are correct, or mean() which provides an average for scores that exist on a continuum).

Built-In Scorers

Inspect includes some simple text matching scorers as well as a couple of model graded scorers. Built in scorers can be imported from the inspect_ai.scorer module. Below is a summary of these scorers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

  • includes()

    Determine whether the target from the Sample appears anywhere inside the model output. Can be case sensitive or insensitive (defaults to the latter).

  • match()

    Determine whether the target from the Sample appears at the beginning or end of model output (defaults to looking at the end). Has options for ignoring case, white-space, and punctuation (all are ignored by default).

  • pattern()

    Extract the answer from model output using a regular expression.

  • answer()

    Scorer for model output that preceded answers with “ANSWER:”. Can extract letters, words, or the remainder of the line.

  • exact()

    Scorer which will normalize the text of the answer and target(s) and perform an exact matching comparison of the text. This scorer will return CORRECT when the answer is an exact match to one or more targets.

  • f1()

    Scorer which computes the F1 score for the answer (which balances recall precision by taking the harmonic mean between recall and precision).

  • model_graded_qa()

    Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.

  • model_graded_fact()

    Have another model assess whether the model output contains a fact that is set out in target. This is a more narrow assessment than model_graded_qa(), and is used when model output is too complex to be assessed using a simple match() or pattern() scorer.

  • choices()

    Specialised scorer that is used with the multiple_choice() solver.

Scorers provide one or more built-in metrics (each of the scorers above provides accuracy and stderr as a metric). You can also provide your own custom metrics in Task definitions. For example:

Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)
Note

The current development version of Inspect replaces the use of the bootstrap_stderr metric with stderr for the built in scorers enumerated above.

Since eval scores are means of numbers having finite variance, we can compute standard errors using the Central Limit Theorem rather than bootstrapping. Bootstrapping is generally useful in contexts with more complex structure or non-mean summary statistics (e.g. quantiles). You will notice that the bootstrap numbers will come in quite close to the analytic numbers, since they are estimating the same thing.

A common misunderstanding is that “t-tests require the underlying data to be normally distributed”. This is only true for small-sample problems; for large sample problems (say 30 or more questions), you just need finite variance in the underlying data and the CLT guarantees a normally distributed mean value.

Model Graded

Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).

Here is the declaration for the model_graded_qa() function:

@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    include_history: bool | Callable[[TaskState], str] = False,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
) -> Scorer:
    ...

The default model graded QA scorer is tuned to grade answers to open ended questions. The default template and instructions ask the model to produce a grade in the format GRADE: C or GRADE: I, and this grade is extracted using the default grade_pattern regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:

  1. Provide alternate instructions—the default instructions ask the model to use chain of thought reasoning and provide grades in the format GRADE: C or GRADE: I. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the grade_pattern.
  2. Specify include_history = True to include the full chat history in the presented question (by default only the original sample input is presented). You may optionally instead pass a function that enables customising the presentation of the chat history.
  3. Specify partial_credit = True to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default instructions.
  4. Specify an alternate model to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
  5. Specify a different template—note that templates are passed these variables: question, criterion, answer, and instructions.

The model_graded_fact() scorer works identically to model_graded_qa(), and simply provides an alternate template oriented around judging whether a fact is included in the model output.

If you want to understand how the default templates for model_graded_qa() and model_graded_fact() work, see their source code.

Multiple Models

The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:

model_graded_qa(
    model = [
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229" 
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)

The implementation of multiple grader models takes advantage of the multi_scorer() and majority_vote() functions, both of which can be used in your own scorers (as described in the Multiple Scorers section below).

Custom Scorers

Custom scorers are functions that take a TaskState and Target, and yield a Score.

async def score(state: TaskState, target: Target):
     # Compare state / model output with target
     # to yield a score
     return Score(value=...)

First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.

Note that score() above is declared as an async function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.

Score

The components of Score include:

Field Type Description
value Value Value assigned to the sample (e.g. “C” or “I”, or a raw numeric value).
answer str Text extracted from model output for comparison (optional).
explanation str Explanation of score, e.g. full model output or grader model output (optional).
metadata dict[str,Any] Additional metadata about the score to record in the log file (optional).

For example, the following are all valid Score objects:

Score(value="C")
Score(value="I")
Score(value=0.6)
Score(
    value="C" if extracted == target.text else "I", 
    answer=extracted, 
    explanation=state.output.completion
)

If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.

Value

Value is union over the main scalar types as well as a list or dict of the same types:

Value = Union[
    str | int | float | bool,
    list[str | int | float | bool],
    dict[str, str | int | float | bool],
]

The vast majority of scorers will use str (e.g. for correct/incorrect via “C” and “I”) or float (the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).

Next, we’ll take a look at the source code for a couple of the built in scorers as a jumping off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scorer Workflow section below for tips on optimising your development process.

Example: Includes

Here is the source code for the built-in includes() scorer:

1@scorer(metrics=[accuracy(), stderr()])
def includes(ignore_case: bool = True):

2    async def score(state: TaskState, target: Target):

        # check for correct
        answer = state.output.completion
3        target = target.text
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
4            value = CORRECT if correct else INCORRECT,
5            answer=answer
        )

    return score
1
The function applies the @scorer decorator and registers two metrics for use with the scorer.
2
The score() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this scorer doesn’t call a model but others will).
3
We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
4
We use the special constants CORRECT and INCORRECT for the score value (as the accuracy(), stderr(), and bootstrap_stderr() metrics know how to convert these special constants to float values (1.0 and 0.0 respectively).
5
We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).

Example: Model Grading

Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:


@scorer(metrics=[accuracy(), stderr()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:
   
    # resolve grading template and instructions, 
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:
        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score

Note that the call to model_grader.generate() is done with await—this is critical to ensure that the scorer participates correctly in the scheduling of generation work.

Note also we use the input_text property of the TaskState to access a string version of the original user input to substitute it into the grading template. Using the input_text has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages); and (2) It normalises the input to a string (as it could have been a message list).

Multiple Scorers

There are several ways to use multiple scorers in an evaluation:

  1. You can provide a list of scorers in a Task definition (this is the best option when scorers are entirely independent)
  2. You can yield multiple scores from a Scorer (this is the best option when scores share code and/or expensive computations).
  3. You can use multiple scorers and then aggregate them into a single scorer (e.g. majority voting).

List of Scorers

Task definitions can specify multiple scorers. For example, the below task will use two different models to grade the results, storing two scores with each sample, one for each of the two models:

Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=[
        model_graded_qa(model="openai/gpt-4"), 
        model_graded_qa(model="google/gemini-1.5-pro")
    ],
)

This is useful when there is more than one way to score a result and you would like preserve the individual score values with each sample (versus reducing the multiple scores to a single value).

Scorer with Multiple Values

You may also create a scorer which yields multiple scores. This is useful when the scores use data that is shared or expensive to compute. For example:

@scorer(
1    metrics={
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
2        return Score(
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story."],
    scorer=letter_count()
)
1
The metrics for this scorer are a dictionary—this defines metrics to be applied to scores (by name).
2
The score value itself is a dictionary—the keys corresponding to the keys defined in the metrics on the @scorer decorator.

The above example will produce two scores, a_count and e_count, each of which will have metrics for mean and stderr.

When working with complex score values and metrics, you may use globs as keys for mapping metrics to scores. For example, a more succinct way to write the previous example:

@scorer(
    metrics={
        "*": [mean(), stderr()], 
    }
)

Glob keys will each be resolved and a complete list of matching metrics will be applied to each score key. For example to compute mean for all score keys, and only compute stderr for e_count you could write:

@scorer(
    metrics={
        "*": [mean()], 
        "e_count": [stderr()]
    }
)

Scorer with Complex Metrics

Sometime, it is useful for a scorer to compute multiple values (returning a dictionary as the score value) and to have metrics computed both for each key in the score dictionary, but also for the dictionary as a whole. For example:

@scorer(
1    metrics=[{
        "a_count": [mean(), stderr()],
        "e_count": [mean(), stderr()]
    }, total_count()]
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
2        return Score(
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

@metric
def total_count() -> Metric:
    def metric(scores: list[Score]) -> int | float:
        total = 0.0
        for score in scores:
3            total = score.value["a_count"]
                + score.value["e_count"]
        return total
    return metric

task = Task(
    dataset=[Sample(input="Tell me a story."],
    scorer=letter_count()
)
1
The metrics for this scorer are a list, one element is a dictionary—this defines metrics to be applied to scores (by name), the other element is a Metric which will receive the entire score dictionary.
2
The score value itself is a dictionary—the keys corresponding to the keys defined in the metrics on the @scorer decorator.
3
The total_count metric will compute a metric based upon the entire score dictionary (since it isn’t being mapped onto the dictionary by key)

Reducing Multiple Scores

It’s possible to use multiple scorers in parallel, then reduce their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:

multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = "mode"
)

Use of multi_scorer() requires both a list of scorers as well as a reducer which determines how a list of scores will be turned into a single score. In this case we use the “mode” reducer which returns the score that appeared most frequently in the answers.

Sandbox Access

If your Solver is an Agent with tool use, you might want to inspect the contents of the tool sandbox to score the task.

The contents of the sandbox for the Sample are available to the scorer; simply call await sandbox().read_file() (or .exec()).

For example:

from inspect_ai import Task, task
from inspect_ai.dataset import Sample
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import Plan, TaskState, generate, use_tools
from inspect_ai.tool import bash
from inspect_ai.util import sandbox


@scorer(metrics=[accuracy()])
def check_file_exists():
    async def score(state: TaskState, target: Target):
        try:
            _ = await sandbox().read_file(target.text)
            exists = True
        except FileNotFoundError:
            exists = False
        return Score(value=1 if exists else 0)

    return score


@task
def challenge() -> Task:
    return Task(
        dataset=[
            Sample(
                input="Create a file called hello-world.txt",
                target="hello-world.txt",
            )
        ],
        solver=[use_tools([bash()]), generate()],
        sandbox="local",
        scorer=check_file_exists(),
    )

Scoring Metrics

Each scorer provides one or more built-in metrics (typically accuracy and stderr) corresponding to the most typically useful metrics for that scorer.

You can override scorer’s built-in metrics by passing an alternate list of metrics to the Task. For example:

Task(
    dataset=dataset,
    solver=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=choice(),
    metrics=[custom_metric()]
)

If you still want to compute the built-in metrics, we re-specify them along with the custom metrics:

metrics=[accuracy(), stderr(), custom_metric()]

Built-In Metrics

Inspect includes some simple built in metrics for calculating accuracy, mean, etc. Built in metrics can be imported from the inspect_ai.scorer module. Below is a summary of these metrics. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

  • accuracy()

    Compute proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.

  • mean()

    Mean of all scores.

  • var()

    Variance over all scores.

  • std()

    Sample standard deviation of all scores.

  • stderr()

    Standard error of the mean.

  • bootstrap_stderr()

    Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option).

Custom Metrics

You can also add your own metrics with @metric decorated functions. For example, here is the implementation of the variance metric:

import numpy as np

from inspect_ai.scorer import Metric, Score, metric

@metric
def var() -> Metric:
    """Compute variance over all scores."""

    def metric(scores: list[Score]) -> float:
        return np.var([score.as_float() for score in scores]).item()

    return metric

Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float() accessor).

Reducing Epochs

If a task is run over more than one epoch, multiple scores will be generated for each sample. These scores are then reduced to a single score representing the score for the sample across all the epochs.

By default, this is done by taking the mean of all sample scores, but you may specify other strategies for reducing the samples by passing an Epochs, which includes both a count and one or more reducers to combine sample scores with. For example:

@task
def gpqa():
    return Task(
        dataset=read_gpqa_dataset("gpqa_main.csv"),
        solver=[
            system_message(SYSTEM_MESSAGE),
            multiple_choice(),
        ],
        scorer=choice(),
        epochs=Epochs(5, "mode"),
    )

You may also specify more than one reducer which will compute metrics using each of the reducers. For example:

@task
def gpqa():
    return Task(
        ...
        epochs=Epochs(5, ["at_least_2", "at_least_5"]),
    )

Built-in Reducers

Inspect includes several built in reducers which are summarised below.

Reducer Description
mean Reduce to the average of all scores.
median Reduce to the median of all scores
mode Reduce to the most common score.
max Reduce to the maximum of all scores.
pass_at_{k} Probability of at least 1 correct sample given k epochs (https://arxiv.org/pdf/2107.03374)
at_least_{k} 1 if at least k samples are correct, else 0.
Note

The built in reducers will compute a reduced value for the score and populate the fields answer and explanation only if their value is equal across all epochs. The metadata field will always be reduced to the value of metadata in the first epoch. If your custom metrics function needs differing behavior for reducing fields, you should also implement your own custom reducer and merge or preserve fields in some way.

Custom Reducers

You can also add your own reducer with @score_reducer decorated functions. Here’s a somewhat simplified version of the code for the mean reducer:

import statistics

from inspect_ai.scorer import Score, ScoreReducer, score_reducer

@score_reducer(name="mean")
def mean_score() -> ScoreReducer:
    def reduce(scores: list[Score]) -> Score:
        """Compute a mean value of all scores."""

        values = [float(score.value) for score in scores]
        mean_value = statistics.mean(values)

        return Score(value=mean_value)

    return reduce

Workflow

Score Command

By default, model output in evaluations is automatically scored. However, you can separate generation and scoring by using the --no-score option. For example:

inspect eval popularity.py --model openai/gpt-4 --no-score

You can score an evaluation previously run this way using the inspect score command:

# score last eval
inspect score popularity.py

# score specific log file
inspect score popularity.py ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.json
Tip

Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.

Log Overwriting

By default, inspect score overwrites the file it scores. If don’t want to overwrite target files, pass the --no-overwrite flag:

inspect score popularity.py --no-overwrite

When specifying --no-overwrite, a -scored suffix will be added to the original log file name:

./logs/2024-02-23_task_gpt-4_TUhnCn473c6-scored.json

Note that the --no-overwrite flag does not apply to log files that already have the -scored suffix—those files are always overwritten by inspect score. If you plan on scoring multiple times and you want to save each scoring output, you will want to copy the log to another location before re-scoring.

Python API

If you are exploring the performance of different scorers, you might find it more useful to call the score() function using varying scorers or scorer options. For example:

log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

scoring_logs = [score(log, model_graded_qa(model=model)) 
                for model in grader_models]

plot_results(scoring_logs)