Scorers

Overview

Scorers evaluate whether solvers were successful in finding the right output for the target defined in the dataset, and in what measure. Scorers generally take one of the following forms:

  1. Extracting a specific answer out of a model’s completion output using a variety of heuristics.

  2. Applying a text similarity algorithm to see if the model’s completion is close to what is set out in the target.

  3. Using another model to assess whether the model’s completion satisfies a description of the ideal answer in target.

  4. Using another rubric entirely (e.g. did the model produce a valid version of a file format, etc.)

Scorers also define one or more metrics which are used to aggregate scores (e.g. accuracy() which computes what percentage of scores are correct, or mean() which provides an average for scores that exist on a continuum).

Built-In Scorers

Inspect includes some simple text matching scorers as well as a couple of model graded scorers. Built in scorers can be imported from the inspect_ai.scorer module. Below is a summary of these scorers. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

  • includes()

    Determine whether the target from the Sample appears anywhere inside the model output. Can be case sensitive or insensitive (defaults to the latter).

  • match()

    Determine whether the target from the Sample appears at the beginning or end of model output (defaults to looking at the end). Has options for ignoring case, white-space, and punctuation (all are ignored by default).

  • pattern()

    Extract the answer from model output using a regular expression.

  • answer()

    Scorer for model output that precedes answers with “ANSWER:”. Can extract letters, words, or the remainder of the line.

  • model_graded_qa()

    Have another model assess whether the model output is a correct answer based on the grading guidance contained in target. Has a built-in template that can be customised.

  • model_graded_fact()

    Have another model assess whether the model output contains a fact that is set out in target. This is a more narrow assessment than model_graded_qa(), and is used when model output is too complex to be assessed using a simple match() or pattern() scorer.
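
To make the text matching behaviour described above more concrete, here is a minimal sketch written with the standard library only. It is not Inspect's implementation: the helper names (normalise, includes_like, match_like, extract_answer_line) and the exact normalisation rules are assumptions chosen for illustration.

```python
import re
import string
from typing import Optional


def normalise(text: str) -> str:
    """Lower-case, strip punctuation, and collapse white-space.

    This is one plausible reading of "ignoring case, white-space,
    and punctuation"; the built-in scorer may differ in detail.
    """
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    return " ".join(text.split())


def includes_like(output: str, target: str, ignore_case: bool = True) -> bool:
    """True if the target appears anywhere inside the output."""
    if ignore_case:
        return target.lower() in output.lower()
    return target in output


def match_like(output: str, target: str, location: str = "end") -> bool:
    """True if the normalised target sits at the beginning or end of the output."""
    out, tgt = normalise(output), normalise(target)
    return out.startswith(tgt) if location == "begin" else out.endswith(tgt)


def extract_answer_line(output: str) -> Optional[str]:
    """Return the remainder of the first line that contains "ANSWER:"."""
    found = re.search(r"ANSWER:[ \t]*(.+)$", output, flags=re.MULTILINE)
    return found.group(1).strip() if found else None
```

The built-in scorers expose these choices as options, so in practice you would configure includes(), match(), or answer() rather than write helpers like these.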

Scorers provide one or more built-in metrics (each of the scorers above provides accuracy as a metric). You can also provide your own custom metrics in Task definitions. For example:

Task(
    dataset=dataset,
    plan=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)

Model Graded

Model graded scorers are well suited to assessing open ended answers as well as factual answers that are embedded in a longer narrative. The built-in model graded scorers can be customised in several ways—you can also create entirely new model scorers (see the model graded example below for a starting point).

Here is the declaration for the model_graded_qa() function:

@scorer(metrics=[accuracy(), bootstrap_std()])
def model_graded_qa(
    template: str | None = None,
    instructions: str | None = None,
    grade_pattern: str | None = None,
    partial_credit: bool = False,
    model: list[str | Model] | str | Model | None = None,
) -> Scorer:
    ...

The default model graded QA scorer is tuned to grade answers to open ended questions. The default template and instructions ask the model to produce a grade in the format GRADE: C or GRADE: I, and this grade is extracted using the default grade_pattern regular expression. The grading is by default done with the model currently being evaluated. There are a few ways you can customise the default behaviour:

  1. Provide alternate instructions. The default instructions ask the model to use chain of thought reasoning and provide grades in the format GRADE: C or GRADE: I. Note that if you provide instructions that ask the model to format grades in a different way, you will also want to customise the grade_pattern.
  2. Specify partial_credit = True to prompt the model to assign partial credit to answers that are not entirely right but come close (metrics by default convert this to a value of 0.5). Note that this parameter is only valid when using the default instructions.
  3. Specify an alternate model to perform the grading (e.g. a more powerful model or a model fine tuned for grading).
  4. Specify a different template—note that templates are passed these variables: question, criterion, answer, and instructions.

The model_graded_fact() scorer works identically to model_graded_qa(), and simply provides an alternate template oriented around judging whether a fact is included in the model output.

If you want to understand how the default templates for model_graded_qa() and model_graded_fact() work, see their source code.
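
As a rough illustration of how a template, instructions, and grade_pattern fit together, the sketch below formats a grading prompt and then extracts the grade from a grader completion. The template text, the instructions, and the FINAL GRADE format are all hypothetical (they are not Inspect's defaults), and build_prompt and extract_grade are illustrative helper names.

```python
import re
from typing import Optional

# Hypothetical template: it uses the four variables passed to templates
# (question, criterion, answer, instructions) but is not Inspect's default.
TEMPLATE = """Question: {question}
Submission: {answer}
Criterion: {criterion}

{instructions}"""

# Hypothetical instructions that ask for a non-default grade format.
INSTRUCTIONS = (
    "Reason step by step, then finish with a line of the form "
    "FINAL GRADE = C or FINAL GRADE = I."
)

# Because the grade format changed, the pattern has to change with it.
# Group 1 captures the grade letter.
GRADE_PATTERN = r"FINAL GRADE\s*=\s*(C|I)"


def build_prompt(question: str, answer: str, criterion: str) -> str:
    """Fill the template with the sample's question, answer, and criterion."""
    return TEMPLATE.format(
        question=question,
        answer=answer,
        criterion=criterion,
        instructions=INSTRUCTIONS,
    )


def extract_grade(completion: str) -> Optional[str]:
    """Return the captured grade, or None when the pattern is absent."""
    found = re.search(GRADE_PATTERN, completion)
    return found.group(1) if found else None
```

With the real scorer, strings like these would be passed as the template, instructions, and grade_pattern arguments of model_graded_qa().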

Multiple Models

The built-in model graded scorers also support using multiple grader models (whereby the final grade is chosen by majority vote). For example, here we specify that 3 models should be used for grading:

model_graded_qa(
    model = [
        "google/gemini-1.0-pro",
        "anthropic/claude-3-opus-20240229",
        "together/meta-llama/Llama-3-70b-chat-hf",
    ]
)

The implementation of multiple grader models takes advantage of the multi_scorer() and majority_vote() functions, both of which can be used in your own scorers (as described in the Multiple Scorers section below).
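
The idea behind majority voting can be sketched in a few lines of standard-library Python. This is an illustration of the technique rather than the code behind majority_vote(); the function name majority_grade and its tie-breaking rule (the first grade seen wins) are assumptions.

```python
from collections import Counter


def majority_grade(grades: list[str]) -> str:
    """Return the grade that occurs most often.

    Counter.most_common() orders equal counts by first occurrence,
    so a tie resolves to whichever grade appeared first.
    """
    if not grades:
        raise ValueError("at least one grade is required")
    return Counter(grades).most_common(1)[0][0]
```

Using an odd number of grader models avoids ties for two-valued grades, which is one reason to grade with three models rather than two.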

Custom Scorers

Custom scorers are functions that take a TaskState and Target, and yield a Score.

async def score(state: TaskState, target: Target):
     # Compare state / model output with target
     # to yield a score
     return Score(value=...)

First we’ll talk about the core Score and Value objects, then provide some examples of custom scorers to make things more concrete.

Note that score() above is declared as an async function. When creating custom scorers, it’s critical that you understand Inspect’s concurrency model. More specifically, if your scorer is doing non-trivial work (e.g. calling REST APIs, executing external processes, etc.) please review Parallelism before proceeding.
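
To see why async matters, consider the standard-library sketch below. It is not Inspect code: fake_grader is a stand-in for a model call, and score_all and main are hypothetical helpers. Because each score() awaits the grader instead of blocking, the event loop can overlap the waiting time of many samples.

```python
import asyncio


async def fake_grader(prompt: str) -> str:
    """Stand-in for a model call: waits without blocking the event loop."""
    await asyncio.sleep(0.05)
    return "GRADE: C"


async def score(sample: str) -> str:
    """Await the grader, then strip the prefix to get the grade letter."""
    completion = await fake_grader(f"Grade this answer: {sample}")
    return completion.removeprefix("GRADE: ")


async def score_all(samples: list[str]) -> list[str]:
    """Score every sample concurrently; results keep the input order."""
    return await asyncio.gather(*(score(sample) for sample in samples))


def main() -> list[str]:
    return asyncio.run(score_all(["first", "second", "third"]))
```

A scorer that blocked inside score() (for example with time.sleep() or a synchronous HTTP client) would force the samples to be processed one after another.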

Score

The components of Score include:

Field        Type           Description
value        Value          Value assigned to the sample (e.g. “C” or “I”, or a raw numeric value).
answer       str            Text extracted from model output for comparison (optional).
explanation  str            Explanation of score, e.g. full model output or grader model output (optional).
metadata     dict[str,Any]  Additional metadata about the score to record in the log file (optional).

For example, the following are all valid Score objects:

Score(value="C")
Score(value="I")
Score(value=0.6)
Score(
    value="C" if extracted == target.text else "I", 
    answer=extracted, 
    explanation=state.output.completion
)

If you are extracting an answer from within a completion (e.g. looking for text using a regex pattern, looking at the beginning or end of the completion, etc.) you should strive to always return an answer as part of your Score, as this makes it much easier to understand the details of scoring when viewing the eval log file.

Value

Value is a union over the main scalar types as well as a list or dict of the same types:

Value = Union[
    str | int | float | bool,
    list[str | int | float | bool],
    dict[str, str | int | float | bool],
]

The vast majority of scorers will use str (e.g. for correct/incorrect via “C” and “I”) or float (the other types are there to meet more complex scenarios). One thing to keep in mind is that whatever Value type you use in a scorer must be supported by the metrics declared for the scorer (more on this below).
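
The sketch below shows what "supported by the metrics" means in practice: a numeric metric has to turn every score value into a number before it can aggregate. The helper value_to_float is hypothetical and only approximates the kind of conversion a metric performs. The mapping of “C” and “I” to 1.0 and 0.0 follows the convention used by accuracy().

```python
from typing import Union

Scalar = Union[str, int, float, bool]


def value_to_float(value: Scalar) -> float:
    """Convert a scalar score value to a float, or raise if that is impossible."""
    # bool must be tested before int, because bool is a subclass of int
    if isinstance(value, bool):
        return 1.0 if value else 0.0
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        if value == "C":
            return 1.0
        if value == "I":
            return 0.0
        # numeric strings such as "0.25" convert; anything else raises ValueError
        return float(value)
    raise TypeError(f"unsupported score value: {value!r}")
```

A scorer that returned a value such as "maybe" would fail in a conversion like this one, so a scorer with unusual values needs metrics that know how to interpret them.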

Next, we’ll take a look at the source code for a couple of the built in scorers as a jumping off point for implementing your own scorers. If you are working on custom scorers, you should also review the Scorer Workflow section below for tips on optimising your development process.

Example: Includes

Here is the source code for the built-in includes() scorer:

@scorer(metrics=[accuracy(), bootstrap_std()])  # (1)
def includes(ignore_case: bool = True):

    async def score(state: TaskState, target: Target):  # (2)

        # check for correct
        answer = state.output.completion
        target = target.text  # (3)
        if ignore_case:
            correct = answer.lower().rfind(target.lower()) != -1
        else:
            correct = answer.rfind(target) != -1

        # return score
        return Score(
            value = CORRECT if correct else INCORRECT,  # (4)
            answer=answer  # (5)
        )

    return score

(1) The function applies the @scorer decorator and registers two metrics for use with the scorer.
(2) The score() function is declared as async. This is so that it can participate in Inspect’s optimised scheduling for expensive model generation calls (this scorer doesn’t call a model but others will).
(3) We make use of the text property on the Target. This is a convenience property to get a simple text value out of the Target (as targets can technically be a list of strings).
(4) We use the special constants CORRECT and INCORRECT for the score value, as the accuracy() and bootstrap_std() metrics know how to convert these special constants to float values (1.0 and 0.0 respectively).
(5) We provide the full model completion as the answer for the score (answer is optional, but highly recommended as it is often useful to refer to during evaluation development).

Example: Model Grading

Here’s a somewhat simplified version of the code for the model_graded_qa() scorer:


@scorer(metrics=[accuracy(), bootstrap_std()])
def model_graded_qa(
    template: str = DEFAULT_MODEL_GRADED_QA_TEMPLATE,
    instructions: str = DEFAULT_MODEL_GRADED_QA_INSTRUCTIONS,
    grade_pattern: str = DEFAULT_GRADE_PATTERN,
    model: str | Model | None = None,
) -> Scorer:
   
    # resolve grading template and instructions, 
    # (as they could be file paths or URLs)
    template = resource(template)
    instructions = resource(instructions)

    # resolve model
    grader_model = get_model(model)

    async def score(state: TaskState, target: Target) -> Score:
        # format the model grading template
        score_prompt = template.format(
            question=state.input_text,
            answer=state.output.completion,
            criterion=target.text,
            instructions=instructions,
        )

        # query the model for the score
        result = await grader_model.generate(score_prompt)

        # extract the grade
        match = re.search(grade_pattern, result.completion)
        if match:
            return Score(
                value=match.group(1),
                answer=match.group(0),
                explanation=result.completion,
            )
        else:
            return Score(
                value=INCORRECT,
                explanation="Grade not found in model output: "
                + f"{result.completion}",
            )

    return score

Note that the call to grader_model.generate() is done with await. This is critical to ensure that the scorer participates correctly in the scheduling of generation work.

Note also we use the input_text property of the TaskState to access a string version of the original user input to substitute it into the grading template. Using the input_text has two benefits: (1) It is guaranteed to cover the original input from the dataset (rather than a transformed prompt in messages); and (2) It normalises the input to a string (as it could have been a message list).

Multiple Scorers

The multiple scorers feature described below is available in only the very latest version of Inspect (v0.3.18). You can upgrade to the latest version with:

$ pip install --upgrade inspect-ai

There are several ways to use multiple scorers in an evaluation:

  1. You can provide a list of scorers in a Task definition (this is the best option when scorers are entirely independent)
  2. You can yield multiple scores from a Scorer (this is the best option when scores share code and/or expensive computations).
  3. You can use multiple scorers and then aggregate them into a single scorer (e.g. majority voting).

List of Scorers

Task definitions can specify multiple scorers. For example, the below task will use two different models to grade the results, storing two scores with each sample, one for each of the two models:

Task(
    dataset=dataset,
    plan=[
        system_message(SYSTEM_MESSAGE),
        generate()
    ],
    scorer=[
        model_graded_qa(model="openai/gpt-4"), 
        model_graded_qa(model="google/gemini-1.5-pro")
    ],
)

This is useful when there is more than one way to score a result and you would like to preserve the individual score values with each sample (versus reducing the multiple scores to a single value).

Scorer with Multiple Values

You may also create a scorer which yields multiple scores. This is useful when the scores use data that is shared or expensive to compute. For example:

@scorer(
    metrics={  # (1)
        "a_count": [mean(), bootstrap_std()],
        "e_count": [mean(), bootstrap_std()]
    }
)
def letter_count():
    async def score(state: TaskState, target: Target):
        answer = state.output.completion
        a_count = answer.count("a")
        e_count = answer.count("e")
        return Score(  # (2)
            value={"a_count": a_count, "e_count": e_count},
            answer=answer
        )

    return score

task = Task(
    dataset=[Sample(input="Tell me a story.")],
    scorer=letter_count()
)

(1) The metrics for this scorer are a dictionary that defines the metrics to be applied to each score (by name).
(2) The score value itself is a dictionary whose keys correspond to the keys defined in the metrics on the @scorer decorator.

The above example will produce two scores, a_count and e_count, each of which will have metrics for mean and bootstrap_std.
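
To illustrate how metrics apply by name to dictionary-valued scores, here is a standard-library sketch. It is not how Inspect computes its metrics internally; letter_counts and mean_by_key are hypothetical helpers, and only the mean is computed.

```python
from statistics import fmean


def letter_counts(text: str) -> dict[str, int]:
    """Compute both counts in one pass over the same completion."""
    return {"a_count": text.count("a"), "e_count": text.count("e")}


def mean_by_key(values: list[dict[str, int]]) -> dict[str, float]:
    """Aggregate each named score separately across all samples."""
    keys = values[0].keys()
    return {key: fmean(value[key] for value in values) for key in keys}


# One dictionary per sample, then one aggregate per key
per_sample = [letter_counts("banana tree"), letter_counts("a bee")]
summary = mean_by_key(per_sample)
```

Each key is aggregated independently, which is why the keys of the score value need to line up with the keys of the metrics dictionary.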

Reducing Multiple Scores

It’s possible to use multiple scorers in parallel, then reduce their output into a final overall score. This is done using the multi_scorer() function. For example, this is roughly how the built in model graders use multiple models for grading:

multi_scorer(
    scorers = [model_graded_qa(model=model) for model in models],
    reducer = majority_vote
)

Use of multi_scorer() requires both a list of scorers as well as a reducer which is a function that takes a list of scores and turns it into a single score. In this case we use the built in majority_vote() reducer which returns the score that appeared most frequently in the answers.

You can imagine a variety of different strategies for reducing scores (take the average, take the high or low, majority vote, etc.). For example, here’s a reducer that computes the average score:

import numpy as np

def average_score(scores: list[Score]) -> Score:
    values = [score.as_float() for score in scores]
    avg = np.mean(values).item()
    return Score(
        value=avg,
        # values are floats, so convert each one to str before joining
        explanation=f"average of {', '.join(str(v) for v in values)}"
    )

Further, you will need to wrap your use of multi_scorer() inside a @scorer decorated function (with the requisite metrics specified). For example:

@scorer(metrics=[mean()])
def multi_model_graded(models):
    return multi_scorer(
        scorers = [model_graded_qa(model=model) for model in models],
        reducer = average_score
    )
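
Other reduction strategies follow the same shape as the average reducer: take a list of scores and return a single score. The sketch below shows "take the high" and "take the low" reducers. To stay runnable without Inspect it uses a stand-in class, SketchScore, in place of the real Score; the reducer names are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SketchScore:
    """Stand-in for Inspect's Score with only the fields used here."""
    value: float
    explanation: str = ""


def highest_score(scores: list[SketchScore]) -> SketchScore:
    """Reduce to the most favourable of the individual scores."""
    best = max(score.value for score in scores)
    return SketchScore(value=best, explanation=f"highest of {len(scores)} scores")


def lowest_score(scores: list[SketchScore]) -> SketchScore:
    """Reduce to the least favourable of the individual scores."""
    worst = min(score.value for score in scores)
    return SketchScore(value=worst, explanation=f"lowest of {len(scores)} scores")
```

Taking the low is the stricter choice (every grader has to agree for a high score), while taking the high rewards an answer that any one grader accepts.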

Metrics

Each scorer provides one or more built-in metrics (typically accuracy and bootstrap_std). In addition, you can specify other metrics (either built-in or custom) to compute when defining a Task:

Task(
    dataset=dataset,
    plan=[
        system_message(SYSTEM_MESSAGE),
        multiple_choice()
    ],
    scorer=match(),
    metrics=[custom_metric()]
)

Built-In Metrics

Inspect includes some simple built in metrics for calculating accuracy, mean, etc. Built in metrics can be imported from the inspect_ai.scorer module. Below is a summary of these metrics. There is not (yet) reference documentation on these functions so the best way to learn about how they can be customised, etc. is to use the Go to Definition command in your source editor.

  • accuracy()

    Compute proportion of total answers which are correct. For correct/incorrect scores assigned 1 or 0, can optionally assign 0.5 for partially correct answers.

  • mean()

    Mean of all scores.

  • var()

    Variance over all scores.

  • bootstrap_std()

    Standard deviation of a bootstrapped estimate of the mean. 1000 samples are taken by default (modify this using the num_samples option).
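
For intuition about what accuracy() and bootstrap_std() report, here is a standard-library sketch that operates on plain floats. It is an approximation for illustration, not Inspect's implementation: the function names, the seed parameter, and the use of a population standard deviation are assumptions.

```python
import random
import statistics


def accuracy_sketch(values: list[float]) -> float:
    """Proportion correct, where each value is 1.0, 0.0, or 0.5 for partial credit."""
    return sum(values) / len(values)


def bootstrap_std_sketch(
    values: list[float], num_samples: int = 1000, seed: int = 0
) -> float:
    """Standard deviation of the mean across bootstrap resamples.

    Each resample draws len(values) items with replacement.
    """
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(num_samples)
    ]
    return statistics.pstdev(means)
```

The bootstrapped standard deviation shrinks as the number of scored samples grows, so it gives a sense of how much an accuracy figure might move if the evaluation were rerun on a different draw of samples.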

Custom Metrics

You can also add your own metrics with @metric decorated functions. For example, here is the implementation of the variance metric:

import numpy as np

from inspect_ai.scorer import Metric, Score, metric

@metric
def var() -> Metric:
    """Compute variance over all scores."""

    def metric(scores: list[Score]) -> float:
        return np.var([score.as_float() for score in scores]).item()

    return metric

Note that the Score class contains a Value that is a union over several scalar and collection types. As a convenience, Score includes a set of accessor methods to treat the value as a simpler form (e.g. above we use the score.as_float() accessor).
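
Because a metric is a function that returns a function, it can take its own options. The sketch below shows that pattern with a hypothetical pass_rate metric that reports the fraction of scores at or above a threshold. To stay runnable without Inspect it uses a stand-in SketchScore class rather than the real Score. In a real evaluation you would apply the @metric decorator and annotate the return type as Metric, as in the var() example above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class SketchScore:
    """Stand-in for Inspect's Score, exposing only as_float()."""
    value: float

    def as_float(self) -> float:
        return float(self.value)


def pass_rate(threshold: float = 0.5) -> Callable[[list[SketchScore]], float]:
    """Build a metric that reports the fraction of scores >= threshold."""

    def metric(scores: list[SketchScore]) -> float:
        passed = sum(1 for score in scores if score.as_float() >= threshold)
        return passed / len(scores)

    return metric
```

The outer function captures the option (threshold) and the inner function does the aggregation, which is why metrics are written with call syntax, for example mean() rather than mean.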

Workflow

Score Command

By default, model output in evaluations is automatically scored. However, you can separate generation and scoring by using the --no-score option. For example:

inspect eval popularity.py --model openai/gpt-4 --no-score

You can score an evaluation previously run this way using the inspect score command:

# score last eval
inspect score popularity.py

# score specific log file
inspect score popularity.py ./logs/2024-02-23_task_gpt-4_TUhnCn473c6.json

Tip

Using a distinct scoring step is particularly useful during scorer development, as it bypasses the entire generation phase, saving lots of time and inference costs.

Log Overwriting

By default, inspect score overwrites the file it scores. If you don’t want to overwrite target files, pass the --no-overwrite flag:

inspect score popularity.py --no-overwrite

When specifying --no-overwrite, a -scored suffix will be added to the original log file name:

./logs/2024-02-23_task_gpt-4_TUhnCn473c6-scored.json

Note that the --no-overwrite flag does not apply to log files that already have the -scored suffix—those files are always overwritten by inspect score. If you plan on scoring multiple times and you want to save each scoring output, you will want to copy the log to another location before re-scoring.
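
The naming rule described above can be summarised in a short sketch. The helper scored_log_path is hypothetical (it is not part of Inspect) and simply mirrors the behaviour as documented here, including the special case for files that already carry the -scored suffix.

```python
from pathlib import Path


def scored_log_path(log_file: str, overwrite: bool = True) -> Path:
    """Return the path that scoring would write to, per the rules above."""
    path = Path(log_file)
    # Default behaviour: the scored log replaces the original file.
    # Files that already end in -scored are overwritten regardless.
    if overwrite or path.stem.endswith("-scored"):
        return path
    # With --no-overwrite, write next to the original with a -scored suffix
    return path.with_name(f"{path.stem}-scored{path.suffix}")
```

Copying the log to another location before re-scoring, as suggested above, remains the way to keep the output of each scoring run.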

Python API

If you are exploring the performance of different scorers, you might find it more useful to call the score() function using varying scorers or scorer options. For example:

log = eval(popularity, model="openai/gpt-4")[0]

grader_models = [
    "openai/gpt-4",
    "anthropic/claude-3-opus-20240229",
    "google/gemini-1.0-pro",
    "mistral/mistral-large-latest"
]

scoring_logs = [score(log, model_graded_qa(model=model)) 
                for model in grader_models]

plot_results(scoring_logs)