Structured Output

The structured output feature described below is currently available only in the development version of Inspect. To install the development version from GitHub:

pip install git+https://github.com/UKGovernmentBEIS/inspect_ai

Overview

Structured output is a feature supported by some model providers to ensure that models generate responses which adhere to a supplied JSON Schema. Structured output is currently supported in Inspect for the OpenAI, Google, and Mistral providers.

While structured output may seem like a robust solution to model unreliability, keep in mind that constraining generation to a JSON schema can itself affect task performance in ways that are hard to predict. There is even some early literature indicating that models perform worse under structured output.

You should therefore test structured output as you would any other elicitation technique, and only adopt it if you are confident that it genuinely improves performance on your overall task.

Example

Below we’ll walk through a simple example of using structured output to constrain model output to a Color type that provides red, green, and blue components. If you want to experiment with it further, see the source code in the Inspect GitHub repository.

Imagine first that we have the following dataset:

from inspect_ai.dataset import Sample

colors_dataset = [
    Sample(
        input="What is the RGB color for white?",
        target="255,255,255",
    ),
    Sample(
        input="What is the RGB color for black?",
        target="0,0,0",
    ),
]

We want the model to give us the RGB values for these colors, but it might output them in a wide variety of formats, and parsing those formats in our scorer could be laborious and error-prone.
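
To get a feel for the problem, here is a minimal sketch of the sort of ad hoc parsing a scorer might otherwise fall back on (parse_rgb is a hypothetical helper, not part of Inspect), and even this still misses formats like hex codes or numbers written out as words:

import re

def parse_rgb(completion: str) -> tuple[int, int, int] | None:
    # grab the first three integers found anywhere in the output
    matches = re.findall(r"\d{1,3}", completion)
    if len(matches) >= 3:
        red, green, blue = (int(m) for m in matches[:3])
        return (red, green, blue)
    return None  # e.g. the model answered in prose with no digits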

Here we define a Pydantic Color type that we’d like to get back from the model:

from pydantic import BaseModel

class Color(BaseModel):
    red: int
    green: int
    blue: int
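
To see what the model will be asked to conform to, you can print the schema that Pydantic derives for this type (the output is shown below as an illustrative comment):

print(Color.model_json_schema())
# {
#     "properties": {
#         "red": {"title": "Red", "type": "integer"},
#         "green": {"title": "Green", "type": "integer"},
#         "blue": {"title": "Blue", "type": "integer"},
#     },
#     "required": ["red", "green", "blue"],
#     "title": "Color",
#     "type": "object",
# }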

To instruct the model to return output in this type, we use the response_schema generate config option, using the json_schema() function to produce a schema for our type. Here is the complete task definition, which uses the dataset and Color type from above:

from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig, ResponseSchema
from inspect_ai.solver import generate
from inspect_ai.util import json_schema

@task
def rgb_color():
    return Task(
        dataset=colors_dataset,
        solver=generate(),
        scorer=score_color(),
        config=GenerateConfig(
            response_schema=ResponseSchema(
              name="color", 
              json_schema=json_schema(Color)
            )
        ),
    )

We use the json_schema() function to create a JSON schema for our Color type, then wrap that in a ResponseSchema where we also assign it a name.

You’ll also notice that we have specified a custom scorer. We need this to both parse and evaluate our custom type (as models still return JSON output as a string). Here is the scorer:

from pydantic import ValidationError

from inspect_ai.scorer import (
    CORRECT,
    INCORRECT,
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def score_color():
    async def score(state: TaskState, target: Target):
        try:
            color = Color.model_validate_json(state.output.completion)
            if f"{color.red},{color.green},{color.blue}" == target.text:
                value = CORRECT
            else:
                value = INCORRECT
            return Score(
                value=value,
                answer=state.output.completion,
            )
        except ValidationError as ex:
            return Score(
                value=INCORRECT,
                answer=state.output.completion,
                explanation=f"Error parsing response: {ex}",
            )

    return score

The Pydantic Color type has a convenient model_validate_json() method which we can use to read the model's output (being sure to catch the ValidationError raised if the model produces non-conforming output).
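
With the dataset, task, and scorer in place, the evaluation runs like any other Inspect task. For example, from Python:

from inspect_ai import eval

# run against any provider that supports structured output
eval(rgb_color(), model="openai/gpt-4o")

Or from the command line (assuming the task above is saved in a file named rgb_color.py, a hypothetical filename):

inspect eval rgb_color.py --model openai/gpt-4o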

Schema

The json_schema() function supports creating schemas for any Python type including Pydantic models, dataclasses, and typed dicts. That said, Pydantic models are highly recommended as they provide additional parsing and validation which is generally required for scorers.
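
For example, a dataclass version of the Color type yields a usable schema as well (a sketch; note that a plain dataclass gives up the model_validate_json() parsing used by the scorer above):

from dataclasses import dataclass

from inspect_ai.util import json_schema

@dataclass
class ColorDataclass:
    red: int
    green: int
    blue: int

# json_schema() derives the schema from the dataclass just as it
# does from a Pydantic model
schema = json_schema(ColorDataclass)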

The response_schema generation config option takes a ResponseSchema object which includes the schema and some additional fields:

from inspect_ai.model import GenerateConfig, ResponseSchema
from inspect_ai.util import json_schema

config = GenerateConfig(
  response_schema=ResponseSchema(
    name="color",                   # required name field 
    json_schema=json_schema(Color), # schema for custom type
    description="description",      # optional field with more context
    strict=True                     # force model to adhere to schema
  )
)

Note that not all model providers support all of these options. In particular, only the Mistral and OpenAI providers support the name, description, and strict fields (the Google provider takes the json_schema only).

You should therefore never assume that specifying strict gets your scorer off the hook for parsing and validating the model output, as some providers won't respect it (keep the try/except pattern from the scorer above even when strict is enabled). Using strict may also impact task performance; as always, it's best to experiment and measure!