Structured Output
The structured output feature described below is currently available only in the development version of Inspect. To install the development version from GitHub:
pip install git+https://github.com/UKGovernmentBEIS/inspect_ai
Overview
Structured output is a feature supported by some model providers to ensure that models generate responses which adhere to a supplied JSON Schema. Structured output is currently supported in Inspect for the OpenAI, Google, and Mistral providers.
While structured output may seem like a robust solution to model unreliability, it’s important to keep in mind that by specifying a JSON schema you are also introducing unknown effects on model task performance. There is even some early literature indicating that models perform worse with structured output.
You should therefore test the use of structured output as an elicitation technique like you would any other, and only proceed if you feel confident that it has made a genuine improvement in your overall task.
Example
Below we’ll walk through a simple example of using structured output to constrain model output to a Color type that provides red, green, and blue components. If you want to experiment with it further, see the source code in the Inspect GitHub repository.
Imagine first that we have the following dataset:
from inspect_ai.dataset import Sample

colors_dataset = [
    Sample(
        input="What is the RGB color for white?",
        target="255,255,255",
    ),
    Sample(
        input="What is the RGB color for black?",
        target="0,0,0",
    ),
]
We want the model to give us the RGB values for the colors, but it might choose to output them in a wide variety of formats. Parsing all of these formats in our scorer could be laborious and error-prone.
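To make the parsing burden concrete, here is a small sketch of the kind of ad hoc parser a scorer might otherwise need. The formats handled (hex codes and comma-separated triples) and the helper name parse_rgb are illustrative assumptions, not something Inspect provides, and real model output could take many other forms.

```python
import re
from typing import Optional, Tuple


def parse_rgb(text: str) -> Optional[Tuple[int, int, int]]:
    """Best-effort extraction of an RGB triple from free-form text (hypothetical helper)."""
    # Hex notation such as "#FFFFFF"
    match = re.search(r"#([0-9a-fA-F]{6})\b", text)
    if match:
        digits = match.group(1)
        return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))

    # Three comma-separated integers, optionally wrapped as "rgb(...)"
    match = re.search(r"(\d{1,3})\s*,\s*(\d{1,3})\s*,\s*(\d{1,3})", text)
    if match:
        return tuple(int(group) for group in match.groups())

    # Anything else (e.g. "R=255 G=255 B=255") is not recognized
    return None
```

Every additional format needs another branch, and unrecognized output silently falls through, which is the maintenance cost that a schema-constrained response is meant to avoid.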
Here we define a Pydantic Color type that we’d like to get back from the model:
from pydantic import BaseModel

class Color(BaseModel):
    red: int
    green: int
    blue: int
To instruct the model to return output in this type, we use the response_schema generate config option, using the json_schema() function to produce a schema for our type. Here is a complete task definition which uses the dataset and color type from above:
from inspect_ai import Task, task
from inspect_ai.model import GenerateConfig, ResponseSchema
from inspect_ai.solver import generate
from inspect_ai.util import json_schema

@task
def rgb_color():
    return Task(
        dataset=colors_dataset,
        solver=generate(),
        scorer=score_color(),
        config=GenerateConfig(
            response_schema=ResponseSchema(
                name="color",
                json_schema=json_schema(Color),
            )
        ),
    )
We use the json_schema() function to create a JSON schema for our Color type, then wrap that in a ResponseSchema where we also assign it a name.
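For orientation, the schema for Color would look roughly like the hand-written dictionary below. This is an approximation written for illustration: the exact output of json_schema() may differ in detail (for example, in titles or key ordering), and the sample response is hypothetical.

```python
import json

# Hand-written approximation of a JSON Schema for the Color type
# (an assumption for illustration, not the literal output of json_schema()).
color_schema = {
    "type": "object",
    "properties": {
        "red": {"type": "integer"},
        "green": {"type": "integer"},
        "blue": {"type": "integer"},
    },
    "required": ["red", "green", "blue"],
}

# A hypothetical model response that conforms to the schema.
# The provider still returns it as a string, so it must be parsed.
example_response = '{"red": 255, "green": 255, "blue": 255}'
parsed = json.loads(example_response)
```

The schema constrains the shape of the response, but the response still arrives as text, which is why the task also needs a scorer that parses it.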
You’ll also notice that we have specified a custom scorer. We need this to both parse and evaluate our custom type (as models still return JSON output as a string). Here is the scorer:
from pydantic import ValidationError

from inspect_ai.scorer import (
    CORRECT,
    INCORRECT,
    Score,
    Target,
    accuracy,
    scorer,
    stderr,
)
from inspect_ai.solver import TaskState

@scorer(metrics=[accuracy(), stderr()])
def score_color():
    async def score(state: TaskState, target: Target):
        try:
            # parse the model's JSON output into our Color type
            color = Color.model_validate_json(state.output.completion)
            if f"{color.red},{color.green},{color.blue}" == target.text:
                value = CORRECT
            else:
                value = INCORRECT
            return Score(
                value=value,
                answer=state.output.completion,
            )
        except ValidationError as ex:
            # output was not valid JSON or did not match the Color type
            return Score(
                value=INCORRECT,
                answer=state.output.completion,
                explanation=f"Error parsing response: {ex}",
            )

    return score
The Pydantic Color type has a convenient model_validate_json() method which we can use to read the model’s output (being sure to catch the ValidationError if the model produces incorrect output).
Schema
The json_schema() function supports creating schemas for any Python type including Pydantic models, dataclasses, and typed dicts. That said, Pydantic models are highly recommended as they provide additional parsing and validation which is generally required for scorers.
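As a rough illustration of what Pydantic saves you, the sketch below uses a plain dataclass and the standard library only. The ColorData type and the parse_color helper are hypothetical (not part of Inspect) and show the checks a scorer would have to hand-roll when the type does not validate itself.

```python
import json
from dataclasses import dataclass, fields


@dataclass
class ColorData:
    red: int
    green: int
    blue: int


def parse_color(text: str) -> ColorData:
    """Parse and validate a JSON string by hand (hypothetical helper)."""
    # json.loads raises json.JSONDecodeError, a subclass of ValueError, on invalid JSON
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")

    names = [field.name for field in fields(ColorData)]
    for name in names:
        value = data.get(name)
        # bool is a subclass of int in Python, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, int):
            raise ValueError(f"field {name!r} must be an integer")

    return ColorData(**{name: data[name] for name in names})


color = parse_color('{"red": 0, "green": 0, "blue": 0}')
formatted = f"{color.red},{color.green},{color.blue}"
```

With a Pydantic model, the type and presence checks above collapse into a single model_validate_json() call, which is the main reason for the recommendation.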
The response_schema generation config option takes a ResponseSchema object which includes the schema and some additional fields:
from inspect_ai.model import GenerateConfig, ResponseSchema
from inspect_ai.util import json_schema

config = GenerateConfig(
    response_schema=ResponseSchema(
        name="color",                    # required name field
        json_schema=json_schema(Color),  # schema for custom type
        description="description",       # optional field with more context
        strict=False,                    # whether to force the model to adhere to the schema
    )
)
Note that not all model providers support all of these options. In particular, only the Mistral and OpenAI providers support the name, description, and strict fields (the Google provider takes the json_schema only).
You should therefore never assume that specifying strict gets your scorer off the hook for parsing and validating the model output, as some models won’t respect strict. Using strict may also impact task performance; as always, it’s best to experiment and measure!