Datasets
Overview
Inspect has native support for reading datasets in the CSV, JSON, and JSON Lines formats, as well as from Hugging Face. In addition, the core dataset interface for the evaluation pipeline is flexible enough to accept data read from just about any source (see the Custom Reader section below for details).
If your data is already in a format amenable to direct reading as an Inspect Sample, reading a dataset is as simple as this:
from inspect_ai.dataset import csv_dataset, json_dataset
dataset1 = csv_dataset("dataset1.csv")
dataset2 = json_dataset("dataset2.json")
Of course, many real-world datasets won’t be so trivial to read. Below we’ll discuss the various ways you can adapt your datasets for use with Inspect.
Dataset Samples
The core data type underlying the use of datasets with Inspect is the Sample, which consists of a required input field and several other optional fields:
Class inspect_ai.dataset.Sample

Field | Type | Description |
---|---|---|
input | str \| list[ChatMessage] | The input to be submitted to the model. |
choices | list[str] \| None | Optional. Multiple choice answer list. |
target | str \| list[str] \| None | Optional. Ideal target output. May be a literal value or narrative text to be used by a model grader. |
id | str \| None | Optional. Unique identifier for sample. |
metadata | dict[str, Any] \| None | Optional. Arbitrary metadata associated with the sample. |
sandbox | str \| tuple[str, str] | Optional. Sandbox environment type (or optionally a tuple with type and config file). |
files | dict[str, str] \| None | Optional. Files that go along with the sample (copied to sandbox environments). |
setup | str \| None | Optional. Setup script to run for sample (executed within default sandbox environment). |
So a CSV dataset with the following structure:
input | target |
---|---|
What cookie attributes should I use for strong security? | secure samesite and httponly |
How should I store passwords securely for an authentication system database? | strong hashing algorithms with salt like Argon2 or bcrypt |
Can be read directly with:
dataset = csv_dataset("security_guide.csv")
Note that samples from datasets without an id
field will automatically be assigned ids based on an auto-incrementing integer starting with 1.
If your samples include choices, then the target should be a numeric index into the available choices rather than a letter (this is an implicit assumption of the multiple_choice() solver).
Files
The files
field maps container target file paths to file contents (where contents can be either a filesystem path, a URL, or a string with inline content). For example, to copy a local file named flag.txt
into the container path /shared/flag.txt
you would use this:
"/shared/flag.txt": "flag.txt"
Files are copied into the default sandbox environment unless their name contains a prefix mapping them into another environment. For example, to copy into the victim
container:
"victim:/shared/flag.txt": "flag.txt"
Setup
The setup
field contains either a path to a bash setup script (resolved relative to the dataset path) or the contents of a script to execute. Setup scripts are executed with a 5 minute timeout. If you have setup scripts that may take longer than this you should move some of your setup code into the container build setup (e.g. Dockerfile).
Field Mapping
If your dataset contains inputs and targets that don’t use input
and target
as field names, you can map them into a Dataset
using a FieldSpec
. This same mechanism also enables you to collect arbitrary additional fields into the Sample
metadata
bucket. For example:
from inspect_ai.dataset import FieldSpec, json_dataset
dataset = json_dataset(
    "popularity.jsonl",
    FieldSpec(
        input="question",
        target="answer_matching_behavior",
        id="question_id",
        metadata=["label_confidence"],
    ),
)
If you need to do more than just map field names and actually do custom processing of the data, you can instead pass a function which takes a record
(represented as a dict
) from the underlying file and returns a Sample
. For example:
from inspect_ai.dataset import Sample, json_dataset
def record_to_sample(record):
    return Sample(
        input=record["question"],
        target=record["answer_matching_behavior"].strip(),
        id=record["question_id"],
        metadata={"label_confidence": record["label_confidence"]},
    )

dataset = json_dataset("popularity.jsonl", record_to_sample)
Typed Metadata
If you want a more strongly typed interface to sample metadata, you can define a Pydantic model and use it to both validate and read metadata.
For validation, pass a BaseModel
derived class in the FieldSpec
. The interface to metadata is read-only so you must also specify frozen=True
. For example:
from pydantic import BaseModel
class PopularityMetadata(BaseModel, frozen=True):
    category: str
    label_confidence: float

dataset = json_dataset(
    "popularity.jsonl",
    FieldSpec(
        input="question",
        target="answer_matching_behavior",
        id="question_id",
        metadata=PopularityMetadata,
    ),
)
To read metadata in a typesafe fashion, use the metadata_as()
method on Sample
or TaskState
:
metadata = state.metadata_as(PopularityMetadata)
Note again that the intended semantics of metadata
are read-only, so attempting to write into the returned metadata will raise a Pydantic FrozenInstanceError
.
If you need per-sample mutable data, use the sample store, which also supports typing using Pydantic models.
Filter and Shuffle
The Dataset
class includes filter()
and shuffle()
methods, as well as support for the slice operator.
To select a subset of the dataset, use filter()
:
dataset = json_dataset("popularity.jsonl", record_to_sample)
dataset = dataset.filter(
    lambda sample: sample.metadata["category"] == "advanced"
)
To select a subset of records, use standard Python slicing:
dataset = dataset[0:100]
Shuffling is often helpful when you want to vary the samples used during evaluation development. To do this, either use the shuffle()
method or the shuffle
parameter of the dataset loading functions:
# shuffle method
dataset = dataset.shuffle()

# shuffle on load
dataset = json_dataset("data.jsonl", shuffle=True)
Note that both of these methods optionally support specifying a random seed for shuffling.
Hugging Face
Hugging Face Datasets is a library for easily accessing and sharing datasets for machine learning, and features integration with Hugging Face Hub, a repository with a broad selection of publicly shared datasets. Typically datasets on Hugging Face will require specification of which split within the dataset to use (e.g. train, test, or validation) as well as some field mapping. Use the hf_dataset()
function to read a dataset and specify the requisite split and field names:
from inspect_ai.dataset import FieldSpec, hf_dataset
dataset = hf_dataset(
    "openai_humaneval",
    split="test",
    sample_fields=FieldSpec(
        id="task_id",
        input="prompt",
        target="canonical_solution",
        metadata=["test", "entry_point"],
    ),
)
Note that some Hugging Face datasets execute Python code in order to resolve the underlying dataset files. Since this code is run on your local machine, you need to specify trust=True in order to perform the download. This option should only be set to True for repositories you trust and whose code you have read. Here's an example of using the trust option (note that it defaults to False if not specified):
dataset = hf_dataset(
    "openai_humaneval",
    split="test",
    trust=True,
    ...
)
Under the hood, the hf_dataset() function calls the load_dataset() function in the Hugging Face datasets package. You can additionally pass arbitrary parameters on to load_dataset() by including them in the call to hf_dataset(). For example, hf_dataset(..., cache_dir="~/my-cache-dir").
Amazon S3
Inspect has integrated support for storing datasets on Amazon S3. Compared to the local file-system, S3 can provide more flexible sharing and access control, as well as a more reliable long-term store.
Using S3 is mostly a matter of substituting S3 URLs (e.g. s3://my-bucket-name
) for local file-system paths. For example, here is how you load a dataset from S3:
"s3://my-bucket/dataset.jsonl") json_dataset(
S3 buckets are normally access controlled so require authentication to read from. There are a wide variety of ways to configure your client for AWS authentication, all of which work with Inspect. See the article on Configuring the AWS CLI for additional details.
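For instance, one common approach (one of several that AWS clients support) is to supply credentials via environment variables. The values below are placeholders to substitute with your own:

```shell
# placeholder credentials: substitute your own values,
# or use a named profile / IAM role instead
export AWS_ACCESS_KEY_ID="<your-access-key-id>"
export AWS_SECRET_ACCESS_KEY="<your-secret-access-key>"
export AWS_DEFAULT_REGION="us-east-1"
```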
Chat Messages
The most important data structure within Sample
is the ChatMessage
. Note that often datasets will contain a simple string as their input (which is then internally converted to a ChatMessageUser
). However, it is possible to include a full message history as the input via ChatMessage
. Another useful application of ChatMessage
is providing multi-modal input (e.g. images).
Class inspect_ai.model.ChatMessage

Field | Type | Description |
---|---|---|
role | "system" \| "user" \| "assistant" \| "tool" | Role of this chat message. |
content | str \| list[Content] | The content of the message. Can be a simple string or a list of content parts intermixing text and images. |
An input with chat messages in your dataset will look something like this:
"input": [
{"role": "user",
"content": "What cookie attributes should I use for strong security?"
} ]
Note that for this example we wouldn’t normally use a full chat message object (rather we’d just provide a simple string). Chat message objects are more useful when you want to include a system prompt or prime the conversation with “assistant” responses.
Custom Reader
You are not restricted to the built-in dataset functions for reading samples. You can also construct a MemoryDataset and pass that to a task. For example:
from inspect_ai import Task, task
from inspect_ai.dataset import MemoryDataset, Sample
from inspect_ai.scorer import model_graded_fact
from inspect_ai.solver import generate, system_message
dataset = MemoryDataset([
    Sample(
        input="What cookie attributes should I use for strong security?",
        target="secure samesite and httponly",
    )
])

@task
def security_guide():
    return Task(
        dataset=dataset,
        solver=[system_message(SYSTEM_MESSAGE), generate()],
        scorer=model_graded_fact(),
    )
So if the built-in dataset functions don't meet your needs, you can create a custom function that yields a MemoryDataset and pass it directly to your Task.