Inspect

HumanEval: Evaluating Large Language Models Trained on Code

Evaluating correctness for synthesizing Python programs from docstrings. Demonstrates custom scorers and sandboxing untrusted model code.
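The scoring idea here (run the model's program against test assertions and record pass or fail) can be sketched with the standard library alone. This is a minimal illustration, not Inspect's scorer API: the function name `passes_tests` and its parameters are hypothetical, and a child process with a timeout is not a real sandbox. Untrusted model code would need stronger isolation, such as a container.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def passes_tests(candidate_code: str, test_code: str, timeout_s: float = 10.0) -> bool:
    """Return True if the candidate code followed by the tests exits with status 0.

    Hypothetical helper for illustration only. A separate process with a
    timeout limits hangs, but it does NOT isolate the filesystem or network.
    """
    program = candidate_code + "\n\n" + test_code + "\n"
    with tempfile.TemporaryDirectory() as tmp:
        script = Path(tmp) / "candidate.py"
        script.write_text(program, encoding="utf-8")
        try:
            result = subprocess.run(
                [sys.executable, str(script)],
                capture_output=True,
                timeout=timeout_s,
                cwd=tmp,
            )
        except subprocess.TimeoutExpired:
            # Treat a hang (e.g. an infinite loop) as a failed sample.
            return False
    # A failed assertion or any uncaught exception gives a non-zero exit code.
    return result.returncode == 0
```

For example, `passes_tests("def add(a, b):\n    return a + b", "assert add(1, 2) == 3")` reports a pass, while a candidate that returns `a - b` fails the same assertion.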

MBPP: Mostly Basic Python Problems

Measures the ability of language models to synthesize short Python programs from natural language descriptions. Demonstrates custom scorers and sandboxing untrusted model code.

SWE-Bench: Resolving Real-World GitHub Issues

Software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Demonstrates sandboxing untrusted model code.

GAIA: A Benchmark for General AI Assistants

GAIA proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.

InterCode: Capture the Flag

Measure expertise in coding, cryptography, binary exploitation, forensics, reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code.

GDM Dangerous Capabilities: Capture the Flag

CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.

MATH: Measuring Mathematical Problem Solving

Dataset of 12,500 challenging competition mathematics problems. Demonstrates fewshot prompting and custom scorers.

GSM8K: Training Verifiers to Solve Math Word Problems

Dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Demonstrates fewshot prompting.
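Fewshot prompting places a handful of solved examples ahead of the question being asked. The sketch below shows one way to assemble such a prompt. The function name and the "Question:/Answer:" layout are assumptions made for illustration, not the template the evaluation itself uses.

```python
def build_fewshot_prompt(examples: list[tuple[str, str]], question: str) -> str:
    """Join solved (question, answer) pairs, then append the unsolved question.

    Hypothetical helper: the prompt layout is an assumption for illustration.
    """
    blocks = [f"Question: {q}\nAnswer: {a}" for q, a in examples]
    # The final block leaves the answer open for the model to complete.
    blocks.append(f"Question: {question}\nAnswer:")
    return "\n\n".join(blocks)


# Made-up demonstration pairs, not items from the dataset.
demo_examples = [
    ("Tom has 3 apples and buys 2 more. How many apples does he have?", "5"),
    ("A book costs 4 dollars. How much do 3 books cost?", "12"),
]
prompt = build_fewshot_prompt(demo_examples, "Sara has 10 pens and gives away 4. How many are left?")
```

The number of examples is a tuning choice: more demonstrations usually give the model a clearer answer format at the cost of a longer prompt.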

MathVista: Evaluating Mathematical Reasoning in Visual Contexts

Diverse mathematical and visual tasks that require fine-grained, deep visual understanding and compositional reasoning. Demonstrates multimodal inputs and custom scorers.

ARC: AI2 Reasoning Challenge

Dataset of natural, grade-school science multiple-choice questions (authored for human tests).

HellaSwag: Can a Machine Really Finish Your Sentence?

Evaluating commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up.

PIQA: Reasoning about Physical Commonsense in Natural Language

Measure physical commonsense reasoning (e.g. "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?").

BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions

Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.

DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs

Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).

WINOGRANDE: An Adversarial Winograd Schema Challenge at Scale

Large-scale dataset of 44k pronoun resolution problems inspired by the original Winograd Schema Challenge (WSC), a set of 273 expert-crafted problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.

RACE-H: A benchmark for testing reading comprehension and reasoning abilities of neural models

Reading comprehension tasks collected from English exams for Chinese middle and high school students aged 12 to 18.

MMMU: A Massive Multi-discipline Multimodal Understanding and Reasoning Benchmark

Multimodal questions from college exams, quizzes, and textbooks, covering six core disciplines and demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.

SQuAD: A Reading Comprehension Benchmark requiring reasoning over Wikipedia articles

Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.

IFEval: Instruction-Following Evaluation for Large Language Models

Evaluates the ability to follow a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Demonstrates custom scoring.
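An instruction is "verifiable" when a short program can check it without human judgment. The two instructions quoted above could be checked roughly as follows. The function names are hypothetical, and the choices made here (whitespace-delimited words, case-insensitive whole-word keyword matching) are assumptions, not the benchmark's own implementation.

```python
import re


def has_more_than_n_words(response: str, n: int) -> bool:
    """Check "write in more than n words", counting whitespace-separated tokens."""
    return len(response.split()) > n


def mentions_keyword_at_least(response: str, keyword: str, times: int) -> bool:
    """Check "mention the keyword at least `times` times".

    Assumption: matches are case-insensitive and must be whole words, so
    "AI" is not counted inside a longer word such as "maintain".
    """
    pattern = r"\b" + re.escape(keyword) + r"\b"
    return len(re.findall(pattern, response, flags=re.IGNORECASE)) >= times


def follows_all(response: str) -> bool:
    """Combine both checks for the two example instructions."""
    return has_more_than_n_words(response, 400) and mentions_keyword_at_least(response, "AI", 3)
```

A scorer built this way can report per-instruction results as well as a strict all-instructions-followed result, since each check is independent.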

AGIEval: A Human-Centric Benchmark for Evaluating Foundation Models

Questions from human-centric standardized exams, such as college entrance exams, law school admission tests, math competitions, and lawyer qualification tests. Demonstrates custom scoring.

MMLU: Measuring Massive Multitask Language Understanding

Evaluate models on 57 tasks including elementary mathematics, US history, computer science, law, and more.

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

An enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
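Moving from four to ten options mainly changes how a question is lettered and which answers count as valid. The sketch below formats a question with up to ten lettered choices and compares a model's letter with the target. The function names and prompt wording are assumptions for illustration only.

```python
import string


def format_multiple_choice(question: str, choices: list[str]) -> str:
    """Render a question with lettered options (A-D for four, A-J for ten)."""
    if not 2 <= len(choices) <= 10:
        raise ValueError("expected between 2 and 10 choices")
    letters = string.ascii_uppercase[: len(choices)]
    lines = [question]
    lines += [f"{letter}) {choice}" for letter, choice in zip(letters, choices)]
    lines.append(f"Answer with a single letter from {letters[0]} to {letters[-1]}.")
    return "\n".join(lines)


def is_correct(model_answer: str, target_letter: str) -> bool:
    """Compare the model's letter with the target, ignoring case and whitespace."""
    return model_answer.strip().upper() == target_letter.strip().upper()
```

With ten options the chance of guessing correctly drops from 25% to 10%, which is one reason a larger choice set makes scores more informative.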

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry (experts at PhD level in the corresponding domains reach 65% accuracy).

CommonsenseQA: A Question Answering Challenge Targeting Commonsense Knowledge

Measure question answering with commonsense prior knowledge.

TruthfulQA: Measuring How Models Mimic Human Falsehoods

Measure whether a language model generates truthful answers, using questions that some humans would answer falsely due to a false belief or misconception.

XSTest: A benchmark for identifying exaggerated safety behaviours in LLMs

Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.

PubMedQA: A Dataset for Biomedical Research Question Answering

Novel biomedical question answering (QA) dataset collected from PubMed abstracts.
