Inspect Evals
Inspect Evals is a repository of community-contributed evaluations featuring implementations of many popular benchmarks and papers. These evals can be pip-installed and run with a single command against any model. They are also useful as a learning resource, as they demonstrate a wide variety of evaluation types and techniques.
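As an illustration of the single-command workflow, below is a minimal sketch of running one of these evals from Python. It assumes the inspect-ai and inspect_evals packages are installed, and that the gpqa_diamond task name matches your installed version (check the repository for current task names).

```python
# Minimal sketch: run a contributed eval from Python. Assumes the
# packages are installed, e.g.:
#   pip install inspect-ai
#   pip install git+https://github.com/UKGovernmentBEIS/inspect_evals
from inspect_ai import eval
from inspect_evals.gpqa import gpqa_diamond  # task name assumed current

# Run the task against a chosen model; any Inspect-supported provider
# (openai/..., anthropic/..., google/..., etc.) can be substituted.
eval(gpqa_diamond, model="openai/gpt-4o")
```

The equivalent single command from the CLI is `inspect eval inspect_evals/gpqa_diamond --model openai/gpt-4o`.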
Coding
- HumanEval: Evaluating correctness for synthesizing Python programs from docstrings. Demonstrates custom scorers and sandboxing untrusted model code (a minimal custom-scorer sketch follows this list).
- MBPP: Measuring the ability of language models to synthesize short Python programs from natural language descriptions. Demonstrates custom scorers and sandboxing untrusted model code.
- SWE-bench: Software engineering problems drawn from real GitHub issues and corresponding pull requests across 12 popular Python repositories. Demonstrates sandboxing untrusted model code.
- DS-1000: Code generation benchmark with a thousand data science problems spanning seven Python libraries.
- BigCodeBench: Python coding benchmark with 1,140 diverse questions drawing on numerous Python libraries.
- ClassEval: Evaluates LLMs on class-level code generation with 100 tasks constructed over 500 person-hours. LLMs perform worse on class-level tasks than on method-level tasks; GPT-4 and GPT-3.5 outperform other models, with holistic generation the best strategy for them while incremental generation works better for other models, and performance is highly correlated with the number of tokens in the prompt.
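Several of the coding evals above demonstrate custom scorers. The sketch below shows the general shape of a custom scorer in Inspect; the substring_scorer here is illustrative only, not the actual HumanEval scorer (which executes generated code in a sandbox and checks the result).

```python
from inspect_ai.scorer import CORRECT, INCORRECT, Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState

# Hypothetical scorer: mark the completion correct when the target
# string appears in the model output. Real coding evals instead run
# the generated program in a sandbox and verify its behavior.
@scorer(metrics=[accuracy()])
def substring_scorer():
    async def score(state: TaskState, target: Target) -> Score:
        answer = state.output.completion
        return Score(
            value=CORRECT if target.text in answer else INCORRECT,
            answer=answer,
        )
    return score
```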
Assistants
- GAIA: Proposes real-world questions that require a set of fundamental abilities such as reasoning, multi-modality handling, web browsing, and general tool-use proficiency. GAIA questions are conceptually simple for humans yet challenging for most advanced AIs.
- AssistantBench: Tests whether AI agents can perform real-world, time-consuming tasks on the web. Includes 214 realistic tasks covering a variety of scenarios and domains.
Cybersecurity
- Cybench: 40 professional-level Capture the Flag (CTF) tasks from 4 distinct CTF competitions, chosen to be recent, meaningful, and spanning a wide range of difficulties.
- CyberMetric: Datasets containing 80, 500, 2,000, and 10,000 multiple-choice questions, designed to evaluate understanding across nine domains within cybersecurity.
- CyberSecEval_2: Evaluates Large Language Models for risky capabilities in cybersecurity.
- InterCode CTF: Measures expertise in coding, cryptography (e.g. binary exploitation, forensics), reverse engineering, and recognizing security vulnerabilities. Demonstrates tool use and sandboxing untrusted model code (see the tool-use sketch after this list).
- GDM In-House CTF: CTF challenges covering web app vulnerabilities, off-the-shelf exploits, databases, Linux privilege escalation, and password cracking and spraying. Demonstrates tool use and sandboxing untrusted model code.
- SEvenLLM: Benchmark designed for analyzing cybersecurity incidents, comprising two primary task categories, understanding and generation, with a further breakdown into 28 task subcategories.
- SecQA: "Security Question Answering" dataset to assess LLMs' understanding and application of security principles. SecQA has "v1" and "v2" datasets of multiple-choice questions that aim to provide two levels of cybersecurity evaluation criteria. The questions were generated by GPT-4 based on the "Computer Systems Security: Planning for Success" textbook and vetted by humans.
Safeguards
- AgentHarm: Diverse set of 110 explicitly malicious agent tasks (440 with augmentations), covering 11 harm categories including fraud, cybercrime, and harassment.
- WMDP: A dataset of 3,668 multiple-choice questions developed by a consortium of academics and technical consultants that serve as a proxy measurement of hazardous knowledge in biosecurity, cybersecurity, and chemical security.
Mathematics
- MATH: Dataset of 12,500 challenging competition mathematics problems. Demonstrates few-shot prompting and custom scorers.
- GSM8K: Dataset of 8.5K high-quality, linguistically diverse grade school math word problems. Demonstrates few-shot prompting.
- MathVista: Diverse mathematical and visual tasks that require fine-grained, deep visual understanding and compositional reasoning. Demonstrates multimodal inputs and custom scorers.
- MGSM: Extends the original GSM8K dataset by translating 250 of its problems into 10 typologically diverse languages.
Reasoning
- V*Bench: Visual question & answer benchmark that evaluates MLLMs on their ability to process high-resolution and visually crowded images to find and focus on small details.
- ARC: Dataset of natural, grade-school science multiple-choice questions (authored for human tests).
- HellaSwag: Evaluates commonsense natural language inference: given an event description such as "A woman sits at a piano," a machine must select the most likely follow-up.
- PIQA: Measures physical commonsense reasoning (e.g. "To apply eyeshadow without a brush, should I use a cotton swab or a toothpick?").
- ∞Bench: LLM benchmark featuring an average data length surpassing 100K tokens. Comprises synthetic and realistic tasks spanning diverse domains in English and Chinese.
- BBH: Suite of 23 challenging BIG-Bench tasks for which prior language model evaluations did not outperform the average human rater.
- BoolQ: Reading comprehension dataset that queries for complex, non-factoid information and requires difficult entailment-like inference to solve.
- DocVQA: Visual Question Answering benchmark consisting of 50,000 questions covering 12,000+ document images. This implementation solves and scores the "validation" split.
- DROP: Evaluates reading comprehension where models must resolve references in a question, perhaps to multiple input positions, and perform discrete operations over them (such as addition, counting, or sorting).
- WinoGrande: Set of 273 expert-crafted pronoun resolution problems originally designed to be unsolvable for statistical models that rely on selectional preferences or word associations.
- RACE-H: Reading comprehension tasks collected from the English exams for middle and high school Chinese students aged between 12 and 18.
- MMMU: Multimodal questions from college exams, quizzes, and textbooks covering six core disciplines, demanding college-level subject knowledge and deliberate reasoning. Demonstrates multimodal inputs.
- SQuAD: Set of 100,000+ questions posed by crowdworkers on a set of Wikipedia articles, where the answer to each question is a segment of text from the corresponding reading passage.
- IFEval: Evaluates the ability to follow a set of "verifiable instructions" such as "write in more than 400 words" and "mention the keyword of AI at least 3 times". Demonstrates custom scoring.
- MuSR: Evaluates models on multistep soft reasoning tasks in the form of free-text narratives.
- NIAH: Evaluates the in-context retrieval ability of long-context LLMs by testing a model's ability to extract factual information from long-context inputs.
- PAWS: Evaluates models on the task of paraphrase detection, providing pairs of sentences that are either paraphrases or not.
Knowledge
- MMLU: Evaluates models on 57 tasks including elementary mathematics, US history, computer science, law, and more.
- MMLU-Pro: An enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options.
- GPQA: Challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry (PhD-level experts in the corresponding domains reach 65% accuracy).
- CommonsenseQA: Measures question answering with commonsense prior knowledge.
- TruthfulQA: Measures whether a language model is truthful in generating answers to questions, using questions that some humans would answer falsely due to a false belief or misconception.
- XSTest: Dataset with 250 safe prompts across ten prompt types that well-calibrated models should not refuse, and 200 unsafe prompts as contrasts that models, for most applications, should refuse.
- PubMedQA: Novel biomedical question answering (QA) dataset collected from PubMed abstracts.
- AGIEval: Human-centric benchmark specifically designed to evaluate the general abilities of foundation models in tasks pertinent to human cognition and problem-solving.