allenai open-data
allenai.org/open-data
-
WildChat https://wildchat.allen.ai/
- The WildChat Dataset is a corpus of 1 million real-world user-ChatGPT interactions, characterized by a wide range of languages and a diversity of user prompts. It was constructed by offering free access to ChatGPT and GPT-4 in exchange for consensual chat history collection.
-
Super-NaturalInstructions https://instructions.apps.allenai.org/
- 1,616 diverse NLP tasks spanning 76 distinct task types, each paired with expert-written instructions, used to measure how well NLP models generalize to unseen tasks when given clear guidance (a sketch of the task format follows below).
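A minimal sketch of assembling a few-shot prompt from one task file, assuming the JSON schema used by the companion natural-instructions repository ("Definition", "Positive Examples", "Instances"); the filename is hypothetical, so check an actual task file before relying on these field names:

```python
import json

# Hypothetical task file; field names assume the natural-instructions
# repo schema ("Definition", "Positive Examples", "Instances").
with open("task001_quoref_question_generation.json") as f:
    task = json.load(f)

def build_prompt(task, instance, k=2):
    """Assemble a few-shot prompt: the task definition, k worked
    examples, then the unseen instance to be completed."""
    parts = ["Definition: " + " ".join(task["Definition"])]
    for ex in task["Positive Examples"][:k]:
        parts.append(f"Input: {ex['input']}\nOutput: {ex['output']}")
    parts.append(f"Input: {instance['input']}\nOutput:")
    return "\n\n".join(parts)

print(build_prompt(task, task["Instances"][0]))
```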
-
Self-Instruct https://github.com/yizhongw/self-instruct
- A framework that improves language models' ability to follow natural-language instructions by using the model's own generations to create a large collection of instructional data (one bootstrap step is sketched below).
- paper: Self-Instruct: Aligning Language Models with Self-Generated Instructions
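A minimal sketch of one instruction-bootstrapping step in the spirit of the paper's recipe, not the authors' pipeline; `complete` is a hypothetical LLM text-completion callable, and the exact-match novelty check stands in for the paper's ROUGE-L filtering:

```python
import random

def self_instruct_step(complete, seed_instructions, generated):
    """One bootstrap step: show the model a mix of existing instructions,
    ask it to continue the list, and keep the result only if it is novel.
    `complete` is a hypothetical LLM text-completion callable."""
    pool = seed_instructions + generated
    demos = random.sample(pool, min(8, len(pool)))
    prompt = ("Come up with a new task instruction.\n"
              + "\n".join(f"{i}. {d}" for i, d in enumerate(demos, 1))
              + f"\n{len(demos) + 1}.")
    candidate = complete(prompt).strip()
    # Crude exact-match novelty filter; the paper filters near-duplicates
    # by ROUGE-L overlap and then generates instances for each instruction.
    if all(candidate.lower() != d.lower() for d in pool):
        generated.append(candidate)
    return generated
```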
-
S2ORC
- A large corpus of structured full text for English-language open-access academic papers. It is the largest publicly available collection of machine-readable academic text, comprising over 10M documents, and aims to facilitate research and development of text-mining tools for academic text.
- paper: S2ORC: The Semantic Scholar Open Research Corpus
-
S2AG
- A collection of over 200M paper titles, abstracts, citations, and other metadata of open-access papers from the Semantic Scholar Academic Graph.
- paper: The Semantic Scholar Open Data Platform
-
HellaSwag https://rowanzellers.com/hellaswag/
- A challenge dataset of questions that are trivial for humans (>95% accuracy) but that state-of-the-art models struggle with (<48%), created through a collection paradigm in which a series of discriminators iteratively selects an adversarial set of machine-generated wrong answers (a minimal loading sketch follows the links below).
- paper: HellaSwag: Can a Machine Really Finish Your Sentence?
- github
- data
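A minimal sketch of inspecting the four-way completion format via the Hugging Face datasets library; the mirror name "Rowan/hellaswag" and the field names below are assumptions, so check the dataset card for the exact schema:

```python
from datasets import load_dataset  # pip install datasets

# Mirror name and fields (ctx, endings, label) are assumptions.
ds = load_dataset("Rowan/hellaswag", split="validation")
ex = ds[0]
print(ex["ctx"])                            # context to be completed
for i, ending in enumerate(ex["endings"]):  # four candidate completions
    print(i, ending)
print("gold:", ex["label"])                 # index of the human-written ending
```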
-
WinoGrande https://huggingface.co/datasets/allenai/winogrande
- WinoGrande is a collection of 44K problems, inspired by the Winograd Schema Challenge but adjusted to improve scale and robustness against dataset-specific bias. The task is formulated as fill-in-the-blank with binary options: the goal is to choose the right option for a given sentence, which requires commonsense reasoning (see the loading sketch below).
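A minimal loading sketch via the Hugging Face datasets library; the config name "winogrande_xl" (the full training set) and the field names below are assumptions, so verify against the dataset card:

```python
from datasets import load_dataset  # pip install datasets

# Config name and fields (sentence, option1/2, answer) are assumptions.
ds = load_dataset("allenai/winogrande", "winogrande_xl", split="validation")
ex = ds[0]
print(ex["sentence"])                          # contains a "_" blank to fill
print("1:", ex["option1"], "2:", ex["option2"])
print("answer:", ex["answer"])                 # "1" or "2"
```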
-
SciRIFF https://huggingface.co/datasets/allenai/SciRIFF
- 137K instruction-following demonstrations for 54 scientific literature understanding tasks. The tasks cover five essential categories of scientific literature understanding and span five domains.
-
KIWI https://huggingface.co/datasets/fangyuan/kiwi
- Instruction data collected for writing paragraph-level answers to NLP research questions, grounded in multiple documents. It was collected via 234 interactive sessions in which NLP experts instructed different language models, culminating in 1.2K interaction turns.
-
CHIME https://github.com/allenai/chime
- 2.1K LLM-generated hierarchical organizations of medical studies on 472 research topics, with expert-provided corrections for a subset of 100 topics. This data can be used to assess and improve LLM-based tools to assist literature review.
-
SciFact https://huggingface.co/datasets/allenai/scifact
- 1.4K expert-written scientific claims paired with evidence-containing abstracts annotated with labels and rationales to support the development of scientific claim verification systems. It’s been used in shared tasks like SCIVER and retrieval benchmarks like BEIR.
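A minimal sketch of loading the claim and abstract sides of the dataset via the Hugging Face datasets library; the config names ("claims", "corpus") and the "claim" field are assumptions drawn from the dataset card, so verify the schema before use:

```python
from datasets import load_dataset  # pip install datasets

# Config names and fields are assumptions; each claim carries evidence
# annotations (labels and rationale sentences) pointing into the corpus.
claims = load_dataset("allenai/scifact", "claims", split="train")
corpus = load_dataset("allenai/scifact", "corpus", split="train")
print(claims[0]["claim"])   # an expert-written scientific claim
print(len(corpus), "abstracts available as evidence candidates")
```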
-
SciTLDR https://huggingface.co/datasets/allenai/scitldr
- 5.4K extremely short (<30 words) expert-written summaries of 3.2K scientific papers, used to develop models for single-document summarization and to build the initial version of the TLDR feature on Semantic Scholar.
-
Ai2 Reasoning Challenge (ARC)
- 7,787 genuine grade-school level, multiple-choice science questions partitioned into a Challenge Set and an Easy Set, along with a corpus of over 14 million science sentences relevant to the task. Offered as a challenge to the machine reasoning community.
- paper: Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge
-
DROP
- A QA dataset that tests comprehensive understanding of paragraphs. In this crowdsourced, adversarially created, 96K-question benchmark, a system must resolve multiple references in a question, map them onto a paragraph, and perform discrete operations over them, such as addition, counting, or sorting (a toy example of these operations follows below).
- paper: DROP: A Reading Comprehension Benchmark Requiring Discrete Reasoning Over Paragraphs
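A toy illustration of the discrete operations DROP probes; the passage is invented, not from the dataset, and the point is only that answers require computing over extracted numbers rather than copying a span:

```python
import re

# Toy passage illustrating DROP-style discrete reasoning: extract the
# numbers a question refers to, then count, add, or compare them.
passage = ("The Bears scored on a 12-yard pass, a 32-yard field goal, "
           "and a 5-yard run.")
yards = [int(n) for n in re.findall(r"(\d+)-yard", passage)]
print("count:", len(yards))    # counting   -> 3
print("total:", sum(yards))    # addition   -> 49
print("longest:", max(yards))  # comparison -> 32
```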
-
Qasper https://huggingface.co/datasets/allenai/qasper
- 5K information-seeking questions over 1.5K scientific papers. Each question is asked by an expert researcher and answered by a different expert researcher using supporting evidence from the paper's full text. Qasper has been included in long-context benchmarks such as SCROLLS.
-
MS^2 https://huggingface.co/datasets/allenai/mslr2022
- 20K biomedical literature review summaries synthesizing information from over 470K studies. This dataset facilitates the development of systems that can assess and aggregate contradictory evidence across multiple studies, and is one of the first large-scale, publicly available multi-document summarization datasets in the biomedical domain.
-
HCI alt texts https://github.com/allenai/hci-alt-texts
- 3,386 author-written alt texts from HCI publications, of which 547 have been annotated with semantic content. Most figures in scientific papers lack alt text, harming accessibility; this dataset can be used to build tools for understanding and describing figures, leading to a higher prevalence of alt texts.