AllenAI: S2ORC and S2AG

allenai.org/ai-for-science

Ai2 collects large corpora of scientific text and provides open, programmatic access to them, to encourage researchers and developers to create tools for better understanding and engaging with scientific publications.

S2ORC

  • S2ORC - A large corpus of structured full text for English-language open access academic papers. It is the largest publicly available collection of machine-readable academic text, comprising over 10M documents.
    • S2ORC: The Semantic Scholar Open Research Corpus
    • The paper introduces S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines, intended to facilitate research and development of tools and tasks for text mining over academic text.
    • We introduce S2ORC, a large corpus of 81.1M English-language academic papers spanning many academic disciplines. The corpus consists of rich metadata, paper abstracts, resolved bibliographic references, as well as structured full text for 8.1M open access papers. Full text is annotated with automatically-detected inline mentions of citations, figures, and tables, each linked to their corresponding paper objects. In S2ORC, we aggregate papers from hundreds of academic publishers and digital archives into a unified source, and create the largest publicly-available collection of machine-readable academic text to date. We hope this resource will facilitate research and development of tools and tasks for text mining over academic text.
    • paper pdf
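
The abstract above describes full text annotated with inline citation mentions, each linked to a paper object. A minimal sketch of how such annotations could be resolved is shown here. The record layout and every field name (`body_text`, `cite_spans`, `bib_entries`, `linked_paper_id`) are hypothetical, chosen only for illustration; they are not taken from the S2ORC release, so check the actual schema before relying on them.

```python
# Hypothetical record: field names are illustrative, NOT the real S2ORC schema.
record = {
    "paper_id": "p1",
    "body_text": "Prior work [1] studied this.",
    # Character offsets of each inline citation mention in body_text.
    "cite_spans": [{"start": 11, "end": 14, "ref_id": "b1"}],
    # Bibliography entries, each optionally linked to another paper object.
    "bib_entries": {
        "b1": {"title": "An Example Cited Paper", "linked_paper_id": "p2"},
    },
}


def resolve_citations(rec):
    """Return (mention text, linked paper id) for each inline citation span."""
    text = rec["body_text"]
    resolved = []
    for span in rec["cite_spans"]:
        mention = text[span["start"]:span["end"]]
        entry = rec["bib_entries"].get(span["ref_id"], {})
        # linked_paper_id is None when the reference could not be resolved.
        resolved.append((mention, entry.get("linked_paper_id")))
    return resolved


print(resolve_citations(record))
```

Storing mentions as character offsets rather than rewriting the text keeps the original passage intact while still allowing each mention to be joined to its bibliography entry.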

S2AG