
Maths datasets

  • EleutherAI/proof-pile-2
    • Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents, created to train the Llemma 7B and Llemma 34B models. It consists of three subsets:
      • arxiv (29B tokens): The arXiv subset of RedPajama.
      • open-web-math (15B tokens): The OpenWebMath dataset, which contains much of the high-quality mathematical text from the internet.
      • algebraic-stack (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
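
The subset sizes listed above add up to the stated 55B total. The sketch below checks that arithmetic and shows one way a single subset might be streamed from the Hugging Face Hub. It assumes the Hub configuration names match the subset names in the list (`arxiv`, `open-web-math`, `algebraic-stack`) and that a `train` split exists. The helper `stream_subset` is a hypothetical name introduced for illustration, and recent versions of the `datasets` library may need extra arguments to load this repository.

```python
# Subset sizes in billions of tokens, copied from the list above.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}

# Total size and each subset's share of the whole dataset.
total_tokens_b = sum(SUBSET_TOKENS_B.values())
shares = {name: n / total_tokens_b for name, n in SUBSET_TOKENS_B.items()}


def stream_subset(name, split="train"):
    """Hypothetical helper: stream one subset instead of downloading it all.

    Assumes the Hub config names equal the subset names listed above and
    that the requested split exists. Requires the `datasets` package and
    network access.
    """
    if name not in SUBSET_TOKENS_B:
        raise ValueError(f"unknown subset {name!r}; expected one of {sorted(SUBSET_TOKENS_B)}")
    # Imported lazily so the arithmetic above runs without `datasets` installed.
    from datasets import load_dataset

    return load_dataset("EleutherAI/proof-pile-2", name, split=split, streaming=True)


if __name__ == "__main__":
    print(f"total: {total_tokens_b}B tokens")
    for name, share in shares.items():
        print(f"{name}: {SUBSET_TOKENS_B[name]}B tokens ({share:.1%})")
```

Streaming is used here because the full dataset is large; calling `stream_subset("open-web-math")` would return an iterable that yields records lazily, so a few documents can be inspected before committing to a full download.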