Maths datasets
- EleutherAI/proof-pile-2
- Proof-Pile-2 is a 55-billion-token dataset of mathematical and scientific documents, created to train the Llemma 7B and Llemma 34B models. It consists of three subsets:
- arxiv (29B tokens): The ArXiv subset of RedPajama.
- open-web-math (15B tokens): The OpenWebMath dataset, which contains much of the high-quality mathematical text from the internet.
- algebraic-stack (11B tokens): A new dataset of mathematical code, including numerical computing, computer algebra, and formal mathematics.
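As a quick sanity check on the figures above, the following minimal sketch sums the three subset sizes and computes each subset's share of the total. The token counts are taken from the list; the names `SUBSET_TOKENS_B` and `subset_shares` are hypothetical helpers introduced only for illustration and are not part of any dataset tooling.

```python
# Token counts in billions, copied from the subset list above.
SUBSET_TOKENS_B = {
    "arxiv": 29,
    "open-web-math": 15,
    "algebraic-stack": 11,
}


def subset_shares(tokens_b):
    """Return each subset's fraction of the total token count."""
    total = sum(tokens_b.values())
    return {name: count / total for name, count in tokens_b.items()}


total_b = sum(SUBSET_TOKENS_B.values())
shares = subset_shares(SUBSET_TOKENS_B)

print(f"total: {total_b}B tokens")  # total: 55B tokens
for name, share in shares.items():
    # arxiv: 52.7%, open-web-math: 27.3%, algebraic-stack: 20.0%
    print(f"{name}: {share:.1%}")
```

The three subsets add up to the stated 55B tokens, with the ArXiv subset accounting for roughly half of the data.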