data
The Pile

the-pile

https://github.com/EleutherAI/the-pile

https://pile.eleuther.ai/

https://arxiv.org/abs/2101.00027

The Pile is a large, diverse, open-source language modeling dataset built by combining 22 smaller datasets into a single corpus of roughly 825 GiB. The goal is to draw text from as many domains and writing styles as possible, so that models trained on The Pile generalize more broadly.
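Because the corpus is a combination of smaller datasets, it is often useful to check which component each document came from. The sketch below is a minimal illustration, not code from the Pile repository. It assumes records are stored as JSON Lines with a `text` field and a `meta.pile_set_name` field naming the source dataset; verify that layout against the files you actually have. The helper `count_by_component` and the sample records are hypothetical.

```python
import json
from collections import Counter
from typing import Iterable


def count_by_component(lines: Iterable[str]) -> Counter:
    """Count documents per source dataset in Pile-style JSON Lines records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        # Assumed layout: {"text": ..., "meta": {"pile_set_name": ...}}
        name = record.get("meta", {}).get("pile_set_name", "unknown")
        counts[name] += 1
    return counts


# Tiny made-up sample standing in for a real shard.
sample_lines = [
    json.dumps({"text": "A web page.", "meta": {"pile_set_name": "Pile-CC"}}),
    json.dumps({"text": "def f(): pass", "meta": {"pile_set_name": "Github"}}),
    json.dumps({"text": "Another web page.", "meta": {"pile_set_name": "Pile-CC"}}),
]

component_counts = count_by_component(sample_lines)
print(component_counts.most_common())
```

For a real shard, the same function can be fed an open file handle, since iterating over a text file yields one line at a time.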

Paper: The Pile: An 800GB Dataset of Diverse Text for Language Modeling

Pile Deduplication Code https://github.com/EleutherAI/pile_dedupe


Proof-Pile-2

https://huggingface.co/datasets/EleutherAI/proof-pile-2