the-stack-v2
https://huggingface.co/datasets/bigcode/the-stack-v2
The Stack v2 contains over 3B files in 600+ programming and markup languages. The dataset was created as part of the BigCode Project, an open scientific collaboration working on the responsible development of Large Language Models for Code (Code LLMs). The Stack serves as a pre-training dataset for Code LLMs, i.e., code-generating AI systems that enable the synthesis of programs from natural language descriptions as well as from other code snippets.
This dataset is derived from the Software Heritage archive, the largest public archive of software source code and accompanying development history. Software Heritage is an open, non-profit initiative to collect, preserve, and share the source code of all publicly available software, launched by Inria in partnership with UNESCO. We acknowledge Software Heritage for providing access to this invaluable resource. For more details, visit the Software Heritage website.
Languages
The dataset contains 658 languages. The full list can be found in the language stats table.