code-pile

CarperAI/Code-Pile
- This repository contains all the code for collecting large amounts of code from GitHub.
- codepile proposal
- Foundation models in NLP have unlocked numerous applications and serve as building blocks for specialized models via fine-tuning. Foundation models for software engineering could play a similar role, from powering coding-assistant applications to serving as building blocks for CarperAI's reinforcement learning projects. To enable the training of these foundation models, we will collect software-engineering-specific data that goes beyond the GitHub source code that current efforts focus on. This includes Stack Overflow, documentation sites of popular libraries and frameworks, tutorial websites such as TutorialsPoint and GeeksforGeeks, programming-specific Reddit communities, and other GitHub repository data such as issues, pull requests, community discussions, and diffs. To better understand the data these foundation models are trained on, we will pay special attention to the statistics of vulnerable code.
- Paper: https://arxiv.org/pdf/2101.00027.pdf
- Repository: https://github.com/EleutherAI/the-pile
