Common Crawl

Common Crawl maintains a free, open repository of web crawl data that can be used by anyone.

Its stated aim is to make wholesale extraction, transformation, and analysis of open web data accessible to researchers.

Over 250 billion pages spanning 17 years. Free and open corpus since 2007. Cited in over 10,000 research papers. 3–5 billion new pages added each month.
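
As a rough illustration of what "usable by anyone" can look like in practice, the sketch below builds the URL of one crawl's WARC file listing and turns that listing into download URLs. It rests on assumptions not stated on this page: that the data is served from `https://data.commoncrawl.org/`, and that each crawl publishes a gzipped `warc.paths.gz` listing under `crawl-data/<crawl id>/`. The helper names `warc_paths_url` and `parse_paths` are hypothetical. Check the current layout against Common Crawl's own documentation before relying on it.

```python
import gzip

# Assumed public download host and path layout (not taken from this page).
BASE_URL = "https://data.commoncrawl.org/"


def warc_paths_url(crawl_id: str) -> str:
    """Build the URL of the gzipped WARC path listing for one crawl."""
    return f"{BASE_URL}crawl-data/{crawl_id}/warc.paths.gz"


def parse_paths(gz_bytes: bytes) -> list[str]:
    """Decompress a path listing and return one full URL per non-empty line."""
    text = gzip.decompress(gz_bytes).decode("utf-8")
    return [BASE_URL + line.strip() for line in text.splitlines() if line.strip()]
```

To try it against the live service, the listing could be fetched with the standard library, for example `urllib.request.urlopen(warc_paths_url("CC-MAIN-2024-10")).read()`, and the bytes passed to `parse_paths`. The crawl ID here is only an example value. Individual WARC files are large, so it is usually worth sampling a few paths rather than downloading a whole crawl.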

LICENSE: ...