⚠️ NOTE: The full dataset is coming very soon! ⚠️


Dolma 3 Mix (5.5T)

Dolma 3 Mix (5.5T) is the data collection used in the pretraining stage of the Olmo-3-1125-32B model. It comprises ~5.5 trillion tokens drawn from a diverse mix of web content, academic publications, code, and more. The majority of the data comes from Common Crawl.

For more information on Dolma, please see our original release here.

Licensing Information

Dolma 3 Mix is licensed under the Open Data Commons Attribution License v1.0 (ODC-By). It is intended for research and educational use. For more information, please see our Responsible Use Guidelines.

Citation

A technical manuscript is forthcoming! Find the paper at: https://allenai.org/papers/olmo3

Models trained or fine-tuned on allenai/dolma3_mix-5.5T-1125