Wikimedia Enterprise has released a beta dataset on Kaggle, featuring structured Wikipedia content in English and French. Designed with machine learning workflows in mind, this dataset simplifies access to clean, pre-parsed article data that’s immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.
This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content, which makes the dataset well suited to training models, building features, and testing NLP pipelines.
The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections (excluding references and other non-prose elements, which are only available via the Snapshot API). Because all content is derived from Wikipedia, it is freely licensed under Creative Commons Attribution-ShareAlike 4.0 and the GNU Free Documentation License (GFDL), with some additional cases where public domain or alternative licenses may apply.
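As a rough illustration of working with structured JSON of this kind, here is a minimal Python sketch. It assumes the data is stored as JSON Lines (one article object per line) and that each record has fields named `name`, `abstract`, and `sections`. Those field names, the file layout, and the sample record are assumptions made for this example, so check the dataset's documentation on Kaggle for the actual schema before relying on them.

```python
import io
import json

# Hypothetical sample record standing in for one line of a data file.
# The field names ("name", "abstract", "description", "sections") are
# assumptions for illustration, not a confirmed schema.
SAMPLE = io.StringIO(
    json.dumps({
        "name": "Example article",
        "abstract": "A short summary.",
        "description": "example short description",
        "sections": [
            {"name": "History", "text": "Some prose."},
            {"name": "Usage", "text": "More prose."},
        ],
    }) + "\n"
)


def iter_articles(lines):
    """Yield one parsed article dict per non-empty JSON line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(article):
    """Pull out a few fields, tolerating records where they are missing."""
    sections = article.get("sections") or []
    return {
        "title": article.get("name"),
        "abstract": article.get("abstract"),
        "section_names": [s.get("name") for s in sections if isinstance(s, dict)],
    }


summaries = [summarize(a) for a in iter_articles(SAMPLE)]
print(summaries)
```

To run this against a downloaded file, replace `SAMPLE` with an open file handle, for example `open(path, encoding="utf-8")`, where `path` points to one of the dataset's files. Reading line by line keeps memory use low even for large files.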
“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data. Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation. Kaggle is excited to play a role in keeping this data accessible, available and useful.”
— Brenda Flynn, Partnerships Lead, Kaggle
As a beta release, this dataset is an invitation to engage directly with the machine learning community on Kaggle, gather feedback, and refine the datasets for production release. We welcome questions, suggestions, and discussion in the dataset’s discussion tab.
Get the Dataset
Access the dataset directly on Kaggle ➡️
About Kaggle
Kaggle is home to one of the world’s largest communities of machine learning practitioners, researchers, and data enthusiasts. With millions of users and an expansive ecosystem of datasets, notebooks, and competitions (including challenges like the ARC Prize), Kaggle provides an ideal environment for experimenting with open structured data like Wikimedia’s Structured Contents. Whether you’re testing a new architecture, evaluating data quality, or building a pipeline from scratch, this Wikipedia dataset is ready to plug into your process.
Read the announcement on Google Blog.
— Wikimedia Enterprise Team