Wikipedia Kaggle Dataset using Structured Contents Snapshot

16 Apr 2025

Wikimedia Enterprise has released a beta dataset on Kaggle, featuring structured Wikipedia content in English and French. Designed with machine learning workflows in mind, this dataset simplifies access to clean, pre-parsed article data that’s immediately usable for modeling, benchmarking, alignment, fine-tuning, and exploratory analysis.

This release is powered by our Snapshot API’s Structured Contents beta, which outputs Wikimedia project data in a developer-friendly, machine-readable format. Instead of scraping or parsing raw article text, Kaggle users can work directly with well-structured JSON representations of Wikipedia content, which makes the dataset well suited to training models, building features, and testing NLP pipelines.

The dataset upload, as of 15 April 2025, includes high-utility elements such as abstracts, short descriptions, infobox-style key-value data, image links, and clearly segmented article sections. References and other non-prose elements are excluded; they are only available via the Snapshot API. Because all content is derived from Wikipedia, it is freely licensed under Creative Commons Attribution-ShareAlike 4.0 and the GNU Free Documentation License (GFDL), with some additional cases where public domain or alternative licenses may apply.
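To give a rough idea of what working with pre-parsed article data could look like, here is a minimal Python sketch. It assumes the records are stored one JSON object per line (JSON Lines) and that fields are named `name`, `description`, `abstract`, and `sections`; both the file layout and the field names are assumptions made for illustration, so check the dataset's own documentation on Kaggle for the actual schema. The sample records are invented placeholders, not real dataset content.

```python
import json
from typing import Any, Dict, Iterable, Iterator, List

# Hypothetical sample data: two made-up records in JSON Lines form.
# The field names below are assumptions, not the confirmed dataset schema.
SAMPLE_JSONL = "\n".join([
    json.dumps({
        "name": "Example Article",
        "description": "a short description",
        "abstract": "An abstract paragraph.",
        "sections": [
            {"name": "History", "text": "Some history."},
            {"name": "Usage", "text": "Some usage notes."},
        ],
    }),
    json.dumps({"name": "Stub Article"}),  # record with missing optional fields
])


def iter_records(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one parsed article per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def summarize(record: Dict[str, Any]) -> Dict[str, Any]:
    """Pull out a few fields, tolerating records where they are absent."""
    sections: List[Dict[str, Any]] = record.get("sections") or []
    return {
        "title": record.get("name", ""),
        "description": record.get("description", ""),
        "abstract": record.get("abstract", ""),
        "section_names": [s.get("name", "") for s in sections],
    }


summaries = [summarize(r) for r in iter_records(SAMPLE_JSONL.splitlines())]
for s in summaries:
    print(s["title"], s["section_names"])
```

With a real download, the same `iter_records` function could be fed an open file handle instead of the in-memory sample, so that large files are read one line at a time rather than loaded whole.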

“As the place the machine learning community comes for tools and tests, Kaggle is extremely excited to be the host for the Wikimedia Foundation’s data. Kaggle is already a top place people go to find datasets, and there are few open datasets that have more impact than those hosted by the Wikimedia Foundation. Kaggle is excited to play a role in keeping this data accessible, available and useful.”

— Brenda Flynn, Partnerships Lead, Kaggle

As a beta release, this dataset is an invitation to engage directly with the machine learning community on Kaggle, gather feedback, and refine the datasets for production release. We welcome questions, suggestions, and discussion in the dataset’s discussion tab.

Get the Dataset

Access the dataset directly on Kaggle ➡️

About Kaggle

Kaggle is home to one of the world’s largest communities of machine learning practitioners, researchers, and data enthusiasts. With millions of users and an expansive ecosystem of datasets, notebooks, and competitions (including challenges like the ARC Prize), Kaggle provides an ideal environment for experimenting with open structured data like Wikimedia’s Structured Contents. Whether you’re testing a new architecture, evaluating data quality, or building a pipeline from scratch, this Wikipedia dataset is ready to plug into your process.

Read the announcement on Google Blog.

Sign up to get the latest Snapshot API datasets for free

— Wikimedia Enterprise Team
