Hugging Face's Datasets library offers an efficient way to load and process NLP datasets from raw files or in-memory data. Academic and practitioner communities around the world have contributed many of these datasets. The library can also load the evaluation metrics used to measure how well NLP models perform on various tasks.
Hugging Face Datasets for Building NLP Models
The Datasets library from Hugging Face is very useful if you are working in natural language processing and need an NLP dataset for your upcoming project. It integrates with other well-known machine learning frameworks such as NumPy, Pandas, PyTorch, and TensorFlow. The NLP datasets cover more than 186 languages, and all of them can be viewed and explored online with the Datasets viewer or by browsing the Hugging Face Hub.
In this article, we will learn how to download, load, set up, and use NLP datasets from the collection of Hugging Face datasets.
Installation of the Datasets Library
Installation takes only a few minutes. Use pip as follows:
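pip install datasets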
List of all the pre-defined datasets
Use the list_datasets() function from the library to get a list of all the datasets that are available.
Python3
from datasets import list_datasets

datasets_list = list_datasets()
print('Total datasets:', len(datasets_list))
print('First 5 datasets names:\n', datasets_list[:5])
Output:
<ipython-input-5-c9fc4daa7c73>:4: FutureWarning: list_datasets is deprecated and will be removed in the next major version of datasets. Use 'huggingface_hub.list_datasets' instead.
datasets_list = list_datasets()
Total datasets: 79631
First 5 datasets names:
['acronym_identification', 'ade_corpus_v2', 'adversarial_qa', 'aeslc', 'afrikaans_ner_corpus']
At the time of writing, there are 79631 datasets in the collection that can be used to build various NLP solutions.
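Note that, as the FutureWarning in the output shows, list_datasets in the datasets library is deprecated in favor of huggingface_hub. The legacy function still works here, but a minimal sketch of the recommended replacement (assuming huggingface_hub is installed) looks like this:
Python3
from huggingface_hub import list_datasets

# huggingface_hub.list_datasets returns an iterator of DatasetInfo objects;
# limit restricts how many entries are fetched from the Hub
for ds in list_datasets(limit=5):
    print(ds.id)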
A list of the datasets with details can also be obtained by calling list_datasets() with the with_details parameter set to True.
Python3
from datasets import list_datasets

datasets_list1 = list_datasets(with_details=True)
print('Total datasets:', len(datasets_list1))
print('First dataset details:\n', datasets_list1[0])
Output:
Total datasets: 79631
First dataset details:
DatasetInfo(id='acronym_identification', author=None, sha='c3c245a18bbd57b1682b099e14460eebf154cbdf', last_modified=datetime.datetime(2023, 1, 25, 14, 18, 28, tzinfo=datetime.timezone.utc), private=False, gated=False, disabled=False, downloads=1823, likes=17, paperswithcode_id='acronym-identification', tags=['task_categories:token-classification', 'annotations_creators:expert-generated', 'language_creators:found', 'multilinguality:monolingual', 'size_categories:10K<n<100K', 'source_datasets:original', 'language:en', 'license:mit', 'acronym-identification', 'arxiv:2010.14678', 'region:us'], card_data={'annotations_creators': ['expert-generated'], 'language_creators': ['found'], 'language': ['en'], 'license': ['mit'], 'multilinguality': ['monolingual'], 'size_categories': ['10K<n<100K'], 'source_datasets': ['original'], 'task_categories': ['token-classification'], 'task_ids': [], 'paperswithcode_id': 'acronym-identification', 'pretty_name': 'Acronym Identification Dataset', 'config_names': None, 'train_eval_index': [{'config': 'default', 'task': 'token-classification', 'task_id': 'entity_extraction', 'splits': {'eval_split': 'test'}, 'col_mapping': {'tokens': 'tokens', 'labels': 'tags'}}], 'tags': ['acronym-identification'], 'dataset_info': {'features': [{'name': 'id', 'dtype': 'string'}, {'name': 'tokens', 'sequence': 'string'}, {'name': 'labels', 'sequence': {'class_label': {'names': {'0': 'B-long', '1': 'B-short', '2': 'I-long', '3': 'I-short', '4': 'O'}}}}], 'splits': [{'name': 'train', 'num_bytes': 7792803, 'num_examples': 14006}, {'name': 'validation', 'num_bytes': 952705, 'num_examples': 1717}, {'name': 'test', 'num_bytes': 987728, 'num_examples': 1750}], 'download_size': 8556464, 'dataset_size': 9733236}}, siblings=None)
Each entry in this detailed list includes the dataset's name, metadata, and file information (if the dataset has been downloaded to your local machine).
Loading the downloaded dataset
Pass the dataset's name (and, optionally, a configuration name and split) to the load_dataset() function to load a dataset from the library.
The load_dataset() function will:
- Download the dataset's processing script from the Hugging Face GitHub repository and import it into the library.
- Run the script to download the data files.
- Return the dataset requested by the user. If no split is specified, it retrieves the complete dataset.
Python3
from datasets import load_dataset

dataset = load_dataset('ethos', 'binary', split='train')
print(dataset)
Output:
Dataset({
features: ['text', 'label'],
num_rows: 998
})
Here, we loaded the train split of the binary configuration of the ethos dataset from the Hugging Face Hub.
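The loaded split is a single Dataset object, so it can be divided into train and test subsets with the built-in train_test_split() method. Here is a minimal sketch, assuming an 80/20 split:
Python3
# split the single train split into train and test subsets (80/20, fixed seed)
split_dataset = dataset.train_test_split(test_size=0.2, seed=42)
print(split_dataset['train'].num_rows, split_dataset['test'].num_rows)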
Load a pre-trained BERT model
Load the pre-trained bert-base-uncased model (https://huggingface.co/bert-base-uncased):
Python3
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased")
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
Tokenization
The map() method applies the encode() function to batches of examples, adding the tokenizer's input_ids, token_type_ids, and attention_mask columns to the dataset. Truncation and max-length padding give every example the same length.
Python3
def encode(data):
    return tokenizer(data["text"], truncation=True, padding="max_length")

dataset_ = dataset.map(encode, batched=True)
print('Keys:', dataset_[0].keys())
print('Text:', dataset_[0]['text'])
print('label:', dataset_[0]['label'])
print('Input Id:\n', dataset_[0]['input_ids'])
print('Token type ids:', dataset_[0]['token_type_ids'])
print('Mask:', dataset_[0]['attention_mask'])
Output:
Keys: dict_keys(['text', 'label', 'input_ids', 'token_type_ids', 'attention_mask'])
Text: You should know women's sports are a joke
label: 1
Input Id: [101, 2017, 2323, 2113, 2308, 1005, 1055, 2998, 2024, 1037, 8257, 102, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Token type ids: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Mask: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Dataset for PyTorch
set_format() converts the listed columns to PyTorch tensors, so the dataset can be wrapped directly in a DataLoader.
Python3
import torch

dataset_.set_format(
    type="torch",
    columns=["input_ids", "token_type_ids", "attention_mask", "label"])
dataloader = torch.utils.data.DataLoader(dataset_, batch_size=32)
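As a quick sanity check (a minimal sketch, assuming the model and dataloader defined above), one batch can be run through the model in a forward pass:
Python3
# pull one batch from the dataloader and run it through the model
batch = next(iter(dataloader))
with torch.no_grad():
    outputs = model(input_ids=batch["input_ids"],
                    attention_mask=batch["attention_mask"],
                    token_type_ids=batch["token_type_ids"])
# one logit per class for each example in the batch
print(outputs.logits.shape)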
Dataset for TensorFlow
to_tf_dataset() wraps the dataset in a tf.data.Dataset that yields batches of (features, labels) pairs, using the data collator to assemble each batch.
Python3
import tensorflow as tf
from transformers import DataCollatorWithPadding

data_collator = DataCollatorWithPadding(tokenizer=tokenizer, return_tensors="tf")
tf_dataset = dataset_.to_tf_dataset(
    columns=["input_ids", "token_type_ids", "attention_mask"],
    label_cols=["label"],
    batch_size=32,
    collate_fn=data_collator,
    shuffle=True)
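To confirm the pipeline works end to end (a minimal sketch, assuming the tf_dataset built above), a single batch can be inspected; because label_cols is set, the dataset yields (features, labels) pairs:
Python3
# inspect a single batch: features is a dict of tensors, labels a 1-D tensor
for features, labels in tf_dataset.take(1):
    print(features["input_ids"].shape, labels.shape)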
Conclusion
In this post, we learned how to browse the collection of Hugging Face datasets, load one with load_dataset(), split it into train and test sets, tokenize it, and prepare it for PyTorch and TensorFlow. The datasets library offers many more functions than we could cover here.