ALBERT - A Lite BERT for Self-Supervised Learning
Last Updated: 27 Jan, 2022
BERT was proposed by researchers at Google AI in 2018. It transformed NLP much as AlexNet transformed computer vision in 2012, by allowing models to leverage the large amounts of freely available text for pre-training in a self-supervised way.
ALBERT was proposed by researchers at Google Research in 2019. The goal of the paper was to improve the training and results of the BERT architecture through techniques such as factorization of the embedding matrix, cross-layer parameter sharing, and an inter-sentence coherence loss.
Model architecture:
The backbone of the ALBERT architecture is the same as BERT's: a stack of Transformer encoder layers with the GELU (Gaussian Error Linear Unit) activation function. The three main changes that are present in ALBERT but not in BERT are described below.
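As a brief aside before those changes, here is a minimal sketch of the GELU activation used in these encoder layers, written in plain Python with the exact erf-based definition; the actual TensorFlow implementation in the BERT/ALBERT code may use a tanh approximation.
Python3
import math

def gelu(x: float) -> float:
    # Exact GELU: x * Phi(x), where Phi is the standard normal CDF.
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

# GELU is ~0 for large negative inputs and ~x for large positive inputs.
print([round(gelu(x), 4) for x in (-3.0, -1.0, 0.0, 1.0, 3.0)])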
- Factorization of the embedding matrix: In BERT and its successors such as XLNet and RoBERTa, the input (WordPiece) embedding size and the hidden-layer size are tied together. ALBERT instead decouples the two by factorizing the large vocabulary embedding matrix into two smaller matrices. The reasoning is that the input-level embedding (of size E) only needs to capture context-independent information about a token, whereas the hidden-level representation (of size H) must capture context-dependent information, so E can be much smaller than H. This factorization reduces the embedding parameters by roughly 80% with only a minor drop in performance compared to BERT (the parameter-count sketch after the model-size table below makes this concrete).
- Cross-layer parameter sharing: The authors also propose sharing parameters across the layers of the model to improve parameter efficiency and reduce redundancy. In earlier models such as BERT, XLNet, and RoBERTa, the encoder layers are simply stacked on top of one another, each with its own parameters, even though the layers tend to learn similar operations. The paper considers three ways of sharing parameters:
- Share only the feed-forward network parameters
- Share only the attention parameters
- Share all parameters (the default setting used by the authors unless stated otherwise)
Sharing all parameters leads to roughly a 70% reduction in the overall number of parameters.
- Inter-sentence coherence prediction: Like BERT, ALBERT uses a masked language modeling (MLM) loss during pre-training. However, instead of the NSP (Next Sentence Prediction) loss, ALBERT uses a new loss called SOP (Sentence Order Prediction). NSP is a binary classification loss that predicts whether two segments appear consecutively in the original text; its drawback is that a model can often solve it from topic cues alone rather than genuine coherence, because the negative segment comes from a different document. SOP keeps both segments from the same document and only swaps their order for negative examples, so the model must learn inter-sentence coherence (see the sketch after this list).
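The contrast between NSP and SOP can be made concrete with a small, purely illustrative sketch. The example documents and the use of Python's random module below are stand-ins for the real pre-training data pipeline, not the actual ALBERT preprocessing code.
Python3
import random

def make_nsp_pair(doc, other_doc, i):
    """NSP: negative pairs take the second segment from a *different* document,
    so topic alone often gives the label away."""
    if random.random() < 0.5:
        return (doc[i], doc[i + 1], 1)              # consecutive -> positive
    return (doc[i], random.choice(other_doc), 0)    # segment from another doc -> negative

def make_sop_pair(doc, i):
    """SOP: both segments always come from the same document; negatives simply
    swap the order, so only coherence (not topic) separates the classes."""
    if random.random() < 0.5:
        return (doc[i], doc[i + 1], 1)              # original order -> positive
    return (doc[i + 1], doc[i], 0)                  # swapped order -> negative

doc_a = ["He went to the store.", "He bought a gallon of milk.", "Then he walked home."]
doc_b = ["The eclipse lasted four minutes.", "Astronomers gathered in Texas."]

print(make_nsp_pair(doc_a, doc_b, 0))
print(make_sop_pair(doc_a, 0))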
ALBERT was released in four different model sizes:
| Model | Size | Parameters | Encoder Layers (L) | Embedding (E) | Hidden units (H) |
|---|---|---|---|---|---|
| BERT | Base | 108 M | 12 | 768 | 768 |
| BERT | Large | 334 M | 24 | 1024 | 1024 |
| ALBERT | Base | 12 M | 12 | 128 | 768 |
| ALBERT | Large | 18 M | 24 | 128 | 1024 |
| ALBERT | X-Large | 60 M | 24 | 128 | 2048 |
| ALBERT | XX-Large | 235 M | 12 | 128 | 4096 |
As the table shows, the ALBERT models have far fewer parameters than the corresponding BERT models because of the architectural changes described above. For example, BERT-base has about 9x more parameters than ALBERT-base, and BERT-large has about 18x more parameters than ALBERT-large.
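A rough back-of-the-envelope calculation shows where the factorization and parameter-sharing savings come from. This is a simplified sketch: it assumes a 30K WordPiece vocabulary and ignores biases, layer norms, the pooler, and position/segment embeddings, so the totals only approximate the figures in the table.
Python3
# Back-of-the-envelope parameter counts (simplified; see the note above).
V = 30_000   # vocabulary size (assumed)
L = 12       # number of encoder layers in the base configuration

def embedding_params(E, H, factorized):
    # BERT ties E = H (a single V x H matrix); ALBERT factorizes it into V x E and E x H.
    return V * E + E * H if factorized else V * H

def encoder_params(H, num_layers, shared):
    # Per layer: ~4*H*H for the attention projections plus ~8*H*H for the
    # feed-forward network (H -> 4H -> H). Sharing reuses one layer's weights.
    per_layer = 4 * H * H + 8 * H * H
    return per_layer if shared else per_layer * num_layers

bert_base = embedding_params(768, 768, factorized=False) + encoder_params(768, L, shared=False)
albert_base = embedding_params(128, 768, factorized=True) + encoder_params(768, L, shared=True)

print(f"BERT-base   ~ {bert_base / 1e6:.0f}M parameters")    # ~108M, matching the table
print(f"ALBERT-base ~ {albert_base / 1e6:.0f}M parameters")  # ~11M, close to the 12M in the table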
Dataset used:
Like BERT, ALBERT is pre-trained on the English Wikipedia and BookCorpus datasets, which together contain about 16 GB of uncompressed text.
Implementation:
- In this implementation, we will use a pre-trained ALBERT model from TF-Hub together with the ALBERT GitHub repository, and fine-tune it on the Microsoft Research Paraphrase Corpus (MRPC) task from the GLUE benchmark.
Python3
# Clone ALBERT Repo
! git clone https://github.com/google-research/albert
# Install Requirements of ALBERT
! pip install -r albert/requirements.txt
# clone GLUE repo into a folder
! test -d glue_repo || git clone https://gist.github.com/60c2bdb54d156a41194446737ce03e2e.git glue_repo
# Download MRPC dataset
!python glue_repo/download_glue_data.py --data_dir=/content/MRPC --tasks='MRPC'
# Describe the URL of TFhub ALBERT BASE model
ALBERT_MODEL_HUB = 'https://tfhub.dev/google/albert_base/3'
# Fine Tune ALBERT classifier on MRPC dataset
# To select the best hyperparameters for any GLUE
# benchmark task, look at run_glue.sh in the ALBERT repo
!python -m albert.run_classifier \
--data_dir=MRPC/ \
--output_dir=output/ \
--albert_hub_module_handle=$ALBERT_MODEL_HUB \
--spm_model_file="from_tf_hub" \
--do_train=True \
--do_eval=True \
--do_predict=True \
--max_seq_length=512 \
--optimizer=adamw \
--task_name=MRPC \
--warmup_step=200 \
--learning_rate=2e-5 \
--train_step=800 \
--save_checkpoints_steps=100 \
--train_batch_size=32
Results & Conclusion:
Despite having far fewer parameters, ALBERT achieves state-of-the-art results on many NLP tasks. Below are the results of ALBERT on the GLUE benchmark datasets.
ALBERT results compared to other models on the GLUE benchmark.
Below are the results of the ALBERT-xxlarge model on the SQuAD and RACE benchmark datasets.
Here, ALBERT (1M) denotes the model trained for 1M steps, whereas ALBERT (1.5M) denotes the model trained for 1.5M steps.
The authors have since released a new version of ALBERT (V2), which improves the average score of the Base, Large, and X-Large models compared to V1.
| Version | Size | Average Score |
|---|---|---|
| ALBERT V2 | Base | 82.3 |
| ALBERT V2 | Large | 85.7 |
| ALBERT V2 | X-Large | 87.9 |
| ALBERT V2 | XX-Large | 90.9 |
| ALBERT V1 | Base | 80.1 |
| ALBERT V1 | Large | 82.4 |
| ALBERT V1 | X-Large | 85.5 |
| ALBERT V1 | XX-Large | 91.0 |
References:
- Lan et al., "ALBERT: A Lite BERT for Self-supervised Learning of Language Representations", arXiv:1909.11942 (2019)
- Devlin et al., "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding", arXiv:1810.04805 (2018)
- ALBERT GitHub repository: https://github.com/google-research/albert