How to Fine-tune HuggingFace BERT model for Text Classification
Hugging Face's BERT model is a state-of-the-art pre-trained language model that is widely used for text classification. It learns from millions of examples and extracts features from each sentence. This Transformer-based machine learning model was introduced in 2018 by Jacob Devlin and his colleagues at Google AI Language for NLP applications. It was first introduced in this paper and first released in this repository. Before learning about the BERT model, you should have a basic understanding of the Transformer architecture.
The Transformer architecture has an encoder stack and a decoder stack, whereas BERT uses only the encoder stack. The two variants, BERT-base and BERT-large, differ in architectural complexity: the base model has 12 encoder layers, whereas the large model has 24.
Text classification is one of the most widely studied tasks in NLP. It is the process of assigning a category to a text document based on its content. For example, a given email can be classified as spam or not spam. The algorithms used for this purpose are called classifiers, and they work by extracting features from each text in order to find patterns that match the categories they have been trained on.
BERT is a Transformer encoder model that was pre-trained on a large corpus in a self-supervised way. That is, it was pre-trained on raw text only, with no human labeling, using an automatic process to generate inputs and labels from that text. More specifically, it was pre-trained with two objectives.
1. Masked Language Modeling (MLM): The model takes a sentence, randomly masks 15% of the tokens in the input, runs the entire masked sentence through the model, and has to predict the masked tokens. This differs from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and it lets the model learn a bidirectional representation of the sentence.
2. Next Sentence Prediction (NSP): During pre-training, the model concatenates two masked sentences as input. It then has to predict whether or not the two sentences followed each other in the original text.
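The two objectives can be sketched in plain Python. This is a simplified illustration, not BERT's actual pre-processing code: the helper names (mask_tokens, make_nsp_pair), the whitespace tokenization, and the toy sentences are all assumptions made for the example. The real procedure also sometimes swaps a selected token for a random token or leaves it unchanged instead of always inserting [MASK].

```python
import random

MASK = "[MASK]"

def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Replace roughly mask_prob of the tokens with [MASK].

    Returns the masked sequence and a dict {position: original token}
    holding the tokens the model would be asked to predict.
    """
    rng = rng or random.Random()
    n_to_mask = max(1, round(len(tokens) * mask_prob))
    positions = rng.sample(range(len(tokens)), n_to_mask)
    masked = list(tokens)
    targets = {}
    for pos in positions:
        targets[pos] = masked[pos]
        masked[pos] = MASK
    return masked, targets

def make_nsp_pair(sentences, index, rng=None):
    """Build one NSP example: (sentence_a, sentence_b, is_next)."""
    rng = rng or random.Random()
    sentence_a = sentences[index]
    if rng.random() < 0.5:
        # positive example: the sentence that really follows sentence_a
        return sentence_a, sentences[index + 1], True
    # negative example: any sentence other than sentence_a and its true successor
    candidates = [s for i, s in enumerate(sentences) if i not in (index, index + 1)]
    return sentence_a, rng.choice(candidates), False

rng = random.Random(0)
tokens = "the quick brown fox jumps over the lazy dog near the old river bank today".split()
masked, targets = mask_tokens(tokens, rng=rng)

# hypothetical mini-corpus, in document order
sentences = ["I went to the shop.", "I bought some milk.", "The sky is blue.", "Birds can fly."]
sentence_a, sentence_b, is_next = make_nsp_pair(sentences, 0, rng=rng)
```

With 15 input tokens, two positions are masked, and targets records what the model would have to recover at those positions.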
Different Fine-Tuning Techniques:
1. Train the entire architecture
2. Train some layers while freezing others
3. Freeze the entire architecture
In this tutorial, we will use the third technique and freeze all the layers of the BERT model during fine-tuning, so that only the new layers added on top of it are trained. If you are interested in learning more about the BERT model, you may like to read this article.
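The difference between the three techniques comes down to which parameters have requires_grad switched on. The sketch below illustrates this on a small stand-in stack of linear layers rather than on BERT itself. The build_toy_encoder and count_trainable helpers and the layer sizes are assumptions made for the example.

```python
import torch.nn as nn

def build_toy_encoder(num_layers=4, hidden=8):
    # stand-in for a pre-trained encoder: a small stack of linear layers
    return nn.Sequential(*[nn.Linear(hidden, hidden) for _ in range(num_layers)])

def count_trainable(model):
    # number of parameters that the optimizer would actually update
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# 1. Train the entire architecture: every parameter keeps requires_grad=True
full = build_toy_encoder()
trainable_full = count_trainable(full)

# 2. Train some layers while freezing others: here the first two layers are frozen
partial = build_toy_encoder()
for layer in list(partial.children())[:2]:
    for param in layer.parameters():
        param.requires_grad = False
trainable_partial = count_trainable(partial)

# 3. Freeze the entire architecture: nothing in the encoder is updated
frozen = build_toy_encoder()
for param in frozen.parameters():
    param.requires_grad = False
trainable_frozen = count_trainable(frozen)
```

With the third technique the encoder contributes no trainable parameters, so training is cheaper, but the encoder's representations are not adapted to the task.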
Fine-Tune HuggingFace BERT for Spam Classification
First, we collected some SMS messages, some of which are spam and the rest are not. Our goal is to build a system that automatically detects whether a message is spam or not. You will find the dataset that we used to train and test our model here.
[ Suggestion: use Google Colab to perform this task and activate the “GPU runtime”. ]
Install Transformers Library
We will install Hugging Face's transformers library, which lets us import a wide range of Transformer-based pre-trained models.
pip install transformers
import numpy as np
import pandas as pd
# load the SMS spam dataset
df = pd.read_csv("spamdata_v2.csv")
Import BERT Model and BERT Tokenizer
from transformers import AutoModel
# import BERT-base pretrained model
bert = AutoModel.from_pretrained('bert-base-uncased')
# Load the BERT tokenizer
Then we will try to encode a couple of sentences using this tokenizer.
# sample data
Tokenize the sentences
# get length of all the messages in the train set
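One common reason to look at message lengths is to choose a fixed sequence length to which every message is padded or truncated. The sketch below illustrates that idea in plain Python. The pad_and_mask helper, the example messages, the whitespace splitting, the one-integer-per-word ids, and the chosen max_len are all assumptions for illustration; in practice the token ids and attention masks would come from the BERT tokenizer.

```python
def pad_and_mask(token_id_lists, max_len, pad_id=0):
    """Truncate or pad each sequence to max_len and build attention masks."""
    input_ids, attention_masks = [], []
    for ids in token_id_lists:
        ids = ids[:max_len]                        # truncate long sequences
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_id] * n_pad)   # pad short sequences
        attention_masks.append([1] * len(ids) + [0] * n_pad)
    return input_ids, attention_masks

# hypothetical messages standing in for the training texts
train_text = ["free entry in a weekly prize draw", "ok see you at five", "call now"]

# length (in words) of every message, used to pick a sensible max_len
seq_len = [len(text.split()) for text in train_text]
max_len = 5

# hypothetical token ids: one integer per distinct word
vocab = {}
token_ids = [[vocab.setdefault(w, len(vocab) + 1) for w in text.split()] for text in train_text]

input_ids, attention_masks = pad_and_mask(token_ids, max_len)
```

The attention mask marks real tokens with 1 and padding with 0, so the model can ignore the padded positions.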
Define Model Architecture
We mentioned earlier in this article that we would freeze all the layers of the model before fine-tuning it.
# freeze all the parameters
for param in bert.parameters():
    param.requires_grad = False
Moving on, we will now define our model architecture.
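A minimal sketch of such an architecture might look like the following, assuming a frozen encoder that returns one 768-dimensional vector per message. The ToyEncoder stand-in, the SpamClassifier name, and the layer sizes of the head are assumptions for illustration. In practice the encoder would be the frozen bert model loaded above.

```python
import torch
import torch.nn as nn

class ToyEncoder(nn.Module):
    """Hypothetical stand-in for BERT: maps token ids to one vector per sequence."""
    def __init__(self, vocab_size=100, hidden=768):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)

    def forward(self, input_ids, attention_mask):
        # mean-pool the embeddings of the non-padding tokens
        mask = attention_mask.unsqueeze(-1).float()
        summed = (self.embed(input_ids) * mask).sum(dim=1)
        return summed / mask.sum(dim=1).clamp(min=1.0)

class SpamClassifier(nn.Module):
    """Frozen encoder followed by a small trainable classification head."""
    def __init__(self, encoder, hidden=768, num_classes=2):
        super().__init__()
        self.encoder = encoder
        # freeze all the encoder parameters
        for param in self.encoder.parameters():
            param.requires_grad = False
        self.head = nn.Sequential(
            nn.Linear(hidden, 128), nn.ReLU(), nn.Dropout(0.1), nn.Linear(128, num_classes)
        )

    def forward(self, input_ids, attention_mask):
        with torch.no_grad():  # no gradients are needed for the frozen encoder
            features = self.encoder(input_ids, attention_mask)
        return self.head(features)

model = SpamClassifier(ToyEncoder())

# hypothetical batch of two padded messages
input_ids = torch.tensor([[5, 7, 9, 0], [3, 4, 0, 0]])
attention_mask = torch.tensor([[1, 1, 1, 0], [1, 1, 0, 0]])
logits = model(input_ids, attention_mask)

# only the parameters of the head remain trainable
trainable = [name for name, p in model.named_parameters() if p.requires_grad]
```

Only the head's weights and biases appear in trainable, which is what freezing the encoder is meant to achieve.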
# set initial loss to infinite
Training Loss: 0.592
To make predictions, we will first load the best model weights that were saved during the training process.
#load weights of best model
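The bookkeeping behind these steps (starting from an infinite best loss, keeping the weights with the lowest validation loss, and reloading them before prediction) can be sketched as follows. The random toy data, the single linear layer, and the in-memory copy of the weights are assumptions made for the example. Saving the weights to a file with torch.save and reading them back with torch.load is a common alternative.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)

# hypothetical toy data standing in for the encoded train/validation messages
x_train, y_train = torch.randn(64, 16), torch.randint(0, 2, (64,))
x_valid, y_valid = torch.randn(32, 16), torch.randint(0, 2, (32,))

model = nn.Linear(16, 2)
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)

best_valid_loss = float("inf")  # set initial loss to infinite
best_state = None
valid_losses = []

for epoch in range(5):
    # one training step on the whole toy training set
    model.train()
    optimizer.zero_grad()
    loss = criterion(model(x_train), y_train)
    loss.backward()
    optimizer.step()

    # evaluate on the validation set without tracking gradients
    model.eval()
    with torch.no_grad():
        valid_loss = criterion(model(x_valid), y_valid).item()
    valid_losses.append(valid_loss)

    # keep a copy of the weights whenever the validation loss improves
    if valid_loss < best_valid_loss:
        best_valid_loss = valid_loss
        best_state = copy.deepcopy(model.state_dict())

# load weights of the best model before making predictions
model.load_state_dict(best_state)
with torch.no_grad():
    preds = model(x_valid).argmax(dim=1)
```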
In this article, you learned how to fine-tune a pre-trained BERT model to perform text classification on a given dataset. For more information, you can follow this article.
In case you are looking for more fine-tuning approaches, you can follow this article.
There are multiple approaches to fine-tune BERT for the target tasks.
1. Further pre-train the base BERT model.
2. Train the entire base BERT model.
3. Use two different models, where the base BERT model is non-trainable and the other one is trainable.
In other words, the base BERT model is half-baked and can be fully baked for the target domain (1st approach). Alternatively, we can use it as part of our custom model, with the base either trainable (2nd approach) or non-trainable (3rd approach).
The paper How to Fine-Tune BERT for Text Classification? demonstrates the 1st approach, further pre-training. For a text classification task in a specific domain, the data distribution differs from that of the general-domain corpus. That is why the authors further pre-train BERT with the masked language modeling objective on data closer to the target task. Further pre-training is performed in three ways.
1. Within-task pre-training
2. In-domain pre-training
3. Cross-domain pre-training
Transfer learning with a pre-trained model comes with a challenge: previously learned knowledge can be erased while the model learns new information. This is called catastrophic forgetting. One aim of this approach is to avoid catastrophic forgetting by fine-tuning the BERT model with different learning rates. The authors found that a lower learning rate, such as 2e-5, is necessary for BERT to overcome the problem. With an aggressive learning rate such as 4e-4, training fails to converge.
Probably for this reason, the BERT paper used 5e-5, 4e-5, 3e-5, and 2e-5 for fine-tuning.
Note that they used the uncased BERT-base model for English text classification and the Chinese BERT-base model for Chinese text classification.
They used the BERT-base model with a hidden size of 768, 12 Transformer blocks, and 12 self-attention heads. They further pre-trained BERT on 1 TITAN Xp GPU, with a batch size of 32. The dropout probability is always kept at 0.1. They used the Adam optimizer with a learning rate of 1e-4, β1 = 0.9, and β2 = 0.999, and a warm-up proportion of 0.1. They set the maximum number of epochs to 4 and saved the model that performed best on the validation set for testing.
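A warm-up proportion of 0.1 means that the learning rate rises from zero to its peak over the first 10% of the training steps. The sketch below illustrates this in plain Python. The lr_at_step helper and the total number of steps are hypothetical, and the linear decay to zero after warm-up is an assumption: it is a common choice, but the text above does not specify what happens after warm-up.

```python
def lr_at_step(step, total_steps, peak_lr=1e-4, warmup_proportion=0.1):
    """Linear warm-up to peak_lr, then linear decay to zero.

    The shape of the decay after warm-up is an assumption of this sketch.
    """
    warmup_steps = int(total_steps * warmup_proportion)
    if step < warmup_steps:
        # warm-up phase: the rate grows linearly from 0 to peak_lr
        return peak_lr * step / warmup_steps
    # decay phase: the rate shrinks linearly from peak_lr to 0
    remaining = total_steps - step
    return peak_lr * remaining / (total_steps - warmup_steps)

total_steps = 1000  # hypothetical number of optimizer steps
schedule = [lr_at_step(step, total_steps) for step in range(total_steps + 1)]
```

With these values the rate peaks at step 100 and returns to zero at step 1000. Starting with small steps in this way is one of the measures that helps keep early updates from disturbing the pre-trained weights too strongly.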
Hugging Face takes the 2nd approach in Fine-tuning with native PyTorch/TensorFlow. There they show you how to fine-tune Transformer models for downstream tasks. For training with PyTorch and TensorFlow, you have to use the Datasets library to load and preprocess the dataset. Before training, make sure you have installed the Datasets library. Here you will find the installation process. They demonstrate fine-tuning on the following three tasks.
1. Sequence classification with IMDb reviews
Sequence classification refers to the task of classifying sequences of text according to a given number of classes. Here you can learn how to fine-tune a model on the IMDb dataset to determine whether a review is positive or negative. The dataset is loaded as a DatasetDict object. Then a tokenizer is loaded so that the text is tokenized appropriately, and a preprocessing function is applied to the dataset to produce the tokenized dataset (tokenized_imdb).
The model is then loaded with the AutoModelForSequenceClassification class (PyTorch) or the TFAutoModelForSequenceClassification class (TensorFlow), along with the number of expected labels. In TensorFlow, the model is then compiled and fine-tuned with “model.fit”.
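As a rough sketch of the PyTorch side, the snippet below builds a sequence classification model with two labels and runs it on a small batch. To stay self-contained and avoid downloading weights, it assumes a tiny, randomly initialized DistilBERT configuration whose sizes are arbitrary, and a hand-written batch of token ids. In a real run the model would instead be loaded with AutoModelForSequenceClassification.from_pretrained and a checkpoint name, and the inputs would come from the tokenizer.

```python
import torch
from transformers import AutoModelForSequenceClassification, DistilBertConfig

# tiny, randomly initialized configuration (assumed sizes, for illustration only)
config = DistilBertConfig(
    vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=2
)
model = AutoModelForSequenceClassification.from_config(config)
model.eval()

# hypothetical batch of two already-tokenized reviews, padded with id 0
input_ids = torch.tensor([[2, 15, 27, 3, 0], [2, 41, 8, 19, 3]])
attention_mask = torch.tensor([[1, 1, 1, 1, 0], [1, 1, 1, 1, 1]])

with torch.no_grad():
    logits = model(input_ids=input_ids, attention_mask=attention_mask).logits

# one class index (0 or 1) per review
predicted = logits.argmax(dim=-1)
```

Because the weights are random, the predictions here are meaningless. The point is only that the number of expected labels determines the width of the output layer.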
2. Token classification with WNUT emerging entities
Token classification refers to the task of classifying individual tokens in a sentence. Named Entity Recognition (NER), the most common token classification task, attempts to find a label for each entity in a sentence. Here you can learn how to fine-tune a model on the WNUT17 dataset to detect new entities. The dataset is loaded into the “wnut” object. Then the DistilBERT tokenizer is loaded with AutoTokenizer, and a tokenization function is created for preprocessing the datasets.
The model is then loaded with the AutoModelForTokenClassification class (PyTorch) or the TFAutoModelForTokenClassification class (TensorFlow), along with the number of expected labels. In TensorFlow, the model is then compiled and fine-tuned with “model.fit”.
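To make the idea of one label per token concrete, the sketch below groups per-token tags into entities. It assumes BIO-style tags (B- for the first token of an entity, I- for a continuation, O for no entity). The extract_entities helper, the example sentence, and its tags are hypothetical.

```python
def extract_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_text, entity_type) pairs."""
    entities, current, current_type = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # a new entity starts; close the previous one if there is any
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == current_type:
            # continuation of the entity that is currently open
            current.append(token)
        else:
            # "O" tag (or an inconsistent "I-" tag): close any open entity
            if current:
                entities.append((" ".join(current), current_type))
            current, current_type = [], None
    if current:
        entities.append((" ".join(current), current_type))
    return entities

# hypothetical sentence with hypothetical per-token predictions
tokens = ["Alice", "Smith", "works", "at", "Acme", "."]
tags = ["B-person", "I-person", "O", "O", "B-corporation", "O"]
entities = extract_entities(tokens, tags)
```

Here the six token-level labels collapse into two entities, a person and a corporation.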
3. Question Answering with SQuAD
There are various types of question answering (QA) tasks, but extractive QA focuses on extracting the answer to a question from a given context. Here you can learn how to fine-tune a model on the SQuAD dataset. The dataset is loaded into the “squad” object. Then the DistilBERT tokenizer is loaded with AutoTokenizer, and a tokenization function is created for preprocessing the datasets.
The model is then loaded with the AutoModelForQuestionAnswering class (PyTorch) or the TFAutoModelForQuestionAnswering class (TensorFlow). In TensorFlow, the model is then compiled and fine-tuned with “model.fit”.
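An extractive QA model typically outputs a start score and an end score for every token of the context, and the answer is the span with the best combined score. A minimal sketch of that decoding step is shown below. The best_span helper, the context, and the scores are hypothetical and only illustrate the idea.

```python
def best_span(start_scores, end_scores, max_answer_len=5):
    """Return (start, end) maximizing start_scores[start] + end_scores[end], with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, s_score in enumerate(start_scores):
        # only consider spans of at most max_answer_len tokens
        for end in range(start, min(start + max_answer_len, len(end_scores))):
            score = s_score + end_scores[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best

# hypothetical context for the question "When was BERT introduced?"
context = "BERT was introduced in 2018 by researchers at Google".split()

# hypothetical per-token scores, as a QA model might produce them
start_scores = [0.1, 0.0, 0.2, 0.3, 4.0, 0.1, 0.2, 0.0, 0.5]
end_scores = [0.0, 0.1, 0.1, 0.2, 3.5, 0.3, 0.1, 0.2, 0.9]

start, end = best_span(start_scores, end_scores)
answer = " ".join(context[start:end + 1])
```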
A Visual Guide to Using BERT for the First Time takes the 3rd approach. It uses a pre-trained deep learning model to process the data and then uses the output of that model to classify the data. The data is a list of sentences from film reviews, and each sentence is classified as either “positive” or “negative”. Here you can understand how a variant of the BERT model is used to classify sentences.
The guide uses the SST2 dataset, which contains sentences from movie reviews, each labeled as either positive (value 1) or negative (value 0). The goal is to create a model that takes a sentence and produces either 1 (a positive sentiment) or 0 (a negative sentiment). For this, two different models are combined.
1. DistilBERT: This model processes the sentence and passes along some of the information it extracted to the next model.
2. Logistic Regression: This model will take the result of DistilBERT’s processing, and classify the result as either positive or negative (1 or 0).
The data passed between the two models is a vector of size 768.
DistilBERT is a pre-trained model, which is why only the logistic regression model is trained. The transformers library provides the implementation of DistilBERT as well as pre-trained versions of the model. The pre-trained DistilBERT is first used to generate sentence vectors for 2,000 sentences. Then the logistic regression model is trained on those vectors.
For preprocessing the dataset, they used the BERT tokenizer to split the sentences into tokens and added the special tokens needed for sentence classification. After that, the outputs of DistilBERT are stored in “last_hidden_states”.
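The second half of this pipeline, a logistic regression trained on fixed 768-dimensional sentence vectors, can be sketched with NumPy. The features below are random, hypothetical stand-ins for DistilBERT's outputs, 200 sentences are used instead of 2,000 to keep the example fast, and the train_logistic_regression helper is a plain gradient-descent implementation written for this sketch. In practice a library implementation such as scikit-learn's LogisticRegression would normally be used.

```python
import numpy as np

rng = np.random.default_rng(0)
n_sentences, hidden = 200, 768

# hypothetical sentence vectors standing in for DistilBERT's outputs:
# random noise plus a class-dependent shift so that the classes are separable
labels = rng.integers(0, 2, size=n_sentences)
direction = rng.normal(size=hidden)
features = rng.normal(size=(n_sentences, hidden)) + 0.5 * np.outer(2 * labels - 1, direction)

def train_logistic_regression(x, y, lr=0.1, epochs=200):
    """Fit weights w and bias b by gradient descent on the logistic loss."""
    w, b = np.zeros(x.shape[1]), 0.0
    for _ in range(epochs):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))  # predicted probability of class 1
        grad = p - y                            # gradient of the loss w.r.t. the logits
        w -= lr * (x.T @ grad) / len(y)
        b -= lr * grad.mean()
    return w, b

# train on the first 150 vectors, evaluate on the remaining 50
w, b = train_logistic_regression(features[:150], labels[:150])
preds = (features[150:] @ w + b > 0).astype(int)
accuracy = (preds == labels[150:]).mean()
```

Only w and b are learned here. The sentence vectors are treated as fixed inputs, which mirrors keeping DistilBERT non-trainable.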
Thank you for reading this article. If you have any questions, please comment below.