
Build Multi-Class Text Classification Models with RNN and LSTM

How often do you come across large volumes of text data and pause to wonder whether there is an easier way to make sense of it? That is where this project comes in: it uses RNN and LSTM networks to intelligently classify real-life customer complaints. Whether the issue is identity theft or a credit card dispute, the project takes raw complaint text and turns it into a practical classification problem.

Project Overview

The purpose of this project is to explore text classification end to end using Python, PyTorch, and Natural Language Processing. It starts with the dataset itself: cleaning, formatting, and vectorizing the raw complaint text into GloVe embeddings. The tokenized data is then used to build RNN and LSTM neural networks that are trained to predict the category of each complaint. The resulting models are evaluated with accuracy, confusion matrices, and other metrics.

Step by step, and with working code, the project demonstrates how to normalize text, create embeddings, and build classification models. It is a practical way to understand natural language processing, deep learning, and the use of AI for problem-solving. Ideal for programmers, data analysts, and machine learning enthusiasts!

Prerequisites

Before commencing this project, ensure that you have the following skills and tools:

  • Python: You need to know how to program in Python and use libraries such as NumPy and Pandas.
  • Natural Language Processing Basics: You should be familiar with how tokenization works, what embeddings are, and basic text preprocessing techniques.
  • Knowledge of PyTorch: Knowing how to use PyTorch in the creation and training of neural networks is a must.
  • Machine Learning: Familiarity with classification tasks, loss functions, and accuracy metrics.
  • GloVe Embeddings: An understanding of how and why word embeddings are useful for representing text data.
  • Tools Installed: Confirm that NLTK, Scikit-learn, Matplotlib, and Seaborn libraries are present in your Python environment.

Provided all these prerequisites are in place, you are good to go! Let’s begin with the art of text classification!

Approach

This project carries out text classification with deep learning models, namely an RNN and an LSTM, in an organized manner. It begins with data preprocessing: the raw text is cleaned of unwanted characters such as digits, special characters, and excess whitespace. Each text fragment is tokenized and processed into tensors of a fixed size. These tokens are then converted from words into dense numeric vectors using GloVe embeddings, which helps preserve the semantic relationships among the words.

The prepared data is split into three sets: training, validation, and testing. To feed the data into the models efficiently, custom datasets and data loaders are created with the PyTorch framework. Recurrent Neural Network (RNN) and Long Short-Term Memory (LSTM) models are then defined. The models are trained with cross-entropy loss and the Adam optimizer, and are validated during training to monitor overfitting.
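As a rough illustration of what such a model can look like, here is a minimal sketch of an LSTM-based classifier in PyTorch. The class name LSTMClassifier, its constructor arguments, and the assumption that embeddings holds the GloVe matrix built later in the project are illustrative placeholders, not the exact implementation used in this course.

import torch
import torch.nn as nn

class LSTMClassifier(nn.Module):
    """Minimal sketch: GloVe embeddings -> LSTM -> linear layer over classes."""
    def __init__(self, embeddings, hidden_size, num_classes):
        super().__init__()
        # load the pre-trained GloVe vectors into a frozen embedding layer
        self.embedding = nn.Embedding.from_pretrained(torch.tensor(embeddings), freeze=True)
        self.lstm = nn.LSTM(embeddings.shape[1], hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        # x holds token indices of shape (batch, seq_len)
        emb = self.embedding(x)            # (batch, seq_len, emb_dim)
        _, (hidden, _) = self.lstm(emb)    # hidden: (1, batch, hidden_size)
        return self.fc(hidden[-1])         # (batch, num_classes)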

Once training is complete, the models are evaluated on test data they have never seen. Performance is measured with metrics such as accuracy, confusion matrices, and classification reports. The methodology is straightforward: simple but powerful text processing techniques combined with deep learning approaches to produce an effective text classification strategy.

Workflow and Methodology

Workflow

  • First load and collect the dataset.
  • Preprocess the data by removing unnecessary characters, handling missing values, and reorganizing the existing information where needed.
  • Perform tokenization of the textual information and encode it into a vectorized form by employing GloVe.
  • Organize the data into training, validation, and test sets to facilitate the evaluation of the model.
  • Train RNN and LSTM networks with the PyTorch framework, using cross-entropy loss and the Adam optimizer (see the training-loop sketch after this list).
  • Evaluate how well the trained models classify the data using accuracy, the confusion matrix, and the classification report.
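The sketch below shows one possible shape of that training loop. It is illustrative only: model, train_loader, val_loader, num_epochs, and lr are assumed to be defined elsewhere, and the exact loop used in the project may differ.

import torch
import torch.nn as nn

# assumed to exist: model, train_loader, val_loader, num_epochs, lr
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=lr)

for epoch in range(num_epochs):
    model.train()
    for batch_x, batch_y in train_loader:
        optimizer.zero_grad()
        outputs = model(batch_x)            # (batch, num_classes)
        loss = criterion(outputs, batch_y)  # cross-entropy loss
        loss.backward()
        optimizer.step()

    # validation pass to monitor overfitting
    model.eval()
    with torch.no_grad():
        val_loss = sum(criterion(model(x), y).item() for x, y in val_loader)
    print(f"Epoch {epoch + 1}: validation loss = {val_loss:.4f}")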

Methodology

  • Make use of the GloVe embeddings to represent text data in an appropriate form.
  • Utilize RNN and LSTM networks to model the sequential patterns of complaint narratives.
  • Improve the performance of the networks by minimizing the loss function and monitoring performance on the validation set.
  • Keep the network that has the best performance on the validation data set so that the performance on the test set is as good as possible.
  • Evaluate the test predictions with the help of confusion matrices and detailed classification reports (a small evaluation sketch follows this list).
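As a hint of what that evaluation can look like, here is a minimal sketch using scikit-learn, assuming y_true holds the test labels and y_pred the model's predictions:

from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# y_true: true test labels, y_pred: predicted labels (both assumed available)
print("Accuracy:", accuracy_score(y_true, y_pred))
print(confusion_matrix(y_true, y_pred))
print(classification_report(y_true, y_pred))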

Data Collection and Preparation

Data Collection:

In this project, we collected the dataset from a public repository. If you are looking to work on a real-world problem, you can get these kinds of datasets from publicly available repositories such as Kaggle, UCI Machine Learning Repository, or company-specific data. We will provide the dataset in this project so that you can work on the same dataset.

Data Preparation Workflow:

  • Import the dataset and preprocess it, including noise removal and handling of empty or missing values.
  • Perform text tokenization, padding or truncation, and mapping of tokens into GloVe embedding vectors (see the tokenization sketch after this list).
  • Split the data into training, validation, and test sets for model assessment.
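The sketch below illustrates one way the tokenization and padding step can look. The helper name encode_text, the lookup table word2idx, and the fixed length of 50 tokens are illustrative assumptions rather than the exact code used later in the project.

from nltk.tokenize import word_tokenize

# assumed: vocabulary is a list with "<pad>" at index 0 and "<unk>" at index 1
word2idx = {word: idx for idx, word in enumerate(vocabulary)}

def encode_text(text, max_len=50):
    """Tokenize, map tokens to vocabulary indices, then pad/truncate to max_len."""
    tokens = word_tokenize(text.lower())
    ids = [word2idx.get(tok, word2idx["<unk>"]) for tok in tokens]
    ids = ids[:max_len]                                 # truncate long texts
    ids += [word2idx["<pad>"]] * (max_len - len(ids))   # pad short texts
    return ids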

Code explanation

Here’s what is happening under the hood. Let’s go through it step by step:

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Installing Required Libraries

This code installs a few important libraries: NLTK for natural language processing, NumPy for working with arrays, Pandas for handling tabular data, PyTorch for deep learning, TQDM for easy-to-use progress bars, and Scikit-learn for general machine learning utilities.

# install the required packages
!pip install nltk
!pip install numpy
!pip install pandas
!pip install torch
!pip install tqdm
!pip install scikit_learn

Importing Libraries and Handling Warnings

This code imports the libraries required for natural language processing, building and training the models, and visualizing the results, including NLTK, PyTorch, Scikit-learn, Matplotlib, and Seaborn. Specific warnings are also suppressed to keep the output clean and organized.

import re
import pickle
import warnings

import nltk
import torch
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from tqdm import tqdm
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score
from sklearn.exceptions import UndefinedMetricWarning

nltk.download('punkt')

# Suppress specific warnings
warnings.filterwarnings("ignore", category=UserWarning)
warnings.filterwarnings("ignore", category=FutureWarning)
warnings.filterwarnings("ignore", category=UndefinedMetricWarning)

Defining Configuration and File Paths

This code snippet initializes important configurations such as learning rate, input dimensions, and model files used for training and testing. It also defines directories for data, models, tokenizers, embeddings, and product mapping for uniform label encoding.

# define configuration file paths
lr = 0.0001
input_size = 50
num_epochs = 50
hidden_size = 50
label_col = "Product"
text_col_name = "Consumer complaint narrative"
tokens_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/tokens.pkl"
labels_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/labels.pkl"
data_path = "/content/drive/MyDrive/New 90 Projects/Project_11/Data/complaints.csv"
rnn_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_rnn.pth"
lstm_model_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/model_lstm.pth"
vocabulary_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/vocabulary.pkl"
embeddings_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/embeddings.pkl"
glove_vector_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/glove.6B.50d.txt"
label_encoder_path = "/content/drive/MyDrive/New 90 Projects/Project_11/model/label_encoder.pkl"

product_map = {
    'Vehicle loan or lease': 'vehicle_loan',
    'Credit reporting, credit repair services, or other personal consumer reports': 'credit_report',
    'Credit card or prepaid card': 'card',
    'Money transfer, virtual currency, or money service': 'money_transfer',
    'virtual currency': 'money_transfer',
    'Mortgage': 'mortgage',
    'Payday loan, title loan, or personal loan': 'loan',
    'Debt collection': 'debt_collection',
    'Checking or savings account': 'savings_account',
    'Credit card': 'card',
    'Bank account or service': 'savings_account',
    'Credit reporting': 'credit_report',
    'Prepaid card': 'card',
    'Payday loan': 'loan',
    'Other financial service': 'others',
    'Virtual currency': 'money_transfer',
    'Student loan': 'loan',
    'Consumer Loan': 'loan',
    'Money transfers': 'money_transfer'
}

Functions to Save and Load a File

This code introduces two helper functions: save_file, which saves an object as a pickle file, and load_file, which loads a pickled object back from disk.

# define function for saving a file
def save_file(name, obj):
    """
    Function to save an object as pickle file
    """
    with open(name, 'wb') as f:
        pickle.dump(obj, f)

# define function for loading a file
def load_file(name):
    """
    Function to load a pickle object
    """
    return pickle.load(open(name, "rb"))
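As a quick usage illustration (the example object and the temporary path below are hypothetical):

# save a small example object, then load it back
example = {"greeting": "hello"}
save_file("/content/example.pkl", example)
print(load_file("/content/example.pkl"))  # {'greeting': 'hello'}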

STEP 2:

Reading GloVe Embeddings

This code reads the GloVe embeddings file and calculates the total number of unique words.

# open the glove embeddings file and read
with open(glove_vector_path, "rt") as f:
    emb = f.readlines()

# 400000 unique words are there in the embeddings (length of embeddings)
len(emb)

Retrieval of the First Entry in the GloVe Embedding Dataset

This piece of code displays the first entry of the GloVe embeddings file: the first word followed by its vector values.

# check first record
emb[0]

Extracting the Initial Word from GloVe Embeddings

The following code separates the initial word present in the first entry of the GloVe embeddings from its respective vector values.

# split the first record and check for vocabulary
emb[0].split()[0]

Extraction of Embedding Values of the First Word

This code snippet extracts the vector values (the embedding) corresponding to the first word in the first line of the GloVe embeddings file.

# split the first record and check for embeddings
emb[0].split()[1:]

Forming a Vocabulary and Creating Embeddings Array

This code builds a word vocabulary and the associated embeddings matrix from the GloVe file. The embeddings are converted into a float32 NumPy array whose shape reflects the number of words and their vector size.

vocabulary, embeddings = [], []
for item in emb:
    vocabulary.append(item.split()[0])
    embeddings.append(item.split()[1:])

# Convert embeddings to numpy float array
embeddings = np.array(embeddings, dtype=np.float32)
embeddings.shape

Displaying the First 10 Words in the Vocabulary

This code displays the first ten words of the vocabulary built from the GloVe embeddings.

vocabulary[:10]

Updating Vocabulary and Embeddings

This code adds the special tokens <pad> and <unk> to the front of the vocabulary and attaches corresponding vectors: a vector of all ones for <pad> and the mean of the existing embeddings for <unk>. It then updates the embeddings matrix accordingly and prints the new vocabulary size and embeddings shape.

vocabulary = ["<pad>", "<unk>"] + vocabulary
embeddings = np.vstack([np.ones(50, dtype=np.float32), np.mean(embeddings, axis=0),
                        embeddings])
print(len(vocabulary), embeddings.shape)

Saving Embeddings and Vocabulary

The code saves the revised embeddings matrix and the vocabulary list to pickle files at the configured paths for later use.

save_file(embeddings_path, embeddings)
save_file(vocabulary_path, vocabulary)

STEP 3:

Importing and Processing Data

The code loads the dataset from its CSV file, drops rows with null values in the text column, and normalizes the target column by consolidating duplicate labels with the predefined product map.

#Read the data file
Aionlinecourse_data = pd.read_csv(data_path)
# Drop rows where the text column is empty
Aionlinecourse_data.dropna(subset=[text_col_name], inplace=True)
#Replace duplicate labels
Aionlinecourse_data.replace({label_col: product_map}, inplace=True)

Label Encoding

The code fits a LabelEncoder on the dataset's target labels, encodes them as integers, and displays the numeric representation of the first label.

label_encoder = LabelEncoder()
label_encoder.fit(Aionlinecourse_data[label_col])
labels = label_encoder.transform(Aionlinecourse_data[label_col])
labels[0]
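If needed, the encoded integers can be mapped back to their original class names; a small illustrative example:

# map the first encoded label back to its original class name (illustrative)
print(label_encoder.inverse_transform([labels[0]]))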

Retrieving Label Classes

This code displays all the unique label classes that the LabelEncoder has learned from the dataset.

label_encoder.classes_

Observing the Target Column

The following code displays the values of the target column (label_col) as they appear before label encoding.

Aionlinecourse_data[label_col]

Saving Labels and Label Encoder

The code saves the numeric labels and the trained LabelEncoder as pickle files for later use in the project.

save_file(labels_path, labels)
save_file(label_encoder_path, label_encoder)

Processing Text Input

This code converts all text in the complaint column to lowercase so that the entries are treated consistently during processing. A progress bar shows how far the conversion has progressed.

input_text = Aionlinecourse_data[text_col_name]
# Convert text to lower case
input_text = [i.lower() for i in tqdm(input_text)]

Eliminating Special Characters in Text

The script removes special characters, punctuation (except apostrophes), and symbols from the text, keeping only words, digits, apostrophes, and whitespace. This ensures cleaner input for the subsequent steps.

# Remove punctuations except apostrophe
input_text = [re.sub(r"[^\w\d'\s]+", " ", i) for i in tqdm(input_text)]

Removing Numbers From Text

The purpose of this code is to remove any numeric digits from the text so that the analysis can concentrate on the words alone, which makes the data cleaner for the steps that follow.
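The corresponding code is not shown above; a minimal sketch, assuming the same list-comprehension and regex pattern as the previous step, could look like this:

# Remove numeric digits from each text entry (illustrative sketch)
input_text = [re.sub(r"\d+", " ", i) for i in tqdm(input_text)]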