Project Overview
This project explains the Skip-Gram model and explores how it can be used to capture the meanings and relationships of words as numerical vectors called word embeddings. The procedure begins by cleaning and preprocessing the input text to remove irrelevant data and make it ready for analysis. We then build a vocabulary and train a neural network in which the model attempts to predict the context words surrounding a given center word.
The generated embeddings are then saved for later use, and their effectiveness is assessed by searching for similar terms and measuring the distances between word pairs. To visualize the embeddings, we use t-SNE, a dimensionality reduction technique that projects them into two dimensions. This visual representation helps us see how closely related words are positioned to one another.
The focus is on practical tasks, such as keyword search and exploring datasets through the structure of the word space. By combining machine learning with visualization techniques, the project provides insight into how words in a text relate to one another, which makes it useful for many NLP tasks.
Prerequisites
You should have a few skills in place before undertaking this project. Here’s what you should ideally know:
- Python version 3.7 or higher installed on your system.
- Basic knowledge of Python for data analysis and manipulation.
- Knowledge of libraries such as NLTK, Scikit-learn, Pandas, NumPy, and Matplotlib is necessary.
- Jupyter Notebook, VS Code, or another Python-compatible IDE.
- You must have experience with the PyTorch framework.
- Familiarity with text preprocessing techniques is essential.
Approach
The project started with gathering and cleaning the text data, removing noise by normalizing and tokenizing the text. A vocabulary was then created by indexing each word, and word counts were computed to support negative sampling. Using this vocabulary, positive center-context word pairs were generated within a given window, and negative samples were drawn to improve the model's learning.
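As a rough illustration, the sketch below shows how a vocabulary with word indices, positive center-context pairs, and negative samples might be built. The `tokens` list, window size, and number of negative samples are placeholder assumptions, not values from the project.

```python
# Sketch: vocabulary construction and (positive, negative) sample generation.
# `tokens`, the window size, and the negative-sample count are illustrative.
import random
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]  # placeholder corpus

counts = Counter(tokens)
word2idx = {w: i for i, (w, _) in enumerate(counts.most_common())}
idx2word = {i: w for w, i in word2idx.items()}
encoded = [word2idx[w] for w in tokens]

def positive_pairs(ids, window=2):
    """Yield (center, context) index pairs within a sliding window."""
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, ids[j]

def negative_samples(vocab_size, k=5):
    """Draw k random word indices as negative samples (uniform here for
    simplicity; the unigram^0.75 distribution is the usual choice)."""
    return [random.randrange(vocab_size) for _ in range(k)]

pairs = list(positive_pairs(encoded, window=2))
negs = [negative_samples(len(word2idx), k=5) for _ in pairs]
```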
The Skip-Gram model was implemented in PyTorch, with embedding layers and weight matrices configured to capture semantic relations. The model was trained in batches with the Adam optimizer, and the loss was monitored to confirm that training was progressing. After training, the embedding weights were exported for further use. To make the embeddings interpretable, t-SNE was applied to reduce their dimensionality, and the word embeddings were plotted to show clusters and how words relate to each other semantically. Distance and similarity measurements were then computed on the embeddings to assess how well they represent word relationships.
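The sketch below is a minimal PyTorch version of such a Skip-Gram model with a negative-sampling loss and a single Adam training step. The vocabulary size, embedding dimension, batch shapes, and file name are illustrative assumptions rather than the project's actual configuration.

```python
# Sketch: Skip-Gram with negative sampling in PyTorch (illustrative dimensions).
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.center_emb = nn.Embedding(vocab_size, embed_dim)   # input (center) vectors
        self.context_emb = nn.Embedding(vocab_size, embed_dim)  # output (context) vectors

    def forward(self, center, pos_context, neg_context):
        c = self.center_emb(center)                             # (B, D)
        pos = self.context_emb(pos_context)                     # (B, D)
        neg = self.context_emb(neg_context)                     # (B, K, D)
        pos_score = torch.sum(c * pos, dim=1)                   # (B,)
        neg_score = torch.bmm(neg, c.unsqueeze(2)).squeeze(2)   # (B, K)
        # Negative-sampling loss: reward positive pairs, penalize sampled negatives.
        return -(torch.log(torch.sigmoid(pos_score) + 1e-9).mean()
                 + torch.log(torch.sigmoid(-neg_score) + 1e-9).mean())

vocab_size = 5000  # placeholder
model = SkipGram(vocab_size, embed_dim=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a random batch.
center = torch.randint(0, vocab_size, (32,))
pos = torch.randint(0, vocab_size, (32,))
neg = torch.randint(0, vocab_size, (32, 5))
optimizer.zero_grad()
loss = model(center, pos, neg)
loss.backward()
optimizer.step()

# After training, export the embedding matrix for later use (placeholder path).
torch.save(model.center_emb.weight.detach(), "embeddings.pt")
```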
Workflow and Methodology
Workflow
- Data Preparation: Acquire and prepare raw text by eliminating noise and standardizing the text for uniformity in processing.
- Vocabulary Building: Build a vocabulary that maps each word to an index and prepare word counts for negative sampling.
- Sample Generation: Create positive center-context word pairs with a sliding context window and draw negative samples.
- Model Development: Implement the Skip-Gram model in PyTorch, train it with the Adam optimizer, and track the loss during training.
- Embedding Storage: Save the word embeddings and associated weights after training for later use in downstream tasks.
- Visualization: Use t-SNE to project the learned embeddings into two dimensions and plot how related words cluster together (see the sketch after this list).
- Validation: Validate the generated embeddings by computing similarities and distances between words and checking that they match the words' meanings.
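A minimal sketch of the visualization step, assuming a trained embedding matrix and an index-to-word mapping already exist; the placeholder arrays, perplexity, and output file name are illustrative.

```python
# Sketch: projecting learned embeddings to 2-D with t-SNE and plotting them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 100)            # placeholder for the trained matrix
idx2word = {i: f"word{i}" for i in range(200)}   # placeholder vocabulary

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i in range(0, len(coords), 10):              # label a subset to keep the plot readable
    plt.annotate(idx2word[i], (coords[i, 0], coords[i, 1]), fontsize=8)
plt.title("t-SNE projection of word embeddings")
plt.savefig("tsne_embeddings.png")
```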
Methodology
- Preprocessed the text to clean and tokenize for effective input to the model.
- Built a Skip-Gram neural network with embedding layers to capture semantic word relationships.
- Trained the model on center-context pairs while incorporating negative sampling for enhanced learning.
- Applied t-SNE to reduce embedding dimensions for visualizing semantic groupings of words.
- Assessed embeddings through distance and similarity metrics to ensure meaningful word representations.
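For the assessment step, a simple approach is cosine similarity with a nearest-neighbour lookup. The sketch below assumes `embeddings`, `word2idx`, and `idx2word` in the style of the earlier sketches and uses placeholder data.

```python
# Sketch: checking embeddings with cosine similarity and nearest-neighbour lookup.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(word, embeddings, word2idx, idx2word, topn=5):
    """Return the topn words whose vectors are closest (by cosine) to `word`."""
    vec = embeddings[word2idx[word]].reshape(1, -1)
    sims = cosine_similarity(vec, embeddings)[0]
    order = np.argsort(-sims)
    return [(idx2word[i], float(sims[i])) for i in order if i != word2idx[word]][:topn]

# Illustrative call with placeholder data:
embeddings = np.random.rand(200, 100)
word2idx = {f"word{i}": i for i in range(200)}
idx2word = {i: w for w, i in word2idx.items()}
print(most_similar("word0", embeddings, word2idx, idx2word))
```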
Data Collection and Preparation
Data Collection:
For this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can obtain similar datasets from publicly available sources such as Kaggle, the UCI Machine Learning Repository, or company-specific data. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Load the text data into a DataFrame for further processing.
- Clean the data by removing missing values so that the dataset is complete.
- Lowercase the text for uniformity.
- Use regular expressions to remove special characters, numbers, and repeated character sequences.
- Collapse multiple spaces into single spaces for clean formatting.
- Tokenize the text into words with NLTK's word_tokenize.
- Store the tokenized data as a pickle file for later use (a sketch of these steps follows the list).
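Putting those steps together, a preprocessing script might look roughly like the sketch below; the file path, column name, and cleaning patterns are illustrative assumptions.

```python
# Sketch: the preprocessing steps above, with illustrative file and column names.
import re
import pickle
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")                       # tokenizer models (one-time download)

df = pd.read_csv("corpus.csv")               # placeholder path; a "text" column is assumed
df = df.dropna(subset=["text"])              # drop rows with missing text

def clean(text):
    text = text.lower()                              # lowercase for uniformity
    text = re.sub(r"[^a-z\s]", " ", text)            # drop special characters and numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # collapse repeated character runs
    text = re.sub(r"\s+", " ", text).strip()         # normalize whitespace
    return text

df["clean"] = df["text"].apply(clean)
tokens = [word_tokenize(t) for t in df["clean"]]

with open("tokens.pkl", "wb") as f:          # persist tokenized data for later stages
    pickle.dump(tokens, f)
```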