Project Overview
This project explains the Skip-Gram model and explores how it can be used to capture the meanings and relationships of words as numerical vectors called word embeddings. The procedure begins by cleaning and preprocessing the input text to remove irrelevant data and make it ready for analysis. We then build a vocabulary and train a neural network in which the model attempts to predict the context words surrounding a given center word.
The generated embeddings are then saved for later use, and their effectiveness is assessed by searching for similar terms and measuring the distances between word pairs. To visualize the embeddings, we use t-SNE, a dimensionality reduction technique that projects them into two dimensions. This visual representation helps us see how closely related words are positioned to one another.
The focus is on practical tasks, such as keyword search and exploring datasets through the structure of the word space. By combining machine learning with visualization techniques, the project provides insight into how words in a text relate to one another, which makes it useful for many NLP tasks.
Prerequisites
You should have a few skills in place before undertaking this project. Here’s what you should ideally know:
- Python version 3.7 or higher installed on your system.
- Basic knowledge of Python for data analysis and manipulation.
- Knowledge of libraries such as NLTK, Scikit-learn, Pandas, NumPy, and Matplotlib is necessary.
- Jupyter Notebook, VS Code, or another Python-compatible IDE.
- You must have experience with the PyTorch framework.
- Familiarity with text preprocessing techniques is essential.
Approach
The project started with gathering and cleaning the text data, removing noise by normalizing and tokenizing the text. A vocabulary was then created by indexing each word, and word counts were computed to support negative sampling. Using this vocabulary, positive center-context word pairs were generated within a given window, and negative samples were drawn to improve the model's learning.
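As a rough illustration, the sketch below shows how a vocabulary with word indices, positive center-context pairs, and negative samples might be built. The `tokens` list, window size, and number of negative samples are placeholder assumptions, not values from the project.

```python
# Sketch: vocabulary construction and (positive, negative) sample generation.
# `tokens`, the window size, and the negative-sample count are illustrative.
import random
from collections import Counter

tokens = ["the", "cat", "sat", "on", "the", "mat"]  # placeholder corpus

counts = Counter(tokens)
word2idx = {w: i for i, (w, _) in enumerate(counts.most_common())}
idx2word = {i: w for w, i in word2idx.items()}
encoded = [word2idx[w] for w in tokens]

def positive_pairs(ids, window=2):
    """Yield (center, context) index pairs within a sliding window."""
    for i, center in enumerate(ids):
        lo, hi = max(0, i - window), min(len(ids), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield center, ids[j]

def negative_samples(vocab_size, k=5):
    """Draw k random word indices as negative samples (uniform here for
    simplicity; the unigram^0.75 distribution is the usual choice)."""
    return [random.randrange(vocab_size) for _ in range(k)]

pairs = list(positive_pairs(encoded, window=2))
negs = [negative_samples(len(word2idx), k=5) for _ in pairs]
```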
The Skip-Gram model was implemented in PyTorch, with embedding layers and weight matrices configured to capture semantic relations. The model was trained in batches with the Adam optimizer, and the loss was monitored to confirm that training was progressing. After training, the embedding weights were exported for further use. To make the embeddings interpretable, t-SNE was applied to reduce their dimensionality, and the word embeddings were plotted to show clusters and how words relate to each other semantically. Distance and similarity measurements were then computed on the embeddings to assess how well they represent word relationships.
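The sketch below is a minimal PyTorch version of such a Skip-Gram model with a negative-sampling loss and a single Adam training step. The vocabulary size, embedding dimension, batch shapes, and file name are illustrative assumptions rather than the project's actual configuration.

```python
# Sketch: Skip-Gram with negative sampling in PyTorch (illustrative dimensions).
import torch
import torch.nn as nn

class SkipGram(nn.Module):
    def __init__(self, vocab_size, embed_dim=100):
        super().__init__()
        self.center_emb = nn.Embedding(vocab_size, embed_dim)   # input (center) vectors
        self.context_emb = nn.Embedding(vocab_size, embed_dim)  # output (context) vectors

    def forward(self, center, pos_context, neg_context):
        c = self.center_emb(center)                             # (B, D)
        pos = self.context_emb(pos_context)                     # (B, D)
        neg = self.context_emb(neg_context)                     # (B, K, D)
        pos_score = torch.sum(c * pos, dim=1)                   # (B,)
        neg_score = torch.bmm(neg, c.unsqueeze(2)).squeeze(2)   # (B, K)
        # Negative-sampling loss: reward positive pairs, penalize sampled negatives.
        return -(torch.log(torch.sigmoid(pos_score) + 1e-9).mean()
                 + torch.log(torch.sigmoid(-neg_score) + 1e-9).mean())

vocab_size = 5000  # placeholder
model = SkipGram(vocab_size, embed_dim=100)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

# One illustrative training step on a random batch.
center = torch.randint(0, vocab_size, (32,))
pos = torch.randint(0, vocab_size, (32,))
neg = torch.randint(0, vocab_size, (32, 5))
optimizer.zero_grad()
loss = model(center, pos, neg)
loss.backward()
optimizer.step()

# After training, export the embedding matrix for later use (placeholder path).
torch.save(model.center_emb.weight.detach(), "embeddings.pt")
```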
Workflow and Methodology
Workflow
- Data Preparation: Acquire and prepare raw text by eliminating noise and standardizing the text for uniformity in processing.
- Vocabulary Building: Build a vocabulary that maps each word to an index and prepare word counts for negative sampling.
- Sample Generation: Create positive center-context word pairs with a sliding context window and draw negative samples.
- Model Development: Implement the Skip-Gram model in PyTorch, train it with the Adam optimizer, and track the loss during training.
- Embedding Storage: Save the word embeddings and associated weights after training for later use in downstream tasks.
- Visualization: Use t-SNE to project the learned embeddings into two dimensions and plot how related words cluster together (see the sketch after this list).
- Validation: Validate the generated embeddings by computing similarities and distances between words and checking that they match the words' meanings.
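A minimal sketch of the visualization step, assuming a trained embedding matrix and an index-to-word mapping already exist; the placeholder arrays, perplexity, and output file name are illustrative.

```python
# Sketch: projecting learned embeddings to 2-D with t-SNE and plotting them.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

embeddings = np.random.rand(200, 100)            # placeholder for the trained matrix
idx2word = {i: f"word{i}" for i in range(200)}   # placeholder vocabulary

coords = TSNE(n_components=2, perplexity=30, random_state=42).fit_transform(embeddings)

plt.figure(figsize=(10, 10))
plt.scatter(coords[:, 0], coords[:, 1], s=5)
for i in range(0, len(coords), 10):              # label a subset to keep the plot readable
    plt.annotate(idx2word[i], (coords[i, 0], coords[i, 1]), fontsize=8)
plt.title("t-SNE projection of word embeddings")
plt.savefig("tsne_embeddings.png")
```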
Methodology
- Preprocessed the text to clean and tokenize for effective input to the model.
- Built a Skip-Gram neural network with embedding layers to capture semantic word relationships.
- Trained the model on center-context pairs while incorporating negative sampling for enhanced learning.
- Applied t-SNE to reduce embedding dimensions for visualizing semantic groupings of words.
- Assessed embeddings through distance and similarity metrics to ensure meaningful word representations.
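For the assessment step, a simple approach is cosine similarity with a nearest-neighbour lookup. The sketch below assumes `embeddings`, `word2idx`, and `idx2word` in the style of the earlier sketches and uses placeholder data.

```python
# Sketch: checking embeddings with cosine similarity and nearest-neighbour lookup.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def most_similar(word, embeddings, word2idx, idx2word, topn=5):
    """Return the topn words whose vectors are closest (by cosine) to `word`."""
    vec = embeddings[word2idx[word]].reshape(1, -1)
    sims = cosine_similarity(vec, embeddings)[0]
    order = np.argsort(-sims)
    return [(idx2word[i], float(sims[i])) for i in order if i != word2idx[word]][:topn]

# Illustrative call with placeholder data:
embeddings = np.random.rand(200, 100)
word2idx = {f"word{i}": i for i in range(200)}
idx2word = {i: w for w, i in word2idx.items()}
print(most_similar("word0", embeddings, word2idx, idx2word))
```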
Data Collection and Preparation
Data Collection:
For this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can obtain similar datasets from publicly available sources such as Kaggle, the UCI Machine Learning Repository, or company-specific data. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Load the text data into a DataFrame for further processing.
- Clean the data by removing missing values so that the dataset is complete.
- Lowercase the text for uniformity.
- Use regular expressions to remove special characters, numbers, and repeated character sequences.
- Collapse multiple spaces into single spaces for clean formatting.
- Tokenize the text into words with NLTK's word_tokenize.
- Store the tokenized data as a pickle file for later use (a sketch of these steps follows the list).
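Putting those steps together, a preprocessing script might look roughly like the sketch below; the file path, column name, and cleaning patterns are illustrative assumptions.

```python
# Sketch: the preprocessing steps above, with illustrative file and column names.
import re
import pickle
import pandas as pd
import nltk
from nltk.tokenize import word_tokenize

nltk.download("punkt")                       # tokenizer models (one-time download)

df = pd.read_csv("corpus.csv")               # placeholder path; a "text" column is assumed
df = df.dropna(subset=["text"])              # drop rows with missing text

def clean(text):
    text = text.lower()                              # lowercase for uniformity
    text = re.sub(r"[^a-z\s]", " ", text)            # drop special characters and numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)       # collapse repeated character runs
    text = re.sub(r"\s+", " ", text).strip()         # normalize whitespace
    return text

df["clean"] = df["text"].apply(clean)
tokens = [word_tokenize(t) for t in df["clean"]]

with open("tokens.pkl", "wb") as f:          # persist tokenized data for later stages
    pickle.dump(tokens, f)
```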