Project Overview
This project applies Natural Language Processing (NLP) methods to text-document analysis using word embedding models. First, the text data is preprocessed: cleaned, tokenized, and stripped of stopwords. Three popular models, CBOW, Skip-Gram, and FastText, are then trained on the dataset to create vector representations that capture how words relate to one another.
The project then explores word similarity, analogical reasoning (for instance, 'doctor + medicine - hospital'), and outlier detection within groups of words. To better understand how these models perform, we apply the dimensionality reduction techniques PCA and t-SNE to illustrate how each model arranges words in a 2D space. The goal is to compare how the different embeddings, CBOW, Skip-Gram, and FastText, capture word meanings and relations.
Prerequisites
- Python Programming: Basic knowledge of Python is required for this project.
- NLP Basics: Familiarity with techniques such as tokenization, stopword removal, and word embeddings will be helpful.
- Machine Learning: Experience with training models and evaluating them on text data will be useful.
- Libraries: Knowledge of Python libraries like NumPy, Pandas, NLTK, Gensim, and scikit-learn is necessary.
- Basic Linear Algebra: Understanding vectors, matrices, and cosine similarity is fundamental to making sense of word embeddings.
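Since cosine similarity comes up throughout the project, here is a minimal NumPy sketch of the idea; the 3-dimensional "embeddings" are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the product of norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", made up for illustration only.
doctor = np.array([0.9, 0.1, 0.3])
nurse = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(doctor, nurse))  # Values near 1.0 indicate similar directions.
```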
Approach
The approach begins with preprocessing the text data: cleaning, tokenization, and stopword removal. Three word embedding models, CBOW, Skip-Gram, and FastText, are then trained on the processed text to capture relationships between words in a high-dimensional vector space. The models are compared on word similarity checks, analogical reasoning tests (for example, 'doctor + medicine - hospital'), and the detection of outliers within a group of words. Finally, dimensionality reduction techniques, PCA and t-SNE, are applied to project the high-dimensional word vectors into a 2D space and visualize their relationships. Together, these steps make it possible to compare each model's performance in capturing semantic relationships between words, giving insight into how different word embeddings represent language.
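To make the pipeline concrete, here is a minimal sketch of how the three models might be trained with Gensim; the tiny `sentences` corpus is a stand-in for the preprocessed, tokenized dataset:

```python
from gensim.models import FastText, Word2Vec

# Stand-in corpus: each document is a list of preprocessed tokens.
sentences = [
    ["word", "embeddings", "capture", "meaning"],
    ["models", "learn", "vector", "representations", "of", "words"],
]

# CBOW (sg=0) predicts a word from its surrounding context.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Skip-Gram (sg=1) predicts the surrounding context from a word.
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# FastText also learns character n-grams, which helps with rare or unseen words.
fasttext = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1)
```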
Workflow
- Data Collection: Collect and load the dataset containing the text data (titles and abstracts).
- Preprocessing: Clean the text, tokenize it, and remove stopwords.
- Word Embedding Models: Train CBOW, Skip-Gram, and FastText word embedding models on the processed text.
- Evaluate Models: Evaluate the word embeddings with similarity checks, analogy tests, and anomaly detection.
- Dimensionality Reduction: Apply PCA and t-SNE to visualize the word embeddings in 2D.
- Visualization: Visualize the results with Matplotlib and Plotly for easier interpretation.
- Analysis: Compare and analyze each model's performance based on the embeddings it generates.
Methodology
- Text Preprocessing: Use NLTK to clean and preprocess the text data (tokenization, stopword removal).
- Word2Vec and FastText: Train CBOW, Skip-Gram, and FastText models using Gensim to produce word embeddings.
- Similarity Metrics: Compute similarity between words using cosine similarity of their word vectors.
- Analogical Reasoning: Perform analogical reasoning tasks such as 'doctor + medicine - hospital' to assess the models.
- Anomaly Detection: Use Gensim's doesnt_match method to identify the outlier word in a group.
- Dimensionality Reduction: Reduce the high-dimensional word embeddings with PCA and t-SNE for visualization.
- Visualization: Plot the word embeddings in 2D scatter plots to compare the models visually (see the sketches after this list).
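To make the evaluation steps concrete, here is a minimal sketch using Gensim's built-in methods; `cbow` refers to a model trained as in the earlier sketch, and the example words are assumed to be in the model's vocabulary:

```python
# Word similarity: cosine similarity between two word vectors.
print(cbow.wv.similarity("doctor", "nurse"))

# Nearest neighbours of a word in the embedding space.
print(cbow.wv.most_similar("doctor", topn=5))

# Analogical reasoning via vector arithmetic: doctor + medicine - hospital.
print(cbow.wv.most_similar(positive=["doctor", "medicine"], negative=["hospital"], topn=3))

# Anomaly detection: pick out the word that doesn't match the rest of the group.
print(cbow.wv.doesnt_match(["doctor", "nurse", "hospital", "banana"]))
```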
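The dimensionality-reduction step can be sketched the same way; this example projects a handful of word vectors to 2D with PCA (t-SNE follows the same pattern via sklearn.manifold.TSNE), again assuming the words are in the vocabulary:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["doctor", "nurse", "hospital", "medicine", "banana"]  # illustrative word list
vectors = [cbow.wv[w] for w in words]  # 'cbow' from the earlier training sketch

# Project the high-dimensional embeddings down to two components.
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2D with PCA")
plt.show()
```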
Data Collection and Preparation
Data Collection:
In this project, the dataset comes from a public repository. If you want to work on a real-world problem, you can obtain similar datasets from publicly available sources such as Kaggle, the UCI Machine Learning Repository, or company-specific data. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Loading the Dataset: Import the dataset, which contains text data in the form of titles and abstracts.
- Checking for Missing Values: Detect and handle missing values in the dataset.
- Select Relevant Columns: Extract the text columns "title" and "abstract" for analysis.
- Drop Missing Data: Remove the rows with no values in the selected text columns.
- Merge the Text: Merge the "title" and "abstract" columns into a single column for analysis.
- Text Preprocessing: Clean and tokenize the merged text: lowercase it, remove punctuation, and filter out stopwords.
- Tokenization: Split the text into individual words (tokens) for subsequent processing.
- Lemmatization: Reduce each word to its base form; e.g., "running" becomes "run".
- Preparation for Modeling: Store the preprocessed text in the appropriate format for training the word embedding models (a sketch of these steps follows below).
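As a reference, here is a minimal sketch of this preparation workflow with Pandas and NLTK; the file name papers.csv is a placeholder, while the "title" and "abstract" column names follow the workflow above:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

df = pd.read_csv("papers.csv")  # placeholder file name
df = df.dropna(subset=["title", "abstract"])     # drop rows missing either text column
df["text"] = df["title"] + " " + df["abstract"]  # merge the two columns

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, keep alphabetic tokens, drop stopwords, then lemmatize.
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

# Each row becomes a list of tokens, ready for Word2Vec/FastText training.
df["tokens"] = df["text"].apply(preprocess)
```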