Project Overview
This project applies Natural Language Processing (NLP) methods to text-document analysis using word embedding models. First, the text data is preprocessed: cleaned, tokenized, and stripped of stopwords. Three popular models, CBOW, Skip-Gram, and FastText, are then trained on the dataset to create vector representations that capture how words relate to one another.
The project then explores word similarity, analogical reasoning (for instance, 'doctor + medicine - hospital'), and outlier detection within groups of words. To better understand how these models perform, we apply the dimensionality reduction techniques PCA and t-SNE to illustrate how each model arranges words in a 2D space. The goal is to compare how the different embeddings, CBOW, Skip-Gram, and FastText, capture word meanings and relations.
Prerequisites
- Python Programming: Basic knowledge of Python is required for this project.
- NLP Basics: Familiarity with techniques such as tokenization, stopword removal, and word embeddings will be helpful.
- Machine Learning: Experience with training models and evaluating them on text data will be useful.
- Libraries: Knowledge of Python libraries like NumPy, Pandas, NLTK, Gensim, and scikit-learn is necessary.
- Basic Linear Algebra: Understanding vectors, matrices, and cosine similarity is fundamental to making sense of word embeddings.
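Since cosine similarity comes up throughout the project, here is a minimal NumPy sketch of the idea; the 3-dimensional "embeddings" are invented purely for illustration:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: dot product over the product of norms.
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy 3-dimensional "embeddings", made up for illustration only.
doctor = np.array([0.9, 0.1, 0.3])
nurse = np.array([0.8, 0.2, 0.4])
print(cosine_similarity(doctor, nurse))  # Values near 1.0 indicate similar directions.
```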
Approach
The approach begins with preprocessing the text data: cleaning, tokenization, and stopword removal. Three word embedding models, CBOW, Skip-Gram, and FastText, are then trained on the processed text to capture relationships between words in a high-dimensional vector space. The models are compared on word similarity checks, analogical reasoning tests (for example, 'doctor + medicine - hospital'), and the detection of outliers within a group of words. Finally, dimensionality reduction techniques, PCA and t-SNE, are applied to project the high-dimensional word vectors into a 2D space and visualize their relationships. Together, these steps make it possible to compare each model's performance in capturing semantic relationships between words, giving insight into how different word embeddings represent language.
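To make the pipeline concrete, here is a minimal sketch of how the three models might be trained with Gensim; the tiny `sentences` corpus is a stand-in for the preprocessed, tokenized dataset:

```python
from gensim.models import FastText, Word2Vec

# Stand-in corpus: each document is a list of preprocessed tokens.
sentences = [
    ["word", "embeddings", "capture", "meaning"],
    ["models", "learn", "vector", "representations", "of", "words"],
]

# CBOW (sg=0) predicts a word from its surrounding context.
cbow = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=0)

# Skip-Gram (sg=1) predicts the surrounding context from a word.
skipgram = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)

# FastText also learns character n-grams, which helps with rare or unseen words.
fasttext = FastText(sentences, vector_size=100, window=5, min_count=1, sg=1)
```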
Workflow
- Data Collection: Collect and load the dataset containing the text data (titles and abstracts).
- Preprocessing: Clean the text, tokenize it, and remove stopwords.
- Word Embedding Models: Train CBOW, Skip-Gram, and FastText word embedding models on the processed text.
- Evaluate Models: Evaluate the word embeddings with similarity checks, analogy tests, and anomaly detection.
- Dimensionality Reduction: Apply PCA and t-SNE to visualize the word embeddings in 2D.
- Visualization: Visualize the results with Matplotlib and Plotly for easier interpretation.
- Analysis: Compare and analyze each model's performance based on the embeddings it generates.
Methodology
- Text Preprocessing: Use NLTK to clean and preprocess the text data (tokenization, stopword removal).
- Word2Vec and FastText: Train CBOW, Skip-Gram, and FastText models using Gensim to produce word embeddings.
- Similarity Metrics: Compute similarity between words using cosine similarity of their word vectors.
- Analogical Reasoning: Perform analogical reasoning tasks such as 'doctor + medicine - hospital' to assess the models.
- Anomaly Detection: Use Gensim's doesnt_match method to identify the outlier word in a group.
- Dimensionality Reduction: Reduce the high-dimensional word embeddings with PCA and t-SNE for visualization.
- Visualization: Plot the word embeddings in 2D scatter plots to compare the models visually (see the sketches after this list).
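To make the evaluation steps concrete, here is a minimal sketch using Gensim's built-in methods; `cbow` refers to a model trained as in the earlier sketch, and the example words are assumed to be in the model's vocabulary:

```python
# Word similarity: cosine similarity between two word vectors.
print(cbow.wv.similarity("doctor", "nurse"))

# Nearest neighbours of a word in the embedding space.
print(cbow.wv.most_similar("doctor", topn=5))

# Analogical reasoning via vector arithmetic: doctor + medicine - hospital.
print(cbow.wv.most_similar(positive=["doctor", "medicine"], negative=["hospital"], topn=3))

# Anomaly detection: pick out the word that doesn't match the rest of the group.
print(cbow.wv.doesnt_match(["doctor", "nurse", "hospital", "banana"]))
```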
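The dimensionality-reduction step can be sketched the same way; this example projects a handful of word vectors to 2D with PCA (t-SNE follows the same pattern via sklearn.manifold.TSNE), again assuming the words are in the vocabulary:

```python
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

words = ["doctor", "nurse", "hospital", "medicine", "banana"]  # illustrative word list
vectors = [cbow.wv[w] for w in words]  # 'cbow' from the earlier training sketch

# Project the high-dimensional embeddings down to two components.
coords = PCA(n_components=2).fit_transform(vectors)

plt.scatter(coords[:, 0], coords[:, 1])
for (x, y), word in zip(coords, words):
    plt.annotate(word, (x, y))
plt.title("Word embeddings projected to 2D with PCA")
plt.show()
```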
Data Collection and Preparation
Data Collection:
In this project, the dataset comes from a public repository. If you want to work on a real-world problem, you can obtain similar datasets from publicly available sources such as Kaggle, the UCI Machine Learning Repository, or company-specific data. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Loading the Dataset: Import the dataset, which contains text data in the form of titles and abstracts.
- Checking for Missing Values: Detect and handle missing values in the dataset.
- Select Relevant Columns: Extract the text columns "title" and "abstract" for analysis.
- Drop Missing Data: Remove the rows with no values in the selected text columns.
- Merge the Text: Merge the "title" and "abstract" columns into a single column for analysis.
- Text Preprocessing: Clean and tokenize the merged text: lowercase it, remove punctuation, and filter out stopwords.
- Tokenization: Split the text into individual words (tokens) for subsequent processing.
- Lemmatization: Reduce each word to its base form; e.g., "running" becomes "run".
- Preparation for Modeling: Store the preprocessed text in the appropriate format for training the word embedding models (a sketch of these steps follows below).
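As a reference, here is a minimal sketch of this preparation workflow with Pandas and NLTK; the file name papers.csv is a placeholder, while the "title" and "abstract" column names follow the workflow above:

```python
import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
nltk.download("punkt")
nltk.download("stopwords")
nltk.download("wordnet")

df = pd.read_csv("papers.csv")  # placeholder file name
df = df.dropna(subset=["title", "abstract"])     # drop rows missing either text column
df["text"] = df["title"] + " " + df["abstract"]  # merge the two columns

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

def preprocess(text):
    # Lowercase, tokenize, keep alphabetic tokens, drop stopwords, then lemmatize.
    tokens = word_tokenize(text.lower())
    return [lemmatizer.lemmatize(t) for t in tokens if t.isalpha() and t not in stop_words]

# Each row becomes a list of tokens, ready for Word2Vec/FastText training.
df["tokens"] = df["text"].apply(preprocess)
```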