Topic modeling using K-means clustering to group customer reviews

Have you ever wondered how much useful (or misleading) information a customer review can reveal? This project analyzes customer reviews through sentiment analysis, topic modeling, and clustering.

Project Overview

The goal of this project is to study customer reviews and use them creatively to derive useful insights. Reviews are first cleaned and processed using NLTK and Scikit-learn. Next, each review is assigned a sentiment of positive, neutral, or negative based on its rating, using models such as Random Forest and Naive Bayes, to mention a few. But wait! Thanks to LDA, we can also do some topic modeling and discover topics that are present but not immediately visible. K-Means clustering then lets us group similar reviews and interpret the resulting clusters. Last but not least, we create visualizations such as word clouds and sentiment heat maps. What a wonderful way to demonstrate the potential of data!

Prerequisites

You should build a few skills before undertaking this project. Here’s what you should ideally know:

  • Python version 3.7 or higher installed on your system.
  • Basic knowledge of Python for data analysis and manipulation.
  • Familiarity with libraries such as NLTK, Gensim, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib, pyLDAvis, and WordCloud.
  • A dataset of customer reviews with Rating and Review columns.
  • Jupyter Notebook, VS Code, or another Python-compatible IDE.

Approach

The project begins with data preprocessing, which includes cleaning, tokenizing, and lemmatizing the reviews. Tools such as NLTK are used in this step to maintain consistency across the reviews. Next, machine learning methods such as Random Forest and Naive Bayes classify the reviews as positive, neutral, or negative. Then LDA, a Bayesian technique for topic modeling, is applied to the reviews to surface hidden themes in the customer feedback. K-Means clustering is also implemented to group the reviews and facilitate the identification of trends and patterns. Visualizations such as word clouds, sentiment heat maps, and clustering plots are provided for a better understanding of the analysis. This disciplined methodology ensures a thorough exploration of customer reviews.

Workflow and Methodology

Workflow

  • Data Collection
    • Obtain consumer reviews with specific columns: Rating and Review
  • Data Preprocessing
    • Clean the text by removing punctuation marks, numbers, and stopwords.
    • Tokenize and lemmatize the text with NLTK to standardize it.
  • Exploratory Data Analysis (EDA)
    • Examine the distribution of ratings and review lengths.
    • Visualize the most frequent words with bar charts and word clouds.
  • Sentiment Analysis
    • Map ratings to positive, neutral, or negative sentiment labels.
    • Train models such as Random Forest, Naive Bayes, and Logistic Regression to classify review sentiment.
    • Evaluate the results with accuracy, a confusion matrix, a classification report, and more.
  • Topic Modeling
    • Compile a dictionary and build a corpus from the cleaned-up reviews.
    • Run LDA to uncover latent topics and their characteristic words.
    • Explore the topics interactively with the pyLDAvis library.
  • Clustering
    • Convert the text into a numerical representation using TF-IDF vectors.
    • Apply K-Means clustering to group similar reviews.
    • Use PCA to make the clusters easier to visualize and interpret (see the sketch after this list).
  • Visualization
    • Generate word clouds for each cluster to bring out the most frequently mentioned terms.
    • Plot clusters and topics for easy understanding of patterns and trends.
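
Before the step-by-step code, here is a minimal sketch of the clustering stage (TF-IDF vectors, K-Means, then PCA for plotting). The cleaned_reviews list is a hypothetical stand-in for the preprocessed reviews produced earlier in the workflow, and the cluster count is chosen purely for illustration:

# Minimal clustering sketch; `cleaned_reviews` is a hypothetical placeholder
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

cleaned_reviews = ["great room clean staff friendly",
                   "terrible service dirty bathroom",
                   "average stay nothing special",
                   "lovely pool excellent breakfast"]

# Convert text to TF-IDF vectors
X = TfidfVectorizer(max_features=5000).fit_transform(cleaned_reviews)

# Group reviews into k clusters (k=2 for illustration only)
labels = KMeans(n_clusters=2, random_state=42, n_init=10).fit_predict(X)

# Project to 2D with PCA for plotting
coords = PCA(n_components=2).fit_transform(X.toarray())
plt.scatter(coords[:, 0], coords[:, 1], c=labels, cmap='viridis')
plt.title('K-Means Clusters of Reviews (PCA projection)')
plt.show()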

Methodology

  • Collect the customer reviews and clean the data by removing any unwanted symbols, tokenizing the text, and lemmatizing the words.
  • Map ratings to sentiment labels: Positive, Neutral, or Negative.
  • Train machine learning models such as Random Forest and Naive Bayes to analyze sentiment.
  • Use LDA to extract the latent topics and keywords in customers’ comments (a minimal sketch follows this list).
  • Group similar reviews with K-Means clustering based on their semantic patterns.
  • Present the results as word clouds, heat maps, and clustering plots to get a better view.
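
For the topic-modeling step, a minimal Gensim LDA sketch is shown below. Here cleaned_reviews is again a hypothetical placeholder for the preprocessed review text, and the topic count is chosen purely for illustration:

# Minimal LDA sketch; `cleaned_reviews` is a hypothetical placeholder
from gensim import corpora
from gensim.models import LdaModel

cleaned_reviews = ["great room clean staff friendly",
                   "terrible service dirty bathroom staff rude",
                   "clean room great location staff helpful"]
tokenized = [review.split() for review in cleaned_reviews]

# Map each unique word to an id, then build a bag-of-words corpus
dictionary = corpora.Dictionary(tokenized)
corpus = [dictionary.doc2bow(tokens) for tokens in tokenized]

# Fit an LDA model (num_topics=2 for illustration only)
lda_model = LdaModel(corpus=corpus, id2word=dictionary,
                     num_topics=2, passes=10, random_state=42)

# Show the top words for each discovered topic
for topic_id, words in lda_model.print_topics(num_words=5):
    print(f"Topic {topic_id}: {words}")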

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data. The dataset is provided with this project so that you can work on the same data.

Data Preparation Workflow:

  • Import the dataset with customer reviews along with the ratings provided.
  • Transform the text to lowercase and eliminate numerical information, special symbols, and punctuation marks.
  • Split the reviews into individual words (tokens) using NLTK.
  • Omit stopwords such as ‘the’, ‘and’, and ‘is’ with the help of a built-in NLTK stopword list.
  • Reduce words to their base form using WordNetLemmatizer.
  • Remove words with fewer than three characters to reduce noise.
  • Store the cleaned text for later analysis.

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Installing Necessary Libraries

This code installs libraries for data processing, visualization, topic modeling, and machine learning tasks.

!pip install nltk
!pip install numpy
!pip install pandas
!pip install gensim
!pip install seaborn
!pip install xgboost
!pip install pyLDAvis
!pip install wordcloud
!pip install matplotlib
!pip install scikit-learn

Suppressing Warnings

This code disables all types of warnings to keep the output clean and focused.

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')
warnings.filterwarnings('ignore', category=FutureWarning)
warnings.filterwarnings('ignore', category=DeprecationWarning)

NLTK Data Installation

The following code ensures the availability of basic NLTK data and tools used for text splitting, lemmatization, opinion mining, and the filtration of common words.

import nltk
nltk.download('punkt')
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('vader_lexicon')

Importing Libraries for Text Processing, Visualization, and Machine Learning

This code imports tools for NLP, clustering, classification, dimensionality reduction, sentiment analysis, and performance evaluation, among others.

import re
import gensim
import string
import pyLDAvis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pyLDAvis.gensim_models as gensimvis
from PIL import Image
from gensim import corpora
from wordcloud import WordCloud
from collections import Counter
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import MultinomialNB
from IPython.display import display, HTML
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, adjusted_rand_score

STEP 2:

Loading Data and Checking Dimensions

This code loads the CSV file, prints the dataset’s shape to check the number of rows and columns, and then previews the first few rows.

# Load dataset
file_path = '/content/drive/MyDrive/New 90 Projects/Project_6/hotel_reviews_Datasets.csv'
df = pd.read_csv(file_path)
# Check dimensions and preview the dataset
print(df.shape)
df.head()

Checking the Distribution of Ratings

This code computes the proportion of reviews for each unique rating in the Rating column.

df['Rating'].value_counts(normalize=True)

Visualizing the Rating Distribution

This code uses Seaborn to generate a bar chart, with a distinct color for each bar, showing the number of reviews in each rating category.

# Visualize rating distribution with custom colors
custom_colors = sns.color_palette("viridis", as_cmap=False, n_colors=df['Rating'].nunique())
sns.countplot(data=df, x='Rating', palette=custom_colors)
plt.title('Count of Reviews by Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

Measuring Review Length

This code adds a new column named Length to the dataset, recording the number of characters in each review.

# Count length of the reviews
df['Length'] = df['Review'].apply(len)
df.head()

Visualizing Review Length by Rating

This code employs a Kernel Density Estimation (KDE) plot to visualize the review length distribution corresponding to each rating.

# Visualize length distribution based on the rating with KDE plot
sns.displot(data=df, x='Length', hue='Rating', kind='kde', fill=True, aspect=3)
plt.title('Review Length Distribution by Rating')
plt.xlabel('Length of Review')
plt.ylabel('Density')
plt.show()

Cleaning Text Reviews

This function removes numbers, punctuation, repeated characters, and stopwords from the reviews, and lemmatizes the text as well.

# Define a function to clean the text
def clean_text(text):
    text = re.sub(r'\d+', '', text)  # Remove numbers
    text = re.sub(r'[^\w\s]', '', text)  # Remove punctuation and symbols
    text = re.sub(r'(.)\1\1+', r"\1\1", text)  # Collapse 3+ repeated characters to 2
    text = text.translate(str.maketrans('', '', string.punctuation))  # Strip any remaining punctuation
    text = text.lower()  # Lowercase
    tokens = word_tokenize(text)  # Tokenize into words
    stop_words = set(stopwords.words('english'))
    tokens = [w for w in tokens if not w in stop_words]  # Remove stopwords
    tokens = [w for w in tokens if len(w) > 2]  # Drop words shorter than 3 characters
    lemmatizer = WordNetLemmatizer()
    tokens = [lemmatizer.lemmatize(w) for w in tokens]  # Lemmatize to base forms
    text = ' '.join(tokens)
    return text
# Apply the cleaning function
df['Review'] = df['Review'].apply(clean_text)
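
As a quick sanity check, you can try the function on a made-up review; the exact output depends on your NLTK stopword list and lemmatizer:

# Try clean_text on a sample review (illustrative only)
sample = "The rooms were AMAZING!!! We stayed 3 nights and loved it."
print(clean_text(sample))
# Expected output (approximately): room amazing stayed night loved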

Most Frequent Words Extraction

This code finds and prints the 20 most frequent words in the cleaned reviews, along with their counts.

# Find the most common words
all_words = ' '.join(df['Review']).lower().split()
word_counts = Counter(all_words)
top_words = word_counts.most_common(20)
print(top_words)

Visualizing Most Common Words

This code creates a horizontal bar chart to show the top 20 most frequent words with specific colors.

# Visualize most common words with different colors
fig, ax = plt.subplots()
colors = sns.color_palette("viridis", n_colors=len(top_words))  # Use a color palette for distinct colors
ax.barh([word for (word, count) in top_words], [count for (word, count) in top_words], color=colors)
ax.set_xlabel('Frequency')
ax.set_ylabel('Words')
ax.set_title('Top 20 Most Common Words')
plt.show()

Making a Word Cloud with Mask

This code creatively illustrates the most commonly used words with a word cloud shaped by a user-defined mask image.

# Load mask image (replace 'mask_image.png' with your actual image path)
mask_image = np.array(Image.open('/content/drive/MyDrive/New 90 Projects/Project_6/Hotel_icon.png'))
# Create a word cloud with a background mask
wordcloud = WordCloud(width=800,
                      height=400,
                      background_color='lightyellow',
                      mask=mask_image,  # Use the mask image
                      contour_width=1,  # Optional: outline around words
                      contour_color='Red'  # Optional: color of the outline
                     ).generate_from_frequencies(dict(top_words))
# Plot the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

STEP 3:

Classifying Review Sentiment

This code uses a sentiment analyzer to categorize reviews as positive, negative, or neutral, and then counts the total for each category.

# Define a function to classify the sentiment of a review
sia = SentimentIntensityAnalyzer()
def get_sentiment(review):
    scores = sia.polarity_scores(review)
    sentiment_score = scores['compound']
    if sentiment_score > 0.1:
        return 'positive'
    elif sentiment_score < -0.1:
        return 'negative'
    else:
        return 'neutral'
# Apply function on dataset copy
df2 = df.copy()
df2['Predicted_Sentiment'] = df2['Review'].apply(get_sentiment)
# Print the number of positive, negative, and neutral reviews
print("Number of positive reviews:", len(df2[df2['Predicted_Sentiment'] == 'positive']))
print("Number of negative reviews:", len(df2[df2['Predicted_Sentiment'] == 'negative']))
print("Number of neutral reviews:", len(df2[df2['Predicted_Sentiment'] == 'neutral']))

Mapping Ratings to True Sentiment

This code maps the numeric ratings to negative, neutral, and positive labels and adds them as a new True_Sentiment column.

# Map the rating column to create new column true sentiment
df2['True_Sentiment'] = df2['Rating'].map({1: 'negative',
                                           2: 'negative',
                                           3: 'neutral',
                                           4: 'positive',
                                           5: 'positive'})
df2.head()
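
With both columns in place, a quick way to see how closely the VADER-based labels agree with the rating-based labels is to compare them directly. This is only a rough sanity check, not a substitute for the model evaluation described in the workflow:

# Rough agreement check between predicted and rating-based sentiment
print("Agreement:", accuracy_score(df2['True_Sentiment'], df2['Predicted_Sentiment']))
print(confusion_matrix(df2['True_Sentiment'], df2['Predicted_Sentiment'],
                       labels=['negative', 'neutral', 'positive']))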