Topic modeling using K-means clustering to group customer reviews

Have you ever wondered how to analyze reviews and separate the useful information from the misleading? This project analyzes customer reviews through sentiment analysis, topic modeling, and clustering.

Project Overview

The goal of this project is to study customer reviews and use them creatively to derive useful insights. Reviews are first cleaned and preprocessed using NLTK and Scikit-learn. Next, each review is assigned a sentiment (positive, neutral, or negative) based on its rating, and models such as Random Forest and Naive Bayes are trained to predict that sentiment. But wait! Thanks to LDA, we can also do some topic modeling and uncover topics that are present but not immediately visible. K-Means is a clustering technique that groups similar reviews so that the resulting clusters can be analyzed and interpreted. Last but not least, we build visualizations such as word clouds and sentiment heat maps. What a wonderful way to demonstrate the potential of data!

Prerequisites

Before starting this project, you should ideally have the following:

  • Python 3.7 or higher installed on your system.
  • Basic knowledge of Python for data analysis and manipulation.
  • Familiarity with libraries such as NLTK, Gensim, Scikit-learn, Pandas, NumPy, Seaborn, Matplotlib, pyLDAvis, and WordCloud.
  • A dataset of customer reviews with Rating and Review columns.
  • Jupyter Notebook, VS Code, or another Python-compatible IDE.

Approach

The project begins with data preprocessing, which includes cleaning, tokenizing, and lemmatizing the reviews. NLTK is used for this step to keep the reviews consistent. Next, machine learning models such as Random Forest and Naive Bayes are used to classify the reviews as positive, neutral, or negative. Then LDA, a Bayesian technique for topic modeling, is applied to identify themes in the customer feedback. K-Means is also used to cluster the reviews, which helps reveal trends and patterns. Visualizations such as word clouds, sentiment heat maps, and cluster plots support the analysis. This structured approach allows a thorough examination of customer reviews.

Workflow and Methodology

Workflow

  • Data Collection
    • Obtain consumer reviews with specific columns: Rating and Review
  • Data Preprocessing
    • Clean the text by removing punctuation marks, numbers, and stopwords.
    • Tokenize and lemmatize the text with NLTK to standardize it.
  • Exploratory Data Analysis (EDA)
    • Examine the distribution of ratings and review lengths.
    • Show the most frequent words in bar charts and word clouds.
  • Sentiment Analysis
    • Map ratings to positive, neutral, or negative sentiment.
    • Train models including Random Forest, Naive Bayes, and Logistic Regression to assign sentiments to reviews.
    • Evaluate the results with accuracy, confusion matrices, and classification reports.
  • Topic Modeling
    • Compile a dictionary and build a corpus from the cleaned-up reviews.
    • Apply LDA to uncover underlying topics and their associated words.
    • Use the pyLDAvis library to explore the topics interactively.
  • Clustering
    • Convert the text to a numerical representation using TF-IDF vectors.
    • Run K-Means to cluster the reviews based on their text.
    • Apply PCA to make the clusters easier to visualize and interpret.
  • Visualization
    • Use word clouds to bring out the most frequently mentioned terms in each cluster.
    • Plot clusters and topics for easy understanding of patterns and trends.

Methodology

  • Collect the customer reviews and clean the data by removing unwanted symbols, tokenizing the text, and lemmatizing the words.
  • Map ratings to sentiment labels: positive, neutral, or negative.
  • Train machine learning models such as Random Forest and Naive Bayes to classify sentiment.
  • Use LDA to extract latent topics and keywords from customers’ comments.
  • Use K-Means clustering to group similar reviews based on patterns in their text.
  • Present the results as word clouds, heat maps, and cluster plots.

Data Collection and Preparation

Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle or the UCI Machine Learning Repository, or use company-specific data. We provide the dataset with this project so that you can work with the same data.

Data Preparation Workflow:

  • Import the dataset with customer reviews and their ratings.
  • Convert the text to lowercase and remove numbers, special symbols, and punctuation marks.
  • Split the reviews into individual words with NLTK.
  • Remove stopwords such as ‘the’, ‘and’, and ‘is’ using NLTK's built-in stopword list.
  • Reduce words to their base form using WordNetLemmatizer.
  • Remove words shorter than three characters to reduce noise.
  • Keep the cleaned text for later analysis.

Code Explanation

STEP 1:

Mounting Google Drive

First, mount Google Drive to access the dataset that is stored in the cloud.

from google.colab import drive
drive.mount('/content/drive')

Installing Necessary Libraries

This code installs libraries for data processing, visualization, topic modeling, and machine learning tasks.

!pip install nltk
!pip install numpy
!pip install pandas
!pip install gensim
!pip install seaborn
!pip install xgboost
!pip install pyLDAvis
!pip install wordcloud
!pip install matplotlib
!pip install scikit-learn

Suppressing Warnings

This code suppresses all warnings to keep the output clean and focused.

# Ignore all warnings
import warnings
warnings.filterwarnings('ignore')

NLTK Data Installation

The following code ensures the availability of basic NLTK data and tools used for text splitting, lemmatization, opinion mining, and the filtration of common words.

import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # tokenizer data used by word_tokenize in recent NLTK releases
nltk.download('omw-1.4')
nltk.download('wordnet')
nltk.download('stopwords')
nltk.download('vader_lexicon')

Importing Libraries for Text Processing, Visualization, and Machine Learning

This code imports the tools used for NLP, clustering, classification, dimensionality reduction, sentiment analysis, and performance evaluation.

import re
import gensim
import string
import pyLDAvis
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import pyLDAvis.gensim_models as gensimvis
from PIL import Image
from gensim import corpora
from wordcloud import WordCloud
from collections import Counter
from xgboost import XGBClassifier
from sklearn.svm import LinearSVC
from nltk.corpus import stopwords
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.naive_bayes import MultinomialNB
from IPython.display import display, HTML
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.sentiment.vader import SentimentIntensityAnalyzer
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, adjusted_rand_score

STEP 2:

Loading Data and Checking Dimensions:

This code loads the CSV file, prints the dataset’s shape to check the number of rows and columns, and previews the first few rows.

# Load dataset
file_path = '/content/drive/MyDrive/New 90 Projects/Project_6/hotel_reviews_Datasets.csv'
df = pd.read_csv(file_path)
# Check dataset dimensions and preview the first rows
print(df.shape)
df.head()

Checking the Distribution of Ratings

This code computes the proportion of reviews for each unique rating in the Rating column.

df['Rating'].value_counts(normalize=True)

Visual representation of rating distribution

This code uses Seaborn to draw a bar chart of the number of reviews in each rating category, with a different color for each bar.

# Visualize rating distribution with custom colors
custom_colors = sns.color_palette("viridis", as_cmap=False, n_colors=df['Rating'].nunique())
sns.countplot(data=df, x='Rating', palette=custom_colors)
plt.title('Count of Reviews by Rating')
plt.xlabel('Rating')
plt.ylabel('Count')
plt.show()

Measuring Review Length

This code adds a Length column to the dataset that stores the number of characters in each review.

# Count length of the reviews
df['Length'] = df['Review'].apply(len)
df.head()

Visualization of Review Length by Rating

This code employs a Kernel Density Estimation (KDE) plot to visualize the review length distribution corresponding to each rating.

# Visualize length distribution based on the rating with KDE plot
sns.displot(data=df, x='Length', hue='Rating', kind='kde', fill=True, aspect=3)
plt.title('Review Length Distribution by Rating')
plt.xlabel('Length of Review')
plt.ylabel('Density')
plt.show()

Cleaning Text Reviews

This function removes numbers, punctuation, stopwords, and short words from the reviews, collapses repeated characters, and lemmatizes the remaining words.

# Build the stopword set and lemmatizer once instead of on every call
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Define a function to clean the text
def clean_text(text):
    text = re.sub(r'\d+', '', text)  # remove digits
    text = re.sub(r'[^\w\s]', '', text)  # remove punctuation and symbols
    text = re.sub(r'(.)\1\1+', r'\1\1', text)  # collapse runs of 3+ repeated characters to 2
    text = text.translate(str.maketrans('', '', string.punctuation))  # remove leftover underscores
    text = text.lower()
    tokens = word_tokenize(text)
    tokens = [w for w in tokens if w not in stop_words]  # drop stopwords
    tokens = [w for w in tokens if len(w) > 2]  # drop words shorter than three characters
    tokens = [lemmatizer.lemmatize(w) for w in tokens]  # reduce words to their base form
    return ' '.join(tokens)
# Apply new function
df['Review'] = df['Review'].apply(clean_text)
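
To see what the regular-expression steps do in isolation, here is a minimal sketch that runs them on one made-up review. It uses only the standard library. The tiny stopword set, the sample sentence, and the function name basic_clean are assumptions made for illustration, and lemmatization is left out, so the result will not match the NLTK-based function exactly.

```python
import re

# Tiny illustrative stopword set (an assumption; NLTK's English list is much longer)
DEMO_STOPWORDS = {"the", "and", "is", "was"}

def basic_clean(text):
    """Apply the regex-based cleaning steps and return the surviving tokens."""
    text = re.sub(r"\d+", "", text)              # remove digits
    text = re.sub(r"[^\w\s]", "", text)          # remove punctuation and symbols
    text = re.sub(r"(.)\1\1+", r"\1\1", text)    # collapse runs of 3+ identical characters to 2
    tokens = text.lower().split()                # lowercase and split on whitespace
    return [t for t in tokens if t not in DEMO_STOPWORDS and len(t) > 2]

sample = "Sooooo goooood!!! Stayed 3 nights, room #12 was GREAT."
print(basic_clean(sample))
```

For this sample the function returns ['soo', 'good', 'stayed', 'nights', 'room', 'great']: the digits and symbols disappear, the stretched words are shortened, and the stopword 'was' is dropped.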

Most Frequent Words Extraction

This code finds and prints the 20 most frequent words in the cleaned reviews.

# Find the most common words
all_words = ' '.join(df['Review']).lower().split()
word_counts = Counter(all_words)
top_words = word_counts.most_common(20)
print(top_words)

Visualizing Most Common Words

This code creates a horizontal bar chart to show the top 20 most frequent words with specific colors.

# Visualize most common words with different colors
fig, ax = plt.subplots()
colors = sns.color_palette("viridis", n_colors=len(top_words))  # Use a color palette for distinct colors
ax.barh([word for (word, count) in top_words], [count for (word, count) in top_words], color=colors)
ax.set_xlabel('Frequency')  # horizontal bars: counts run along the x-axis
ax.set_ylabel('Words')
ax.set_title('Top 20 Most Common Words')
plt.show()

Making a Word Cloud with Mask

This code displays the most common words as a word cloud in the shape of a user-defined mask image.

# Load mask image (replace the path with your own image)
mask_image = np.array(Image.open('/content/drive/MyDrive/New 90 Projects/Project_6/Hotel_icon.png'))
# Create a word cloud with a background mask
wordcloud = WordCloud(width=800,
                      height=400,
                      background_color='lightyellow',
                      mask=mask_image,  # Use the mask image
                      contour_width=1,  # Optional: outline around the mask shape
                      contour_color='Red'  # Optional: color of the outline
                     ).generate_from_frequencies(dict(top_words))
# Plot the word cloud
plt.figure(figsize=(12, 6))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()

STEP 3:

Classifying Review Sentiment

This code uses NLTK's VADER sentiment analyzer to categorize each review as positive, negative, or neutral, and then counts the reviews in each category.

# Define a function to classify the sentiment of a review
sia = SentimentIntensityAnalyzer()
def get_sentiment(review):
    scores = sia.polarity_scores(review)
    sentiment_score = scores['compound']
    if sentiment_score > 0.1:
        return 'positive'
    elif sentiment_score < -0.1:
        return 'negative'
    else:
        return 'neutral'
# Apply function on dataset copy
df2 = df.copy()
df2['Predicted_Sentiment'] = df2['Review'].apply(get_sentiment)
# Print the number of positive, negative, and neutral reviews
print("Number of positive reviews:", len(df2[df2['Predicted_Sentiment'] == 'positive']))
print("Number of negative reviews:", len(df2[df2['Predicted_Sentiment'] == 'negative']))
print("Number of neutral reviews:", len(df2[df2['Predicted_Sentiment'] == 'neutral']))

Mapping Ratings to True Sentiment

This code maps the numerical ratings to negative, neutral, and positive labels and stores them in a new column.

# Map the rating column to create new column true sentiment
df2['True_Sentiment'] = df2['Rating'].map({1: 'negative',
                                           2: 'negative',
                                           3: 'neutral',
                                           4: 'positive',
                                           5: 'positive'})
df2.head()

Visualizing Sentiment Confusion Matrix

This code computes the confusion matrix between the true and predicted sentiments and displays it as a heatmap.

# Calculate confusion matrix
cm = confusion_matrix(df2['True_Sentiment'], df2['Predicted_Sentiment'])
# Create heatmap with a different color map
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted Sentiment')
plt.ylabel('True Sentiment')
plt.title('Confusion Matrix for Sentiment Analysis')
plt.show()

Printing out Classification Report

This code prints a detailed classification report with the precision, recall, and F1-score for each sentiment category.

print("\nClassification report:\n", classification_report(df2['True_Sentiment'],
                                                          df2['Predicted_Sentiment']))
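
The numbers in a classification report can be derived directly from a confusion matrix. The sketch below shows that arithmetic in plain Python. The helper name per_class_metrics and the counts in demo_cm are made up for illustration and are not results from this dataset; in practice classification_report does this work for you.

```python
def per_class_metrics(cm, labels):
    """Compute precision, recall, and F1 per class from a confusion matrix.

    Rows of cm are true classes and columns are predicted classes,
    which is the layout scikit-learn's confusion_matrix returns.
    """
    metrics = {}
    for i, label in enumerate(labels):
        tp = cm[i][i]                                # correctly predicted as class i
        predicted_as_i = sum(row[i] for row in cm)   # column sum
        actually_i = sum(cm[i])                      # row sum
        precision = tp / predicted_as_i if predicted_as_i else 0.0
        recall = tp / actually_i if actually_i else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        metrics[label] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics

# Made-up counts for illustration only
demo_cm = [
    [50, 5, 5],
    [10, 20, 10],
    [5, 5, 90],
]
demo_metrics = per_class_metrics(demo_cm, ["negative", "neutral", "positive"])
for label, scores in demo_metrics.items():
    print(label, {name: round(value, 3) for name, value in scores.items()})
```

Precision divides the diagonal entry by its column total (everything predicted as that class), while recall divides it by its row total (everything that truly belongs to that class).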

Ratings to Numeric Sentiment Mapping

This code creates a new column, Sentiment, that maps ratings to numeric values: 2 for positive, 1 for neutral, and 0 for negative.

# Define function for new column sentiment
positive = [4, 5]
neutral = [3]
negative = [1, 2]
def map_sentiment(rating):
    if rating in positive:
        return 2
    elif rating in neutral:
        return 1
    else:
        return 0
df['Sentiment']= df['Rating'].apply(map_sentiment)

STEP 4:

Getting Data Ready for Modeling

The following code converts the reviews to TF-IDF features and then splits the data into training and test sets for sentiment classification.

# Prepare data for modeling
tfidf = TfidfVectorizer(ngram_range=(1, 3), max_features=10000, tokenizer = word_tokenize)
X = tfidf.fit_transform(df['Review'])
y = df['Sentiment']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.25, random_state=24)
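
The ngram_range=(1, 3) setting means that single words, word pairs, and word triples all become candidate features, and max_features then keeps only the most frequent ones. As a rough illustration of the idea (not the library's implementation), the sketch below lists the n-grams produced from one short token list. The function name extract_ngrams and the sample tokens are assumptions made for this example.

```python
def extract_ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams of length n_min..n_max as space-joined strings."""
    grams = []
    for n in range(n_min, n_max + 1):
        for start in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[start:start + n]))
    return grams

demo_tokens = ["great", "location", "friendly", "staff"]
demo_grams = extract_ngrams(demo_tokens)
print(demo_grams)
```

Four tokens yield nine candidate features here: four unigrams, three bigrams, and two trigrams. This is why the vocabulary grows quickly and a cap such as max_features is useful.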

Training a Random Forest Model

This code trains a Random Forest classifier on the training data, predicts the sentiment of the test reviews, and reports accuracy and classification metrics.

# Build the model
rf = RandomForestClassifier()
rf.fit(X_train, y_train)
predicted_rf = rf.predict(X_test)
# Calculate accuracy and print classification report
accuracy_rf = accuracy_score(y_test, predicted_rf)
print('Accuracy:', accuracy_rf)
print('Classification Report:')
print(classification_report(y_test, predicted_rf))

Visualizing Confusion Matrix for Random Forest

In this code, a heatmap is constructed to analyze the actual versus predicted sentiments for the Random Forest model.

# Build confusion matrix
cm_rf = confusion_matrix(y_test, predicted_rf)
# Create heatmap
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm_rf, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted sentiment')
plt.ylabel('True sentiment')
plt.show()

Visualization of Classification Metrics

The following code generates a bar graph indicating the precision, recall and F1-score classification metrics for each class as predicted by a Random Forest.

report = classification_report(y_test, predicted_rf, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Classification Report - Random Forest')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.show()

Training a Multinomial Naive Bayes Model

This code trains a Multinomial Naive Bayes classifier on the training data, predicts the sentiment of the test reviews, and reports accuracy and classification metrics.

# Build the model
nb = MultinomialNB()
nb.fit(X_train, y_train)
predicted_nb = nb.predict(X_test)
# Calculate accuracy and print classification report
accuracy_nb = accuracy_score(y_test, predicted_nb)
print('Accuracy:', accuracy_nb)
print('Classification Report:')
print(classification_report(y_test, predicted_nb))

Visualizing Confusion Matrix for Multinomial Naive Bayes

In this code, a heatmap is constructed to analyze the actual versus predicted sentiments for the Multinomial Naive Bayes model.

# Build confusion matrix
cm_nb = confusion_matrix(y_test, predicted_nb)
# Create heatmap
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm_nb, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted sentiment')
plt.ylabel('True sentiment')
plt.show()

Visualization of Classification Metrics

The following code generates a bar graph indicating the precision, recall, and F1-score classification metrics for each class as predicted by the Multinomial Naive Bayes model.

report = classification_report(y_test, predicted_nb, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Classification Report - Naive Bayes')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.show()

Training an XGBoost Model

This code trains an XGBoost classifier on the training data, predicts the sentiment of the test reviews, and reports accuracy and classification metrics.

# Build the model
xgb = XGBClassifier()
xgb.fit(X_train,y_train)
predicted_xgb = xgb.predict(X_test)
# Calculate accuracy and print classification report
accuracy_xgb = accuracy_score(y_test, predicted_xgb)
print('Accuracy:', accuracy_xgb)
print('Classification Report:')
print(classification_report(y_test, predicted_xgb))

Visualizing Confusion Matrix for XGBoost

This code constructs a heatmap to analyze the actual versus predicted sentiments for the XGBoost model.

# Build confusion matrix
cm_xgb = confusion_matrix(y_test, predicted_xgb)
# Create heatmap
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm_xgb, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted sentiment')
plt.ylabel('True sentiment')
plt.show()

Visualization of Classification Metrics

The following code generates a bar graph indicating the precision, recall and F1-score classification metrics for each class as predicted by an XGBoost model.

report = classification_report(y_test, predicted_xgb, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Classification Report - XGBoost')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.show()

Training a Logistic Regression Model

This code trains a Logistic Regression model on the training data, predicts the sentiment of the test reviews, and reports accuracy and classification metrics.

# Build model
lr = LogisticRegression(max_iter=1000)
lr.fit(X_train, y_train)
predicted_lr = lr.predict(X_test)
# Calculate accuracy and print classification report
accuracy_lr = accuracy_score(y_test, predicted_lr)
print('Accuracy:', accuracy_lr)
print('Classification Report:')
print(classification_report(y_test, predicted_lr))

Visualizing Confusion Matrix for Logistic Regression

This code constructs a heatmap to analyze the actual versus predicted sentiments for the Logistic Regression model.

# Build confusion matrix
cm_lr = confusion_matrix(y_test, predicted_lr)
# Create heatmap
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm_lr, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted sentiment')
plt.ylabel('True sentiment')
plt.show()

Visualization of Classification Metrics

The following code generates a bar graph indicating the precision, recall, and F1-score classification metrics for each class as predicted by the Logistic Regression model.

report = classification_report(y_test, predicted_lr, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Classification Report - Logistic Regression')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.show()

Training a Linear Support Vector Model

This code trains a Linear Support Vector Classifier on the training data, predicts the sentiment of the test reviews, and reports accuracy and classification metrics.

# Build model
svc = LinearSVC(random_state=42)
svc.fit(X_train, y_train)
predicted_svc = svc.predict(X_test)
# Calculate accuracy and print classification report
accuracy_svc = accuracy_score(y_test, predicted_svc)
print('Accuracy:', accuracy_svc)
print('Classification Report:')
print(classification_report(y_test, predicted_svc))

Visualizing Confusion Matrix for Linear Support Vector

This code constructs a heatmap to analyze the actual versus predicted sentiments for the Linear Support Vector model.

# Build confusion matrix
cm_svc = confusion_matrix(y_test, predicted_svc)
# Create heatmap
labels = ['Negative', 'Neutral', 'Positive']
sns.heatmap(cm_svc, annot=True, cmap='coolwarm', fmt='g', xticklabels=labels, yticklabels=labels)
plt.xlabel('Predicted sentiment')
plt.ylabel('True sentiment')
plt.show()

Visualization of Classification Metrics

The following code generates a bar graph indicating the precision, recall, and F1-score classification metrics for each class as predicted by the Linear Support Vector Classifier.

report = classification_report(y_test, predicted_svc, output_dict=True)
report_df = pd.DataFrame(report).transpose()
report_df[['precision', 'recall', 'f1-score']].plot(kind='bar', figsize=(10, 6), color=['skyblue', 'salmon', 'lightgreen'])
plt.title('Classification Report - Linear Support Vector Classification')
plt.xlabel('Classes')
plt.ylabel('Scores')
plt.xticks(rotation=45)
plt.legend(loc='lower right')
plt.show()

Analysis of the Performance of Various Models

This code generates a DataFrame that facilitates the comparison of accuracy scores for various models and arranges them in a descending order.

# Compare models performance
Models = ['Random Forest', 'Multinomial Naive Bayes', 'XGBoost', 'Logistic Regression', 'SVC']
Scores = [accuracy_rf, accuracy_nb, accuracy_xgb, accuracy_lr, accuracy_svc]
performance = pd.DataFrame(list(zip(Models, Scores)),
                          columns = ['Models', 'Accuracy_score'])\
                            .sort_values('Accuracy_score', ascending=False)
performance

The following code compares the model accuracies in a bar chart, with the percentage printed above each bar.

# Model names and accuracy scores
Models = ['Random Forest', 'Naive Bayes', 'XGBoost', 'Logistic Regression', 'SVC']
Scores = [accuracy_rf, accuracy_nb, accuracy_xgb, accuracy_lr, accuracy_svc]
# Plotting the model accuracy comparison
plt.figure(figsize=(10, 6))
bars = plt.bar(Models, Scores, color=['skyblue', 'salmon', 'lightgreen', 'orange', 'purple'])
# Adding accuracy percentage text above each bar
for bar, score in zip(bars, Scores):
    plt.text(bar.get_x() + bar.get_width() / 2, bar.get_height() + 0.02, f'{score * 100:.0f}%',
             ha='center', color='red', fontsize=12, fontweight='bold')
# Setting labels and title
plt.xlabel("Models")
plt.ylabel("Accuracy Score")
plt.title("Model Accuracy Comparison")
plt.ylim(0, 1.1)  # Extend y-axis limit slightly above 1 to fit the percentage text
plt.show()

STEP 5:

Topic Modeling Data Preparation

In this step, the reviews are lowercased and tokenized, stopwords and short words are removed, and the remaining tokens are lemmatized to prepare them for topic modeling.

#Prepare data for topic modeling
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()
# Preprocess the reviews
def preprocess(review):
    review = review.lower()
    tokens = nltk.word_tokenize(review)
    tokens = [token for token in tokens if token not in stop_words]
    tokens = [token for token in tokens if len(token) > 2]
    tokens = [lemmatizer.lemmatize(token) for token in tokens]
    return tokens
# Apply preprocessing to the reviews
reviews = [preprocess(review) for review in df['Review']]

Performing Topic Modeling with LDA

This code builds a dictionary and a corpus from the preprocessed reviews, trains an LDA model with 5 topics, and prints the most relevant words for each topic.

# Create a dictionary and corpus for the reviews
dictionary = corpora.Dictionary(reviews)
corpus = [dictionary.doc2bow(review) for review in reviews]
# Train an LDA model on the corpus
lda_model = gensim.models.LdaModel(corpus=corpus, id2word=dictionary, num_topics=5)
# Print the topics and the top words for each topic
for topic in lda_model.show_topics(num_topics=5):
    print('Topic', topic[0])
    print('Top words:', topic[1], '\n')
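
The dictionary and corpus built above are a bag-of-words representation: every distinct token gets an integer id, and each review becomes a list of (token_id, count) pairs. The plain-Python sketch below mimics that shape on two made-up token lists. The helper names build_vocab and to_bow are hypothetical, and Gensim's Dictionary may assign ids in a different order than this simplified version.

```python
from collections import Counter

def build_vocab(docs):
    """Assign an integer id to each distinct token, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for token in doc:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(doc, vocab):
    """Convert a token list to (token_id, count) pairs sorted by token id."""
    counts = Counter(vocab[token] for token in doc)
    return sorted(counts.items())

# Two made-up tokenized reviews for illustration
demo_docs = [["room", "clean", "room"], ["staff", "friendly", "clean"]]
demo_vocab = build_vocab(demo_docs)
demo_corpus = [to_bow(doc, demo_vocab) for doc in demo_docs]
print(demo_vocab)
print(demo_corpus)
```

Word order is discarded in this representation, so LDA only sees which words occur in a review and how often.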

Visualizing Topics with pyLDAvis

This code widens the Colab notebook output to full width and uses pyLDAvis to draw an interactive visualization of the LDA topics.

display(HTML("<style>.container { max-width:100% !important; }</style>"))
display(HTML("<style>.output_result { max-width:100% !important; }</style>"))
display(HTML("<style>.output_area { max-width:100% !important; }</style>"))
display(HTML("<style>.input_area { max-width:100% !important; }</style>"))
# Visualize the topics using pyLDAvis
warnings.filterwarnings("ignore", category=FutureWarning)  # suppress unnecessary warnings
vis = gensimvis.prepare(lda_model, corpus, dictionary)
pyLDAvis.display(vis)

K-Means Clustering of Negative Reviews

This code groups the negative reviews into 3 clusters using TF-IDF vectorization and K-Means, then compares the clusters with the ratings using the Adjusted Rand Index.

# Choose only negative reviews
df_neg = df[df['Rating'] <= 2]
# Convert text to numerical vectors using TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df_neg['Review'])
# Cluster the documents using K-Means algorithm
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=100, n_init=1, random_state=0)
kmeans.fit(X)
# Evaluate the performance of the clustering using adjusted Rand index
y_true = df_neg['Rating'].values
y_pred = kmeans.labels_
print('Adjusted Rand index:', adjusted_rand_score(y_true, y_pred))
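
The Adjusted Rand Index compares two groupings by looking at how pairs of items are placed together, so the label values themselves do not need to match. That is why ratings (1 and 2) can be compared with cluster ids here. The sketch below is a simplified plain-Python version of the standard pair-counting formula, written only to show the arithmetic; the function name adjusted_rand_index and the toy labels are assumptions, and adjusted_rand_score remains the function to use in practice.

```python
from collections import Counter
from math import comb

def adjusted_rand_index(labels_true, labels_pred):
    """Pair-counting Adjusted Rand Index for two labelings of the same items."""
    n = len(labels_true)
    if n < 2:
        return 1.0  # no pairs to compare
    # Pairs grouped together in both labelings
    pairs_joint = sum(comb(c, 2) for c in Counter(zip(labels_true, labels_pred)).values())
    # Pairs grouped together in each labeling separately
    pairs_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    pairs_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())
    expected = pairs_true * pairs_pred / comb(n, 2)   # agreement expected by chance
    max_index = (pairs_true + pairs_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings put everything in one group
    return (pairs_joint - expected) / (max_index - expected)

# Same grouping with swapped label names
print(adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0]))
# Groupings that disagree on every pair
print(adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1]))
```

A score of 1.0 means the two groupings are identical up to renaming, values near 0 mean the agreement is about what chance would give, and negative values mean less agreement than chance.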

PCA and Top Terms Cluster Analysis

The following code uses PCA to reduce the TF-IDF vectors to 2 dimensions and prints the 10 highest-weighted terms in each cluster to give an idea of its theme.

# Reduce the dimensionality of the vectors to 2 using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X.toarray())
# Print the top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(num_clusters):
    print(f"Cluster {i+1} top terms:", [terms[ind] for ind in order_centroids[i, :10]])
    print('-------')
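
The argsort()[:, ::-1] idiom is what turns cluster centers into ranked term lists: each row of the centroid matrix holds one weight per vocabulary term, and sorting the column indices by weight gives the most characteristic terms first. The small NumPy sketch below shows this on made-up values; demo_centers and demo_terms are hypothetical and are not taken from the fitted model.

```python
import numpy as np

# Made-up centroid weights: 2 clusters x 3 terms (illustrative values only)
demo_centers = np.array([
    [0.10, 0.70, 0.20],
    [0.50, 0.05, 0.45],
])
demo_terms = np.array(["bed", "noise", "staff"])

# argsort() orders column indices by ascending weight; [:, ::-1] reverses each row
demo_order = demo_centers.argsort()[:, ::-1]

# Keep the two highest-weighted terms per cluster
demo_top = [[str(demo_terms[j]) for j in row[:2]] for row in demo_order]
print(demo_top)
```

For these values the first cluster is led by 'noise' and 'staff' and the second by 'bed' and 'staff'.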

The Generated Text Clusters

The following code plots the 3 clusters of negative reviews in two dimensions, using a different color for each cluster.

# Plot the clusters
colors = ['red', 'green', 'blue']
for i in range(num_clusters):
    plt.scatter(X_pca[kmeans.labels_ == i, 0], X_pca[kmeans.labels_ == i, 1], s=50, c=colors[i], label='Cluster {}'.format(i + 1))
plt.legend()
plt.title('Text Clustering using K-Means')
plt.show()

Visualization of Word Clouds for Clusters

The following code creates a word cloud for each negative review cluster, showing its most common words on a different background color and inside a mask shape.

# Set up the plot with 1 row and 3 columns
fig, axes = plt.subplots(1, num_clusters, figsize=(20, 8))  # Adjust figsize as needed
fig.suptitle("Most Frequent Words in Each Cluster", fontsize=16)
# Load the mask image (replace with your actual path)
mask_image = np.array(Image.open('/content/drive/MyDrive/New 90 Projects/Project_6/Hotel_icon.png'))
# Define different background colors for each cluster
background_colors = ['white', 'lightblue', 'lightyellow']
# Generate and plot word clouds for each cluster
for i in range(num_clusters):
    cluster_reviews = df_neg['Review'][kmeans.labels_ == i]
    cluster_text = ' '.join(cluster_reviews)
    # Create the word cloud with different background colors
    wordcloud = WordCloud(width=800,
                          height=400,
                          background_color=background_colors[i],
                          mask=mask_image,
                          contour_width=1,
                          contour_color='red'
                         ).generate(cluster_text)
    # Plot the word cloud on the corresponding subplot
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].axis('off')
    axes[i].set_title(f'Cluster {i+1}', fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.95])  # Adjust layout to fit the title
plt.show()

K-Means Clustering of Positive Reviews

This code groups the positive reviews into 3 clusters using TF-IDF vectorization and K-Means, then compares the clusters with the ratings using the Adjusted Rand Index.

# Choose only positive reviews
df_pos = df[df['Rating'] >= 4]
# Convert text to numerical vectors using TF-IDF vectorizer
vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(df_pos['Review'])
# Cluster the documents using K-Means algorithm
num_clusters = 3
kmeans = KMeans(n_clusters=num_clusters, init='k-means++', max_iter=100, n_init=1, random_state=0)
kmeans.fit(X)
# Evaluate the performance of the clustering using adjusted Rand index
y_true = df_pos['Rating'].values
y_pred = kmeans.labels_
print('Adjusted Rand index:', adjusted_rand_score(y_true, y_pred))
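
The ratings used as y_true above are only a rough stand-in for ground truth: positive reviews carry just the ratings 4 and 5, while K-Means forms 3 clusters, so a low Adjusted Rand Index does not by itself mean the clusters are poor. A label-free check such as the silhouette score can complement it. The sketch below is a minimal illustration under stated assumptions: toy_reviews and toy_ratings are made-up data, and n_init=10 is an illustrative choice rather than the project's setting.

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import adjusted_rand_score, silhouette_score

# Hypothetical toy data standing in for df_pos['Review'] and df_pos['Rating']
toy_reviews = [
    "great room and friendly staff",
    "friendly staff and great room service",
    "lovely pool and tasty breakfast",
    "tasty breakfast near the lovely pool",
    "perfect location close to the beach",
    "close to the beach, perfect location",
]
toy_ratings = [5, 4, 5, 4, 5, 4]

# Same pipeline as above: TF-IDF vectors, then K-Means with 3 clusters
X_toy = TfidfVectorizer(stop_words='english').fit_transform(toy_reviews)
kmeans_toy = KMeans(n_clusters=3, init='k-means++', n_init=10, random_state=0).fit(X_toy)

# ARI compares the clusters with external labels (here, the ratings)
ari = adjusted_rand_score(toy_ratings, kmeans_toy.labels_)
# The silhouette score needs no labels; values closer to 1 mean tighter, better separated clusters
sil = silhouette_score(X_toy, kmeans_toy.labels_)
# ARI ignores how clusters are numbered: swapping the cluster ids still counts as a perfect match
relabel_ari = adjusted_rand_score([0, 0, 1, 1], [1, 1, 0, 0])

print('Adjusted Rand index vs. ratings:', ari)
print('Silhouette score:', sil)
print('ARI for a relabeled but identical grouping:', relabel_ari)
```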

PCA and Top Terms Cluster Analysis for Positive Reviews

The following code uses PCA to reduce the TF-IDF vectors to 2 dimensions and prints the 10 highest-weighted terms of each cluster centroid to give an idea of the clusters' themes.

# Reduce the dimensionality of the vectors to 2 using PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X.toarray())
# Print the top terms per cluster
order_centroids = kmeans.cluster_centers_.argsort()[:, ::-1]
terms = vectorizer.get_feature_names_out()
for i in range(num_clusters):
    print(f"Cluster {i+1} top terms:", [terms[ind] for ind in order_centroids[i, :10]])
    print('-------')
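
To see how the argsort()[:, ::-1] step picks the top terms, here is a small sketch with made-up numbers. The terms and centroids arrays are hypothetical stand-ins for vectorizer.get_feature_names_out() and kmeans.cluster_centers_.

```python
import numpy as np

# Hypothetical vocabulary and centroid weights (2 clusters, 3 terms)
terms = np.array(['room', 'staff', 'pool'])
centroids = np.array([[0.1, 0.7, 0.2],
                      [0.5, 0.0, 0.4]])

# argsort sorts indices by ascending weight; [:, ::-1] reverses each row,
# so the highest-weighted term index comes first
order = centroids.argsort()[:, ::-1]

# Keep the 2 highest-weighted terms per cluster
top_terms = [[str(terms[j]) for j in order[i, :2]] for i in range(len(centroids))]
print(top_terms)  # [['staff', 'pool'], ['room', 'pool']]
```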

The Generated Text Clusters

The following code plots the 3 clusters of positive reviews in the two-dimensional PCA space, using a different color for each cluster.

# Plot the clusters
colors = ['red', 'green', 'blue']
for i in range(num_clusters):
    plt.scatter(X_pca[kmeans.labels_ == i, 0], X_pca[kmeans.labels_ == i, 1], s=50, c=colors[i], label=f'Cluster {i+1}')  # 1-based label, matching the top-terms output
plt.legend()
plt.title('Text Clustering using K-Means for Positive Reviews')
plt.show()

Visualization of Word Clouds for Clusters

The following code creates a word cloud for each cluster of positive reviews, showing each cluster's most frequent words on a different background color and inside a mask shape.

# Set up the plot with 1 row and 3 columns
fig, axes = plt.subplots(1, num_clusters, figsize=(20, 8))  # Adjust figsize as needed
fig.suptitle("Most Frequent Words in Each Cluster", fontsize=16)
# Load the mask image (replace with your actual path)
mask_image = np.array(Image.open('/content/drive/MyDrive/New 90 Projects/Project_6/Hotel_icon.png'))
# Define different background colors for each cluster
background_colors = ['white', 'lightblue', 'lightyellow']
# Generate and plot word clouds for each cluster
for i in range(num_clusters):
    cluster_reviews = df_pos['Review'][kmeans.labels_ == i]
    cluster_text = ' '.join(cluster_reviews)
    # Create the word cloud with different background colors
    wordcloud = WordCloud(width=800,
                          height=400,
                          background_color=background_colors[i],
                          mask=mask_image,
                          contour_width=1,
                          contour_color='red'
                         ).generate(cluster_text)
    # Plot the word cloud on the corresponding subplot
    axes[i].imshow(wordcloud, interpolation='bilinear')
    axes[i].axis('off')
    axes[i].set_title(f'Cluster {i+1}', fontsize=14)
plt.tight_layout(rect=[0, 0, 1, 0.95])  # Adjust layout to fit the title
plt.show()
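
A word cloud sizes words by how often they occur, so a plain frequency count gives the same information as numbers. The sketch below is only an approximation of what the word cloud shows: cluster_texts is made-up data, and the small stopword set is a hypothetical stand-in for the stopword filtering that WordCloud applies by default.

```python
from collections import Counter
import re

def top_words(texts, n=3, stopwords=frozenset({"the", "and", "was", "a"})):
    """Return the n most frequent non-stopword words across the given texts."""
    words = re.findall(r"[a-z']+", " ".join(texts).lower())
    counts = Counter(w for w in words if w not in stopwords)
    return counts.most_common(n)

# Hypothetical reviews grouped by cluster id
cluster_texts = {
    0: ["The room was clean", "Clean room and quiet room"],
    1: ["The staff was friendly", "Friendly and helpful staff"],
}

top_by_cluster = {c: top_words(t, n=2) for c, t in cluster_texts.items()}
for cluster_id, words in top_by_cluster.items():
    print(f"Cluster {cluster_id + 1}:", words)
```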

Conclusion

In this project, we showed the potential of Natural Language Processing (NLP) and machine learning for analyzing customer reviews. Text preprocessing techniques such as tokenization, lemmatization, and stopword removal prepared the reviews for sentiment analysis, topic detection, and K-Means clustering. Models such as Random Forest and Naive Bayes were trained to predict sentiment, while LDA was used for topic modeling. Visuals such as word clouds and cluster plots made the results easier to interpret. Overall, the project shows how text analytics can help companies make better decisions based on customer feedback.

Challenges New Coders Might Face

  • Challenge: Handling noisy or unstructured text data.
    Solution: Clean the text by removing special symbols, digits, and extra spaces.

  • Challenge: Lack of labeled datasets for sentiment analysis.
    Solution: Map the ratings to three standard sentiment classes (positive, neutral, and negative) to generate labels.

  • Challenge: The curse of dimensionality in high-dimensional text data, which affects clustering and classification results.
    Solution: Use TF-IDF vectorization and dimensionality reduction techniques such as PCA to control dimensionality.

  • Challenge: Difficulty in understanding hidden topics from LDA results.
    Solution: Make the topics easier to interpret with visualization tools such as pyLDAvis.

  • Challenge: Variation in model performance across different datasets.
    Solution: Compare several models (e.g., Random Forest, Naive Bayes) using evaluation metrics, and tune their hyperparameters.
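
The second solution above, generating labels from ratings, can be sketched as a small mapping function. The threshold of 4 for positive matches the df_pos filter used earlier; treating 3 as neutral and anything lower as negative is an assumption, and the Sentiment column name is hypothetical.

```python
def rating_to_sentiment(rating):
    """Map a numeric star rating to a sentiment label.

    Assumed thresholds: >= 4 positive, >= 3 neutral, otherwise negative.
    """
    if rating >= 4:
        return "positive"
    if rating >= 3:
        return "neutral"
    return "negative"

labels = [rating_to_sentiment(r) for r in [5, 4, 3, 2, 1]]
print(labels)  # ['positive', 'positive', 'neutral', 'negative', 'negative']

# With a pandas DataFrame that has a Rating column, the same function could be applied as:
# df['Sentiment'] = df['Rating'].apply(rating_to_sentiment)
```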

Frequently Asked Questions (FAQs)

Question 1: What does "sentiment analysis" mean for customer reviews?
Answer: Sentiment analysis classifies a review as positive, negative, or neutral using Natural Language Processing and machine learning.

Question 2: What is LDA topic modeling, and why is it relevant to customer feedback analysis?
Answer: LDA topic modeling reveals hidden patterns in review text, helping businesses find themes or topics that recur in customer feedback.

Question 3: Why is K-Means clustering used in text analysis?
Answer: K-Means clustering groups similar reviews together, which helps detect and segment the patterns emerging from the reviews.

Question 4: Which machine learning models are suitable for performing sentiment analysis?
Answer: Commonly used models include Random Forest, Naive Bayes, and Logistic Regression; their prediction accuracy can be compared to choose among them.

Question 5: How do I carry out text preprocessing for NLP?
Answer: Text preprocessing comes before any analysis of a text corpus: the text is cleaned, tokenized, normalized (for example, by lemmatization), and filtered for stopwords to bring the data into a consistent format.
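
As a rough illustration of the preprocessing steps in the last answer, here is a minimal sketch that uses only the standard library. The STOPWORDS set is a small hypothetical list rather than NLTK's, and lemmatization is left out because it needs a downloaded corpus; NLTK's WordNetLemmatizer would, for example, reduce "pools" to "pool".

```python
import re

# Hypothetical, deliberately tiny stopword list for illustration
STOPWORDS = {"the", "a", "an", "and", "was", "were", "is", "it", "to", "of"}

def preprocess(text):
    """Lowercase, strip non-letters, tokenize on whitespace, and drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", " ", text)  # replace digits, punctuation, and symbols with spaces
    tokens = text.split()                  # split() also collapses extra whitespace
    return [t for t in tokens if t not in STOPWORDS]

tokens = preprocess("The room was CLEAN, and the 2 pools were great!!!")
print(tokens)  # ['room', 'clean', 'pools', 'great']
```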

