Project Overview
We commenced the project by collecting and normalizing a set of text statements describing mental health status. Our primary objective was to categorize these statements into known classes. We implemented TF-IDF feature extraction, trained several classification models, and used visualizations for better comprehension of the data and results. Along the way, we addressed class imbalance through resampling and improved model performance by tuning hyperparameters. Finally, we persisted the trained model for future predictions and put it through a real-world scenario test!
Prerequisites
Learners should develop some skills before undertaking this project. Here’s what you should ideally know:
- A fundamental understanding of the Python programming language, core machine learning concepts, and text preprocessing techniques.
- Working knowledge of libraries such as Pandas, NumPy, Scikit-learn, Matplotlib, and Seaborn for data analysis and visualization.
- Knowledge of feature extraction methods such as TF-IDF, and of handling class imbalance through resampling.
- Familiarity with the end-to-end classifier lifecycle: training, evaluation, persistence, and prediction.
- Jupyter Notebook, VS Code, or another Python-compatible IDE.
Approach
Initially, the dataset was preprocessed by removing irrelevant elements such as URLs and special symbols and by tokenizing the content. Stemming was applied to reduce words to their base forms, and feature extraction was performed with TF-IDF alongside other numeric measures such as character and sentence counts. To balance the dataset, Random Over-Sampling was applied so that all categories were fairly represented. Machine learning models including Logistic Regression, Decision Tree, Naive Bayes, and XGBoost were developed, tuned, and evaluated on training and test data using metrics such as accuracy and confusion matrices. Finally, the best-performing model was optimized and applied to new data for prediction.
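As a minimal sketch of the preprocessing step, the snippet below assumes the statements live in a pandas DataFrame with hypothetical `statement` and `status` columns (the actual column names in the dataset may differ) and uses NLTK for tokenization and Porter stemming:

```python
import re

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models; newer NLTK releases may also need "punkt_tab"

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    """Lowercase, strip URLs/handles/special characters, then tokenize and stem."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)              # remove user handles
    text = re.sub(r"[^a-z\s]", " ", text)          # remove special characters and digits
    tokens = word_tokenize(text)
    return " ".join(stemmer.stem(tok) for tok in tokens)

# Hypothetical layout: one text column, one label column.
df = pd.DataFrame({
    "statement": ["Feeling anxious about work again... https://example.com @someone"],
    "status": ["Anxiety"],
})
df["clean_statement"] = df["statement"].apply(preprocess)
print(df["clean_statement"].iloc[0])  # cleaned, stemmed text
```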
Workflow and Methodology
Workflow
- Data Collection and Cleaning: The dataset was assembled and cleaned by removing noise such as URLs, special characters, and user handles from the textual data.
- Text Tokenization and Stemming: The text was tokenized and stemmed in preparation for feature engineering.
- Feature Extraction: TF-IDF was used to extract text features, and other numerical features such as character and sentence counts were incorporated as well.
- Handling Class Imbalance: Random Over-Sampling was applied to balance the class distribution of the dataset.
- Data Splitting: The data was divided into training and testing sets for evaluation purposes.
- Model Training: Machine learning models were trained on the classification task after hyperparameter optimization.
- Model Evaluation: Model performance was assessed using accuracy and confusion matrices, with visual representations (training and evaluation are illustrated in the sketch after this list).
- Model Saving and Testing: The best-performing model was saved and applied to new data to test its validity.
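To make the splitting, training, and evaluation steps concrete, here is a minimal sketch. It substitutes synthetic data for the real TF-IDF features, and the hyperparameter values shown are illustrative rather than the tuned ones:

```python
import numpy as np
from sklearn.datasets import make_classification  # stand-in for the real features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Stand-in data: in the project, X comes from TF-IDF plus numeric counts.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X = np.abs(X)  # MultinomialNB requires non-negative features, like TF-IDF values

# Hold out 20% for testing, stratified to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Illustrative hyperparameters; the project tunes these rather than hard-coding them.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Naive Bayes": MultinomialNB(),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name} accuracy: {accuracy_score(y_test, preds):.3f}")
    print(confusion_matrix(y_test, preds))
```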
Methodology
- Data preprocessing involved cleaning, tokenizing, stemming, and feature extraction with TF-IDF.
- Combined text features with numerical data like character and sentence counts.
- Balanced the dataset using Random Over-Sampling for equal class representation.
- Used machine learning models like Logistic Regression, Decision Tree, Naive Bayes, and XGBoost.
- Evaluated models with performance metrics and visualizations for deeper insights.
- Saved the trained model for future use and verified its reliability by testing it on unseen data (a persistence sketch follows this list).
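A minimal sketch of the save-and-reuse step, assuming joblib for persistence and bundling the vectorizer with the classifier in a scikit-learn pipeline (the file name and toy data are placeholders, not the project's actual artifacts):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real preprocessed statements.
texts = ["feel hopeless and tired", "excited about the new job",
         "cannot stop worrying", "life is good today"]
labels = ["Depression", "Normal", "Anxiety", "Normal"]

# Bundling the vectorizer and classifier keeps prediction one call away.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

joblib.dump(pipeline, "mental_health_model.joblib")  # placeholder file name

# Later, in a fresh session: reload and classify unseen text.
loaded = joblib.load("mental_health_model.joblib")
print(loaded.predict(["I have been feeling very anxious lately"]))
```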
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data sources. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Imported the dataset and removed noise such as URLs, special characters, and user handles.
- Converted the text to lowercase for consistency across the whole dataset.
- Tokenized the text into individual words to support detailed analysis.
- Applied stemming to reduce words to their root forms for uniformity.
- Extracted TF-IDF features to capture the importance of words.
- Added numerical features such as character and sentence counts to enrich the feature set.
- Handled missing values by removing the affected rows.
- Used Random Over-Sampling to balance the dataset (the feature-extraction and balancing steps are sketched below).
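As a minimal sketch of these final preparation steps, the snippet below assumes hypothetical `clean_statement` text and `status` label columns, uses scikit-learn's TfidfVectorizer, and takes RandomOverSampler from the imbalanced-learn package (the sentence count here is a naive punctuation-based approximation):

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical layout: cleaned text plus a label column.
df = pd.DataFrame({
    "clean_statement": ["feel sad and alone", None, "great day today",
                        "worri all the time", "noth feel right"],
    "status": ["Depression", "Normal", "Normal", "Anxiety", "Depression"],
})

# Handle missing values by dropping the affected rows.
df = df.dropna(subset=["clean_statement"]).reset_index(drop=True)

# Numeric features: character and (approximate) sentence counts.
df["char_count"] = df["clean_statement"].str.len()
df["sentence_count"] = df["clean_statement"].str.count(r"[.!?]") + 1

# TF-IDF text features, stacked with the numeric columns.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df["clean_statement"])
X = hstack([X_text, df[["char_count", "sentence_count"]].values])

# Random Over-Sampling duplicates minority-class rows until classes are even.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, df["status"])
print(pd.Series(y_resampled).value_counts())
```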