Project Overview
We commenced the project by collecting and normalizing a set of text statements describing mental health status. Our primary objective was to categorize these statements into known classes. We implemented TF-IDF feature extraction, trained several classification models, and used visualizations for better comprehension of the data and results. Along the way, we addressed class imbalance through resampling and improved model performance by tuning hyperparameters. Finally, we persisted the trained model for future predictions and put it through a real-world scenario test!
Prerequisites
Learners should develop some skills before undertaking this project. Here’s what you should ideally know:
- A fundamental understanding of the Python programming language, core machine learning concepts, and text preprocessing techniques.
- Working knowledge of libraries such as Pandas, NumPy, Scikit-learn, Matplotlib, and Seaborn for data analysis and visualization.
- Knowledge of feature extraction methods such as TF-IDF, and of handling class imbalance through resampling.
- Familiarity with the end-to-end classifier lifecycle: training, evaluation, persistence, and prediction.
- Jupyter Notebook, VS Code, or another Python-compatible IDE.
Approach
Initially, the dataset was preprocessed by removing irrelevant elements such as URLs and special symbols and by tokenizing the content. Stemming was applied to reduce words to their base forms, and feature extraction was performed with TF-IDF alongside other numeric measures such as character and sentence counts. To balance the dataset, Random Over-Sampling was applied so that all categories were fairly represented. Machine learning models including Logistic Regression, Decision Tree, Naive Bayes, and XGBoost were developed, tuned, and evaluated on training and test data using metrics such as accuracy and confusion matrices. Finally, the best-performing model was optimized and applied to new data for prediction.
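As a minimal sketch of the preprocessing step, the snippet below assumes the statements live in a pandas DataFrame with hypothetical `statement` and `status` columns (the actual column names in the dataset may differ) and uses NLTK for tokenization and Porter stemming:

```python
import re

import nltk
import pandas as pd
from nltk.stem import PorterStemmer
from nltk.tokenize import word_tokenize

nltk.download("punkt")  # tokenizer models; newer NLTK releases may also need "punkt_tab"

stemmer = PorterStemmer()

def preprocess(text: str) -> str:
    """Lowercase, strip URLs/handles/special characters, then tokenize and stem."""
    text = text.lower()
    text = re.sub(r"http\S+|www\.\S+", " ", text)  # remove URLs
    text = re.sub(r"@\w+", " ", text)              # remove user handles
    text = re.sub(r"[^a-z\s]", " ", text)          # remove special characters and digits
    tokens = word_tokenize(text)
    return " ".join(stemmer.stem(tok) for tok in tokens)

# Hypothetical layout: one text column, one label column.
df = pd.DataFrame({
    "statement": ["Feeling anxious about work again... https://example.com @someone"],
    "status": ["Anxiety"],
})
df["clean_statement"] = df["statement"].apply(preprocess)
print(df["clean_statement"].iloc[0])  # cleaned, stemmed text
```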
Workflow and Methodology
Workflow
- Data Collection and Cleaning: The dataset was assembled and cleaned by removing noise such as URLs, special characters, and user handles from the textual data.
- Text Tokenization and Stemming: The text was tokenized and stemmed in preparation for feature engineering.
- Feature Extraction: TF-IDF was used to extract text features, and other numerical features such as character and sentence counts were incorporated as well.
- Handling Class Imbalance: Random Over-Sampling was applied to balance the class distribution of the dataset.
- Data Splitting: The data was divided into training and testing sets for evaluation purposes.
- Model Training: Machine learning models were trained on the classification task after hyperparameter optimization.
- Model Evaluation: Model performance was assessed using accuracy and confusion matrices, with visual representations (training and evaluation are illustrated in the sketch after this list).
- Model Saving and Testing: The best-performing model was saved and applied to new data to test its validity.
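To make the splitting, training, and evaluation steps concrete, here is a minimal sketch. It substitutes synthetic data for the real TF-IDF features, and the hyperparameter values shown are illustrative rather than the tuned ones:

```python
import numpy as np
from sklearn.datasets import make_classification  # stand-in for the real features
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier

# Stand-in data: in the project, X comes from TF-IDF plus numeric counts.
X, y = make_classification(n_samples=500, n_features=20, n_informative=10,
                           n_classes=3, random_state=42)
X = np.abs(X)  # MultinomialNB requires non-negative features, like TF-IDF values

# Hold out 20% for testing, stratified to preserve class proportions.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

# Illustrative hyperparameters; the project tunes these rather than hard-coding them.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1000),
    "Decision Tree": DecisionTreeClassifier(max_depth=10, random_state=42),
    "Naive Bayes": MultinomialNB(),
    "XGBoost": XGBClassifier(eval_metric="mlogloss"),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    preds = model.predict(X_test)
    print(f"{name} accuracy: {accuracy_score(y_test, preds):.3f}")
    print(confusion_matrix(y_test, preds))
```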
Methodology
- Data preprocessing involved cleaning, tokenizing, stemming, and feature extraction with TF-IDF.
- Combined text features with numerical data like character and sentence counts.
- Balanced the dataset using Random Over-Sampling for equal class representation.
- Used machine learning models like Logistic Regression, Decision Tree, Naive Bayes, and XGBoost.
- Evaluated models with performance metrics and visualizations for deeper insights.
- Saved the trained model for future use and verified its reliability by testing it on unseen data (a persistence sketch follows this list).
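A minimal sketch of the save-and-reuse step, assuming joblib for persistence and bundling the vectorizer with the classifier in a scikit-learn pipeline (the file name and toy data are placeholders, not the project's actual artifacts):

```python
import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real preprocessed statements.
texts = ["feel hopeless and tired", "excited about the new job",
         "cannot stop worrying", "life is good today"]
labels = ["Depression", "Normal", "Anxiety", "Normal"]

# Bundling the vectorizer and classifier keeps prediction one call away.
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipeline.fit(texts, labels)

joblib.dump(pipeline, "mental_health_model.joblib")  # placeholder file name

# Later, in a fresh session: reload and classify unseen text.
loaded = joblib.load("mental_health_model.joblib")
print(loaded.predict(["I have been feeling very anxious lately"]))
```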
Data Collection and Preparation
Data Collection:
In this project, we collected the dataset from a public repository. If you want to work on a real-world problem, you can find similar datasets in publicly available repositories such as Kaggle and the UCI Machine Learning Repository, or in company-specific data sources. We provide the dataset with this project so that you can work on the same data.
Data Preparation Workflow:
- Imported the dataset and removed noise such as URLs, special characters, and user handles.
- Converted the text to lowercase for consistency across the whole dataset.
- Tokenized the text into individual words to support detailed analysis.
- Applied stemming to reduce words to their root forms for uniformity.
- Extracted TF-IDF features to capture the importance of words.
- Added numerical features such as character and sentence counts to enrich the feature set.
- Handled missing values by removing the affected rows.
- Used Random Over-Sampling to balance the dataset (the feature-extraction and balancing steps are sketched below).
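As a minimal sketch of these final preparation steps, the snippet below assumes hypothetical `clean_statement` text and `status` label columns, uses scikit-learn's TfidfVectorizer, and takes RandomOverSampler from the imbalanced-learn package (the sentence count here is a naive punctuation-based approximation):

```python
import pandas as pd
from imblearn.over_sampling import RandomOverSampler
from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical layout: cleaned text plus a label column.
df = pd.DataFrame({
    "clean_statement": ["feel sad and alone", None, "great day today",
                        "worri all the time", "noth feel right"],
    "status": ["Depression", "Normal", "Normal", "Anxiety", "Depression"],
})

# Handle missing values by dropping the affected rows.
df = df.dropna(subset=["clean_statement"]).reset_index(drop=True)

# Numeric features: character and (approximate) sentence counts.
df["char_count"] = df["clean_statement"].str.len()
df["sentence_count"] = df["clean_statement"].str.count(r"[.!?]") + 1

# TF-IDF text features, stacked with the numeric columns.
tfidf = TfidfVectorizer()
X_text = tfidf.fit_transform(df["clean_statement"])
X = hstack([X_text, df[["char_count", "sentence_count"]].values])

# Random Over-Sampling duplicates minority-class rows until classes are even.
ros = RandomOverSampler(random_state=42)
X_resampled, y_resampled = ros.fit_resample(X, df["status"])
print(pd.Series(y_resampled).value_counts())
```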