Project Overview
The goal of this project is to build a computer vision system that recognizes and classifies hand signs from the American Sign Language (ASL) alphabet using image data. The system takes grayscale images of hand gestures and predicts the corresponding letter. Such a system can help bridge the communication gap between the Deaf and hard-of-hearing community and people who do not know sign language.
We use a popular dataset called Sign Language MNIST, which includes thousands of labeled hand sign images. Using this data, we train various convolutional neural networks (CNNs) from scratch and also apply transfer learning using a pre-trained deep learning model (ResNet50) to improve accuracy.
This kind of system has the potential to be developed further into real-time applications, such as translating sign language into text or speech, which can be used in schools, hospitals, and public services.
Prerequisites
Before you dive into this project, it's important to have some foundational knowledge and tools ready:
- Programming Skills:
- Basic understanding of Python syntax and functions.
- Some familiarity with data structures like lists, dictionaries, and arrays.
- Mathematics and Machine Learning:
- Understanding of how neural networks work.
- Basics of model training, such as epochs, loss functions, accuracy, and overfitting.
- Image Processing:
- Knowing how image data is represented as numeric arrays.
- Understanding grayscale vs. color images and how images are preprocessed.
- Deep Learning Libraries:
- Some experience with TensorFlow or Keras (for building and training models).
- Using libraries such as matplotlib, pandas, and numpy for data handling and visualization.
- Development Environment:
- Familiarity with Google Colab or Jupyter Notebook.
- Ability to install and use Python packages.
Approach
We followed a structured, step-by-step approach:
- Dataset Understanding:
- We used the Sign Language MNIST dataset, which contains labeled grayscale images of hand signs for 24 letters (excluding J and Z).
- Data Preprocessing:
- The raw data was in a CSV format, with pixel values for each image.
- We reshaped and normalized the data to make it suitable for CNN models.
- Model Building:
- We started with a basic CNN model.
- Then we built improved versions by adding Dropout, Batch Normalization, and other techniques.
- We used Transfer Learning by importing the ResNet50 model and fine-tuning it for our dataset.
- Evaluation and Comparison:
- We evaluated each model using metrics like accuracy and loss.
- Confusion matrices were used to analyze which signs were commonly misclassified.
- Visualizations of training curves and prediction samples helped compare performance.
- Final Output:
- The final model recognized hand signs with high accuracy, demonstrating the effectiveness of deep learning for image classification tasks.
Workflow and Methodologies
The project followed a structured step-by-step process from data handling to model evaluation:
1. Data Loading and Exploration
- Loaded the training and test datasets using Pandas.
- Extracted image pixel values and labels from CSV files.
- Reshaped image data into 28x28 grayscale image arrays.
- Visualized sample images using Matplotlib to understand data distribution.
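A minimal sketch of this loading step, assuming the standard Kaggle filenames `sign_mnist_train.csv` and `sign_mnist_test.csv` (adjust paths to your environment):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Load the CSVs: the first column is the label, the remaining 784
# columns are the flattened 28x28 pixel values.
train_df = pd.read_csv("sign_mnist_train.csv")
test_df = pd.read_csv("sign_mnist_test.csv")

y_train = train_df["label"].values
x_train = train_df.drop(columns=["label"]).values.reshape(-1, 28, 28)

# Plot a few samples to sanity-check the reshape and the labels.
fig, axes = plt.subplots(1, 5, figsize=(10, 2))
for ax, img, lbl in zip(axes, x_train, y_train):
    ax.imshow(img, cmap="gray")
    ax.set_title(f"label {lbl}")
    ax.axis("off")
plt.show()
```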
2. Data Preprocessing
- Normalized pixel values to a 0–1 range by dividing by 255.
- Converted labels to one-hot encoded format for classification.
- Split the training data into training and validation sets to monitor performance.
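A sketch of the preprocessing, continuing from the arrays loaded above. The 20% validation split and the 25-slot one-hot width are illustrative choices (labels run 0–24, with index 9/J never used):

```python
from sklearn.model_selection import train_test_split
from tensorflow.keras.utils import to_categorical

# Scale pixels to [0, 1] and add the channel axis that Conv2D expects.
x = x_train.astype("float32").reshape(-1, 28, 28, 1) / 255.0

# One-hot encode labels; 25 columns cover labels 0-24 (9 = J is unused).
y = to_categorical(y_train, num_classes=25)

# Hold out a validation set to monitor overfitting during training.
x_tr, x_val, y_tr, y_val = train_test_split(
    x, y, test_size=0.2, random_state=42
)
```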
3. Data Augmentation
- Applied transformations like rotation, zoom, shift, and horizontal flip using ImageDataGenerator.
- Augmentation helped increase dataset variety and reduce overfitting.
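One way this augmentation could look with Keras's ImageDataGenerator; the transform ranges below are illustrative, not the project's exact settings:

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Random transforms are applied on the fly to each batch, every epoch.
datagen = ImageDataGenerator(
    rotation_range=10,       # small random rotations (degrees)
    zoom_range=0.1,          # random zoom in/out
    width_shift_range=0.1,   # random horizontal shift
    height_shift_range=0.1,  # random vertical shift
    horizontal_flip=True,    # horizontal flip, as used in this project
)

# train_flow yields augmented (images, labels) batches indefinitely.
train_flow = datagen.flow(x_tr, y_tr, batch_size=64)
```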
4. Model Training
- Trained three CNN models with increasing complexity:
- Model 1: Basic CNN using Conv2D and MaxPooling.
- Model 2: Added Dropout and training callbacks like EarlyStopping and ReduceLROnPlateau.
- Model 3: Introduced BatchNormalization for improved training stability.
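A sketch in the spirit of Model 3, combining Conv2D/MaxPooling blocks with BatchNormalization and Dropout, plus the two callbacks named above; the layer sizes and hyperparameters are assumptions, not the project's exact architecture:

```python
from tensorflow.keras import layers, models
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau

model = models.Sequential([
    layers.Conv2D(32, 3, activation="relu", input_shape=(28, 28, 1)),
    layers.BatchNormalization(),   # stabilizes and speeds up training
    layers.MaxPooling2D(),
    layers.Conv2D(64, 3, activation="relu"),
    layers.BatchNormalization(),
    layers.MaxPooling2D(),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.5),           # regularization against overfitting
    layers.Dense(25, activation="softmax"),  # slots for labels 0-24
])
model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    EarlyStopping(patience=5, restore_best_weights=True),
    ReduceLROnPlateau(factor=0.5, patience=2, min_lr=1e-5),
]
history = model.fit(train_flow,
                    validation_data=(x_val, y_val),
                    epochs=30,
                    callbacks=callbacks)
```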
5. Transfer Learning (Model 4)

- Implemented transfer learning using ResNet50 pre-trained on ImageNet.
- Replaced the top layers with custom layers for the dataset's 24-class classification task.
- Froze the base layers to retain the learned features and fine-tuned only the top layers.
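A sketch of the transfer-learning setup. ResNet50 expects three-channel inputs of at least 32x32 pixels, so the 28x28 grayscale images are upsampled and replicated across channels here; this adapter is a common workaround and an assumption about the project's exact pipeline (the `Resizing` layer requires TF 2.6+):

```python
from tensorflow.keras import layers, models
from tensorflow.keras.applications import ResNet50

inputs = layers.Input(shape=(28, 28, 1))
x = layers.Resizing(32, 32)(inputs)        # upsample to ResNet50's minimum
x = layers.Concatenate()([x, x, x])        # grayscale -> 3 channels

# Pre-trained feature extractor, frozen to retain ImageNet features.
base = ResNet50(weights="imagenet", include_top=False,
                input_shape=(32, 32, 3))
base.trainable = False

x = base(x, training=False)                # keep BatchNorm stats fixed
x = layers.GlobalAveragePooling2D()(x)
x = layers.Dense(128, activation="relu")(x)
outputs = layers.Dense(25, activation="softmax")(x)  # slots for labels 0-24

model4 = models.Model(inputs, outputs)
model4.compile(optimizer="adam",
               loss="categorical_crossentropy",
               metrics=["accuracy"])
```

Freezing the base means only the new head is trained at first; unfreezing some top ResNet blocks afterwards with a low learning rate is the usual fine-tuning follow-up.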
6. Model Evaluation
- Visualized training and validation accuracy/loss over epochs.
- Created confusion matrices to evaluate per-class predictions.
- Displayed actual vs. predicted images to assess real-world performance.
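A sketch of the evaluation step, reusing the `history` and `model` objects from training; the plotting calls and scikit-learn's confusion-matrix helpers are standard:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Training curves from the History object returned by model.fit().
plt.plot(history.history["accuracy"], label="train")
plt.plot(history.history["val_accuracy"], label="validation")
plt.xlabel("epoch")
plt.ylabel("accuracy")
plt.legend()
plt.show()

# Per-class confusion matrix on the held-out validation set.
y_pred = np.argmax(model.predict(x_val), axis=1)
y_true = np.argmax(y_val, axis=1)
cm = confusion_matrix(y_true, y_pred)
ConfusionMatrixDisplay(cm).plot(cmap="Blues")
plt.show()
```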
Data Collection and Preparation
Data Collection
The project used the Sign Language MNIST dataset, which is publicly available on Kaggle. The dataset consists of:
- Training set: 27,455 grayscale images
- Test set: 7,172 grayscale images
- Image size: 28x28 pixels
- Classes: 24 letters, with labels in the 0–25 range (9 = J and 25 = Z are absent because those signs involve motion)
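Because the labels are simply alphabet indices with two gaps, mapping a numeric label back to its letter is a one-line lookup; this small helper is illustrative, not part of the dataset:

```python
import string

def label_to_letter(label: int) -> str:
    """Map a Sign Language MNIST label (0-25, never 9 or 25) to its letter."""
    return string.ascii_uppercase[label]

print(label_to_letter(0), label_to_letter(24))  # A Y
```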
Data Preparation Workflow
- Loaded data from CSV files using Pandas.
- Separated image pixel values and labels.
- Reshaped the flat pixel arrays into 28x28 image matrices.
- Normalized pixel values to improve training consistency.
- Split the training set into training and validation subsets.
- Augmented data to enhance generalization and robustness.