Vegetable classification with Parallel CNN model

The Vegetable Classification project shows how CNNs can sort vegetables efficiently. As industries like agriculture and food retail grow, automating vegetable identification becomes crucial. This project provides a guide to building both custom and parallel CNN models. These models help improve classification accuracy, reduce manual effort, and enhance quality control.

By learning these methods, you can solve harder classification problems in many areas. The project shows how machine learning helps with smart farming, inventory, and sustainability. It highlights how automating vegetable sorting can benefit real-world tasks.

Project Overview

This project builds a parallel CNN model that classifies vegetable types. The model analyzes vegetable images and learns the key features that set each type apart. It uses two branches of convolutional layers that merge into a dense network for classification. We also train a traditional CNN model and compare its performance with that of the parallel CNN. We check the results to ensure vegetables are sorted correctly into their categories.
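
The two-branch idea described above can be sketched with the Keras functional API. This is only a minimal illustration, not the architecture built later in this project: the function name `build_parallel_cnn_sketch`, the filter counts, the kernel sizes, the dropout rate, and the class count passed at the bottom are all assumptions chosen for brevity.

```python
from tensorflow.keras import layers, Model


def build_parallel_cnn_sketch(input_shape, num_classes):
    """Hypothetical two-branch CNN; every layer size here is an assumption."""
    inputs = layers.Input(shape=input_shape)
    x = layers.Rescaling(1.0 / 255)(inputs)  # scale pixels from 0-255 to 0-1

    # Branch A: 3x3 kernels
    a = layers.Conv2D(16, 3, padding="same", activation="relu")(x)
    a = layers.MaxPooling2D()(a)
    a = layers.Conv2D(32, 3, padding="same", activation="relu")(a)
    a = layers.GlobalAveragePooling2D()(a)

    # Branch B: 5x5 kernels, so it sees a wider neighborhood per filter
    b = layers.Conv2D(16, 5, padding="same", activation="relu")(x)
    b = layers.MaxPooling2D()(b)
    b = layers.Conv2D(32, 5, padding="same", activation="relu")(b)
    b = layers.GlobalAveragePooling2D()(b)

    # Merge both feature vectors and classify with a small dense head
    merged = layers.Concatenate()([a, b])
    x = layers.Dense(64, activation="relu")(merged)
    x = layers.Dropout(0.3)(x)
    outputs = layers.Dense(num_classes, activation="softmax")(x)
    return Model(inputs, outputs, name="parallel_cnn_sketch")


# num_classes=15 is a placeholder; in practice pass len(class_names).
model = build_parallel_cnn_sketch(input_shape=(224, 224, 3), num_classes=15)
model.summary()
```

Both branches read the same input, so the only difference between them in this sketch is the kernel size. The concatenation step is what lets the dense layers use features from both branches at once.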

This tutorial covers data preparation, model evaluation, and visualization. It also includes troubleshooting steps. Along with vegetable sorting, it teaches key machine learning ideas. These include data cleaning, augmentation, and model improvement. These concepts apply to broader classification tasks as well.

Prerequisites

Before starting, you should know the basics of Python and machine learning. You should also understand deep learning concepts. Specifically, knowledge of convolutional neural networks (CNNs) will be essential. Familiarity with TensorFlow and Keras will make the implementation easier to follow. Experience with datasets and processing image data using Pandas and NumPy is useful.

You need access to Google Colab or Jupyter Notebook to run the code and handle computations. Basic knowledge of metrics like accuracy, precision, recall, and AUC is also helpful for measuring the model's performance.

Approach

In this project, we use a machine-learning method with parallel CNN models. The traditional CNN model is our benchmark. The parallel CNN uses multiple branches of layers for deeper feature extraction. We train both models on the same vegetable image dataset. We focus on testing how well the models generalize using validation and test sets.

We use image preprocessing methods like resizing, normalization, and data augmentation. These steps help with training and reduce overfitting. We also use callbacks like early stopping and learning rate reduction, which cut wasted training time and help the model converge. We show the results with a confusion matrix, classification report, and training metrics.
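
To make the evaluation outputs concrete, here is a small NumPy sketch of how a confusion matrix yields accuracy and per-class precision and recall. The function name `per_class_precision_recall` and the toy labels are assumptions for illustration. In the project itself, scikit-learn's `confusion_matrix` and `classification_report` are the tools imported for this job. AUC is not covered here because it needs predicted probabilities rather than hard labels.

```python
import numpy as np


def per_class_precision_recall(y_true, y_pred, num_classes):
    """Build a confusion matrix and derive accuracy, precision, and recall."""
    cm = np.zeros((num_classes, num_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1  # rows are true classes, columns are predicted classes

    true_positives = np.diag(cm)
    predicted_totals = cm.sum(axis=0)  # how often each class was predicted
    actual_totals = cm.sum(axis=1)     # how often each class really occurs

    # Guard against division by zero for classes never predicted or never seen
    precision = np.divide(true_positives, predicted_totals,
                          out=np.zeros(num_classes), where=predicted_totals > 0)
    recall = np.divide(true_positives, actual_totals,
                       out=np.zeros(num_classes), where=actual_totals > 0)
    accuracy = true_positives.sum() / cm.sum()
    return cm, precision, recall, accuracy


# Toy example with three classes (made-up labels, not project results)
cm, precision, recall, accuracy = per_class_precision_recall(
    y_true=[0, 0, 1, 1, 2, 2], y_pred=[0, 1, 1, 1, 2, 0], num_classes=3)
print(cm)
print(precision, recall, accuracy)
```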

Workflow and Methodology

The overall workflow of this project includes:

Data Collection: Gathering images of different vegetable types for classification.
Data Preprocessing: Resizing, normalizing, and augmenting images to prepare for training.
Model Design: Building both the custom CNN model and the parallel CNN model.
Training: Training the models with training data and evaluating them using validation data.
Evaluation: Testing the models on unseen test data to check classification accuracy.
Visualization: Plotting metrics like accuracy, precision, recall, AUC, and confusion matrix.
Optimization: Using callbacks like early stopping and learning rate reduction to improve training.

The methodology involves:

Data Preprocessing: We resize and normalize raw images. This converts them into tensors for CNN input. This ensures the model processes consistent image sizes.
CNN Architecture: We design convolutional layers for traditional and parallel CNN models. These layers extract important features for classification.
Metrics: We evaluate model performance using accuracy, precision, recall, and AUC. This ensures a thorough performance check.
Callbacks: We use callbacks like early stopping and learning rate reduction. These help prevent overfitting and improve training efficiency.
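
The callback behavior in the last item can be illustrated with a plain-Python simulation of the patience logic. This is a simplified sketch of the idea behind `EarlyStopping` and `ReduceLROnPlateau`, not the Keras implementation. The function name `simulate_callbacks`, its default values, and the loss values are assumptions.

```python
def simulate_callbacks(val_losses, stop_patience=3, lr_patience=2,
                       lr=1e-3, factor=0.5, min_lr=1e-5):
    """Replay a list of validation losses through simplified callback rules."""
    best = float("inf")
    stop_wait = 0  # epochs since the last improvement (early stopping)
    lr_wait = 0    # epochs since the last improvement or LR reduction
    history = []

    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best = loss
            stop_wait = 0
            lr_wait = 0
        else:
            stop_wait += 1
            lr_wait += 1
            if lr_wait >= lr_patience:
                lr = max(lr * factor, min_lr)  # shrink the learning rate
                lr_wait = 0
        history.append((epoch, loss, lr))
        if stop_wait >= stop_patience:
            break  # no improvement for stop_patience epochs: stop training
    return best, history


# Made-up validation losses that plateau after epoch 2
best, history = simulate_callbacks([1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74, 0.6])
for epoch, loss, lr in history:
    print(f"epoch {epoch}: val_loss={loss:.2f} lr={lr:.5f}")
```

In this run the learning rate is halved once the loss fails to improve for two epochs, and training stops after three epochs without improvement, so the later loss of 0.6 is never reached. That trade-off is why the patience values need tuning.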

Data Collection

The dataset contains images of vegetables such as tomato, cucumber, bean, bitter gourd, brinjal, broccoli, cabbage, capsicum, carrot, cauliflower, papaya, potato, pumpkin, and radish.

We divided the dataset into three parts: training, validation, and test sets.

The images are stored in separate directories for easy model training and evaluation. You can upload the dataset to Google Drive and mount it in Google Colab for quick access during the project.
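
Before training, it can help to confirm that the folders look the way the loader expects. The sketch below assumes one folder per split (such as `train`, `validation`, and `test`) with one subfolder per vegetable class inside it, which is the layout `image_dataset_from_directory` uses to infer labels. The helper name `count_images_per_class` and the list of file suffixes are assumptions.

```python
from pathlib import Path

# Assumed set of image file extensions; adjust to match your files
IMAGE_SUFFIXES = {".jpg", ".jpeg", ".png"}


def count_images_per_class(root):
    """Return {split: {class_name: image_count}} for a split/class folder tree."""
    root = Path(root)
    summary = {}
    for split_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        summary[split_dir.name] = {
            class_dir.name: sum(
                1 for f in class_dir.iterdir()
                if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
            )
            for class_dir in sorted(split_dir.iterdir())
            if class_dir.is_dir()
        }
    return summary


# Example call once Google Drive is mounted in Colab:
# print(count_images_per_class("/content/drive/MyDrive/Vegetable Images"))
```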

Data Preparation

The images are first resized to 224x224 pixels to ensure uniformity in the dataset. Next, we apply normalization to scale pixel values between 0 and 1, which speeds up model training. We also use data augmentation methods like flipping, rotation, and zooming to create more diverse training data. This reduces overfitting and improves the model's ability to generalize.

Data Preparation Workflow

Image Resizing: We make sure all images are 224x224 pixels. This keeps the dimensions uniform.
Normalization: We scale pixel values from 0-255 to 0-1. This improves training efficiency.
Augmentation: We use data augmentation methods like flipping, rotation, and zooming. This creates more diverse training data.
Dataset Splitting: We split the dataset into training, validation, and test sets. This helps us develop and evaluate the model properly.
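
The normalization and augmentation steps above can be sketched in NumPy on a single image array. This is an illustration under simplifying assumptions: the helper names are hypothetical, rotation is limited to multiples of 90 degrees, and zooming is left out. In a Keras pipeline these steps are more commonly handled by preprocessing layers such as `Rescaling`, `RandomFlip`, `RandomRotation`, and `RandomZoom`.

```python
import numpy as np


def normalize(image_uint8):
    """Scale pixel values from the 0-255 range to the 0-1 range."""
    return image_uint8.astype(np.float32) / 255.0


def random_flip_and_rotate(image, rng):
    """Randomly flip an image horizontally and rotate it by a multiple of 90 degrees."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]  # horizontal flip: reverse the width axis
    k = int(rng.integers(0, 4))    # 0, 1, 2, or 3 quarter turns
    return np.rot90(image, k=k, axes=(0, 1))


rng = np.random.default_rng(42)
# Stand-in for a real 224x224 RGB photo
fake_image = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
augmented = random_flip_and_rotate(normalize(fake_image), rng)
print(augmented.shape, augmented.dtype)
```

Because the image is square, flips and quarter-turn rotations keep the 224x224 shape, so the augmented output can feed the same model input.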

Code Explanation

STEP 1:

You can mount your Google Drive in a Google Colab notebook with this piece of code. This makes it easy to view files saved in Google Drive. In Colab, you can change and analyze data. You can also train models.

from google.colab import drive
drive.mount('/content/drive')

Import the necessary packages.

This block of code imports the tools and layers needed to build, train, and evaluate a CNN model for image classification using TensorFlow and Keras.

# Standard library
import math
import random
import time

# Data handling and plotting
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import plotly.express as px
import seaborn as sns

# Modeling and evaluation (all Keras imports go through tensorflow.keras)
import tensorflow as tf
from sklearn.metrics import classification_report, confusion_matrix
from tensorflow import keras
from tensorflow.keras import layers
from tensorflow.keras.callbacks import EarlyStopping, ReduceLROnPlateau
from tensorflow.keras.layers import (BatchNormalization, Conv2D, Dense, Dropout,
                                     Flatten, GaussianNoise, Input, MaxPooling2D)
from tensorflow.keras.metrics import AUC, Precision, Recall, SparseCategoricalAccuracy
from tensorflow.keras.models import Model, Sequential

# Notebook magic: render Matplotlib figures inline (Colab/Jupyter only)
%matplotlib inline

Check GPU availability

This code checks for available GPUs and enables memory growth, so TensorFlow allocates GPU memory as needed instead of reserving it all at startup. The last line displays the number of GPUs detected.

gpus = tf.config.experimental.list_physical_devices('GPU')
for gpu in gpus:
    tf.config.experimental.set_memory_growth(gpu, True)
len(gpus)

STEP 2:

Data processing

This block of code initializes variables for a TensorFlow data pipeline. It sets the batch size and image dimensions. It also defines a random seed and the path to the dataset. Additionally, it enables automatic tuning for better data-loading performance.

batch_size = 32
img_height = 224
img_width = 224
seed = 42
PATH = "/content/drive/MyDrive/Vegetable Images"
AUTOTUNE = tf.data.AUTOTUNE

This code creates a training dataset by loading images from `{PATH}/train`. It resizes the images to specific dimensions. The `image_dataset_from_directory` function batches and shuffles the dataset. A seed is used for randomization.

train_data = tf.keras.utils.image_dataset_from_directory(
    f"{PATH}/train",
    seed=seed,
    image_size=(img_height, img_width),
    batch_size=batch_size,
    shuffle=True
)

This code creates a testing dataset from `{PATH}/test`. The `image_dataset_from_directory` function resizes and batches the images. It keeps the original order (shuffle=False) so that predictions can be matched to their labels during evaluation.

test_data = tf.keras.utils.image_dataset_from_directory(
  f"{PATH}/test",
  seed=seed,
  image_size=(img_height, img_width),
  batch_size=batch_size,
  shuffle=False
)

This code creates a validation dataset from the directory `{PATH}/validation`. The `image_dataset_from_directory` function resizes and batches the images. It shuffles them (shuffle=True) using the seed, so the order is random but reproducible.

val_data = tf.keras.utils.image_dataset_from_directory(
  f"{PATH}/validation",
  seed=seed,
  image_size=(img_height, img_width),
  batch_size=batch_size,
  shuffle=True
)

Show class names

This code gets and shows the list of class names (labels) in the training dataset (`train_data`). These class names match the different vegetable categories in your dataset.

class_names = train_data.class_names
class_names

Counting Unique Classes in the Dataset

This code calculates the total number of unique classes in the dataset. It does this by determining the length of the class_names list. Then, it displays the result. This gives you the number of different vegetable types that the model will classify.

num_classes = len(class_names)
num_classes

STEP 3:

Data Analysis and Visualization

Plotting distributions

This plot_label_distribution function visualizes label distribution in a dataset using Plotly. It combines all labels into a single array, maps them to class names, and creates a DataFrame. Then, it uses Plotly's px.histogram to create a count plot. Each label is shown with distinct colors. It adds titles and labels to the plot. Finally, it displays the plot.

def plot_label_distribution(dataset, class_names, num_classes):
    labels = np.concatenate([batch[1] for batch in dataset], axis=0)
    df = pd.DataFrame(labels, columns=['Labels'])
    df['Labels'] = df['Labels'].map({i: class_names[i] for i in range(num_classes)})
    # Using Plotly to create a count plot with different colors for different labels
    fig = px.histogram(df, x='Labels', color='Labels', title='Label Distribution',
                       labels={'Labels': 'Label'},
                       category_orders={"Labels": class_names},
                       color_discrete_sequence=px.colors.qualitative.Safe)
    fig.update_layout(xaxis_title='Labels', yaxis_title='Count',
                      title_x=0.5, showlegend=False)
    fig.show()

Show plot of train data distribution

This block of code calls the plot_label_distribution function. The function creates a visual plot to show the distribution of labels. It shows how many images belong to each vegetable class in the train_data dataset. The plot helps you check if the dataset is balanced. It also highlights any classes with significantly more or fewer images. This information is important for model training.

plot_label_distribution(train_data, class_names, num_classes)

Show plot of validation data distribution

This code creates a visual plot of label distribution for the validation dataset. It uses the `plot_label_distribution` function. It shows the distribution of images across vegetable classes in the validation set. This helps ensure that the validation data is balanced. It also checks if the validation data is representative of the training data.

plot_label_distribution(val_data, class_names, num_classes)

Show plot of test data distribution

This code creates a visual plot of label distribution for the test dataset. It uses the `plot_label_distribution` function. It shows how images are spread across vegetable classes in the test set. This allows you to assess the balance of the test data. It helps check if the test data is suitable for evaluating the model's performance.

plot_label_distribution(test_data, class_names, num_classes)
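
To complement the plots, the same label counts can be summarized numerically. This is a small sketch with hypothetical helper names (`label_counts`, `imbalance_ratio`) and toy inputs. In the notebook, the labels could come from the same `np.concatenate([batch[1] for batch in dataset])` expression used inside `plot_label_distribution`.

```python
from collections import Counter


def label_counts(labels, class_names):
    """Map each class name to the number of samples carrying its label index."""
    counts = Counter(int(label) for label in labels)
    return {name: counts.get(i, 0) for i, name in enumerate(class_names)}


def imbalance_ratio(counts):
    """Ratio of the largest to the smallest non-empty class (1.0 means balanced)."""
    values = [v for v in counts.values() if v > 0]
    return max(values) / min(values)


# Toy example with made-up labels, not the real dataset
toy_counts = label_counts([0, 1, 1, 2, 2, 2], ["Bean", "Carrot", "Potato"])
print(toy_counts, imbalance_ratio(toy_counts))
```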

STEP 4:

Prepare dataset for training and evaluation

This block of code rebuilds the input pipelines for train_data, test_data, and val_data by unbatching them first. For `train_data` and `val_data`, it caches, shuffles, and re-batches the images with the set batch size. `test_data` is unbatched and then re-batched without caching or shuffling. All three pipelines prefetch batches, so data loading overlaps with training and evaluation.

train_data = train_data.unbatch().cache().shuffle(2000).batch(batch_size, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
test_data = test_data.unbatch().batch(batch_size, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
val_data = val_data.unbatch().cache().shuffle(1000).batch(batch_size, drop_remainder=True).prefetch(buffer_size=AUTOTUNE)
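
One side effect of `drop_remainder=True` is worth checking: any final partial batch is discarded, so a few images are skipped in each pass, including in the test set. The arithmetic can be sketched as follows. The helper name `batching_summary` and the sample count of 3000 are assumptions, not figures from this dataset.

```python
def batching_summary(num_samples, batch_size, drop_remainder=True):
    """Count the batches produced and the samples discarded when batching."""
    full_batches, leftover = divmod(num_samples, batch_size)
    if drop_remainder:
        return {"batches": full_batches,
                "used": full_batches * batch_size,
                "dropped": leftover}
    # Without drop_remainder, the leftover samples form one smaller final batch
    return {"batches": full_batches + (1 if leftover else 0),
            "used": num_samples,
            "dropped": 0}


# Hypothetical split of 3000 images with the tutorial's batch size of 32
print(batching_summary(3000, 32))
print(batching_summary(3000, 32, drop_remainder=False))
```

If every test image must be scored, one option is to batch the test set with `drop_remainder=False`.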

Plotting example images

The function `closestDivisors` finds the two divisors of a number n that are closest to each other. It starts from the rounded square root of n and decrements until it finds a divisor a. Then, it returns the pair a and n // a.

def closestDivisors(n):
    a = round(math.sqrt(n))
    while n % a > 0:
        a -= 1
    return a, n // a
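
As a quick check of the idea, here is an equivalent sketch that uses `math.isqrt` (integer square root) in place of rounding a float. The name `closest_divisors_isqrt` is hypothetical. It is shown only to make the behavior easy to verify on a few inputs, including the batch size of 32 used in this tutorial.

```python
import math


def closest_divisors_isqrt(n):
    """Return the divisor pair (a, b) of n with a <= b and a as close to b as possible."""
    a = math.isqrt(n)  # largest integer whose square does not exceed n
    while n % a > 0:
        a -= 1         # step down until a divides n (a = 1 always does)
    return a, n // a


print(closest_divisors_isqrt(32))  # (4, 8): a 4x8 grid for a batch of 32 images
print(closest_divisors_isqrt(36))  # (6, 6): perfect squares give a square grid
print(closest_divisors_isqrt(7))   # (1, 7): primes fall back to a single row
```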

The `plot_images` function shows a batch of images from a dataset. It defaults to `train_data`. It calculates the grid layout by finding the closest divisors of the batch size for rows and columns. Then, it displays the images in a grid. It sets each subplot title to the corresponding class name. It hides the axes to create a cleaner view.

def plot_images(data=train_data):
    n_rows, n_cols = closestDivisors(batch_size)
    plt.figure(figsize=(n_cols*2, int(n_rows*1.8)))
    for images, labels in data.take(1).cache():  # take(1) yields the first batch; it is random only if the dataset is shuffled
        for i in range(n_rows*n_cols):
            ax = plt.subplot(n_rows, n_cols, i + 1)
            plt.imshow(images[i].numpy().astype("uint8"))  # pixel values are in 0-255, so uint8 is enough
            plt.title(class_names[labels[i]])
            plt.axis("off")