In this tutorial, we are going to learn the K-fold cross-validation technique and implement it in Python.
While building machine learning models, we randomly split the dataset into training and test sets, with the larger share of the data going to the training set. Even though the test set is small, there is still a chance that it contains important data that might have improved the model, and a performance estimate based on a single split can suffer from high variance. To solve these problems, we use K-fold cross-validation.
Cross-validation is a technique for evaluating machine learning models by resampling the data, so that every observation is used for both training and validation.
In K-fold cross-validation, the training set is randomly split into K subsets (usually between 5 and 10) known as folds. K-1 folds are used to train the model and the remaining fold is used to test it, and this is repeated so that each fold serves as the test set once. Because the training and test folds are randomly selected, this technique reduces the high-variance problem of a single train/test split.
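To make the fold structure concrete, here is a minimal sketch of how scikit-learn's KFold partitions a toy array of six samples into k = 3 folds (the data here is illustrative, not the tutorial's dataset):

```python
import numpy as np
from sklearn.model_selection import KFold

data = np.arange(6)  # six sample indices: 0..5
kf = KFold(n_splits=3, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kf.split(data)):
    # each fold holds out 2 samples for testing and trains on the other 4
    print(f"Fold {fold}: train={data[train_idx]}, test={data[test_idx]}")
```

Note that every sample lands in the test set of exactly one fold, which is what lets each observation contribute to both training and evaluation.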
The steps required to perform K-fold cross-validation are given below:
Step 1: Split the entire dataset randomly into k folds (usually between 5 and 10). A higher number of splits generally leads to a less biased estimate.
Step 2: Fit the model with k-1 folds and test it on the remaining kth fold. Record the performance metric.
Step 3: Repeat step 2 until every fold has served as the test set.
Step 4: Take the average of all the recorded scores. This will serve as the final performance metric of your model.
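The four steps above can be sketched as a manual loop. This is only an illustration; the synthetic dataset and the logistic regression estimator are placeholders, not the tutorial's own data or model:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold

# placeholder data; the tutorial's own dataset is loaded later
X, y = make_classification(n_samples=200, random_state=0)

kf = KFold(n_splits=5, shuffle=True, random_state=0)  # Step 1: split into k folds
scores = []
for train_idx, test_idx in kf.split(X):
    model = LogisticRegression()
    model.fit(X[train_idx], y[train_idx])                  # Step 2: fit on k-1 folds
    scores.append(model.score(X[test_idx], y[test_idx]))   # record the metric

# Step 3 is the loop itself; Step 4: average the recorded scores
mean_score = np.mean(scores)
print(mean_score)
```

In practice scikit-learn's cross_val_score wraps this whole loop in a single call, as we will see below.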
K-fold cross-validation in Python: Now we will implement this technique to validate our machine learning model. For this task, we will use the "Social_Network_Ads.csv" dataset and apply K-fold cross-validation to get a more reliable performance estimate for our Kernel SVM classification model.
You can download the dataset from here.
First of all, we need to import some essential libraries.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now, we will import the dataset and make the feature matrix X and the dependent variable vector y.
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Now, we will split the dataset into training and test sets.
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
We need to feature-scale our training and test sets for an improved result.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, we will fit Kernel SVM to our training set and predict how it performs on the test set.
# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
To calculate the accuracy of our Kernel SVM model, we will build the confusion matrix.
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Let's see how accurate our model is.
From the above matrix, we can see that the accuracy of our Kernel SVM model is 93%.
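The accuracy can be read off the confusion matrix as the fraction of correctly classified samples, i.e. the diagonal divided by the total. A small sketch, using hypothetical 2x2 counts chosen only to match the 93% figure (the actual matrix comes from the code above):

```python
import numpy as np

# hypothetical confusion matrix: rows are true classes, columns are predictions
cm = np.array([[64, 4],
               [3, 29]])

# accuracy = correctly classified samples (the diagonal) / all samples
accuracy = np.trace(cm) / cm.sum()
print(accuracy)  # 0.93
```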
Now, let's see how we can get a more reliable performance metric for our model using K-fold cross-validation with k = 10 folds.
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
print(accuracies.mean())
print(accuracies.std())
Let's see the accuracies for all the folds.
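As a standalone sketch of what inspecting the per-fold scores looks like, here is the same pattern on a synthetic dataset with an SVC as a stand-in for the tutorial's classifier and X_train/y_train:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# stand-in data; in the tutorial, X_train/y_train come from Social_Network_Ads.csv
X, y = make_classification(n_samples=300, random_state=0)

accuracies = cross_val_score(estimator=SVC(kernel='rbf', random_state=0),
                             X=X, y=y, cv=10)

# one accuracy per fold, then the summary statistics
print(accuracies)
print(f"mean = {accuracies.mean():.2f}, std = {accuracies.std():.2f}")
```

Each entry of the array is the accuracy on one held-out fold, so the spread of these ten numbers is what the standard deviation summarizes.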
The mean of the fold accuracies is 90% with a standard deviation of 6%. That means our model's accuracy typically falls in a range of roughly 84% to 96%, which is a more honest estimate of its performance than the single 93% test-set figure.