Evaluating machine learning models properly is essential. To build a reliable model, you need to make sure that its performance on unseen data is close to the performance it achieved on the training set.
Usually, we take a dataset and split it into training and test sets. We use the training set to train the model and the test set to evaluate its performance. But this alone is not a reliable approach: in production, the model will come across data quite different from that single test set, so its real-world performance may degrade and our evaluation turns out to have been misleading.
To solve this problem, we can use cross-validation techniques such as k-fold cross-validation. Cross-validation is a statistical method used to compare and evaluate the performance of Machine Learning models.
In this tutorial, we are going to learn the K-fold cross-validation technique and implement it in Python. Let's dive into the tutorial!
While building machine learning models, we randomly split the dataset into training and test sets, with the larger share of the data going into the training set. Even though the test set is small, there is still a chance that it contains informative samples that could have improved the model, and the performance estimate from a single split suffers from high variance. This is where K-fold cross-validation comes in handy. In K-fold cross-validation, the training data is randomly split into K (usually between 5 and 10) subsets known as folds; K-1 folds are used to train the model and the remaining fold is used to test it. Because the training and test folds are chosen repeatedly and at random, this technique reduces the high-variance problem of a single split.
The steps required to perform K-fold cross-validation are given below-
Step 1: Split the entire data randomly into k folds (usually between 5 and 10). A higher number of splits leads to a less biased estimate.
Step 2: Fit the model on k-1 folds and test it on the remaining kth fold. Record the performance metric.
Step 3: Repeat step 2 until every fold has served as the test set.
Step 4: Take the average of all the recorded scores. This serves as the final performance metric of your model.
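To make these steps concrete, here is a minimal sketch of the manual procedure using scikit-learn's KFold splitter. The synthetic data, the classifier, and the fold count are illustrative choices only and are not part of the tutorial's own code.

# A minimal sketch of manual k-fold cross-validation (illustrative choices only)
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples = 400, n_features = 2, n_informative = 2,
                           n_redundant = 0, random_state = 0)

kf = KFold(n_splits = 10, shuffle = True, random_state = 0)   # Step 1: split into k folds
scores = []
for train_index, test_index in kf.split(X):
    # Step 2: fit on k-1 folds, evaluate on the remaining fold, record the metric
    model = SVC(kernel = 'rbf', random_state = 0)
    model.fit(X[train_index], y[train_index])
    scores.append(accuracy_score(y[test_index], model.predict(X[test_index])))

# Step 3 happens implicitly as the loop visits every fold once
print("Fold accuracies:", np.round(scores, 3))
print("Mean accuracy:", np.mean(scores))                      # Step 4: average the scores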
You can download the dataset from here.
This is a classification task. And for this, we will build a Kernel SVM classification model.
First, we will use the conventional method: randomly split the dataset into training and test sets, train the model, and evaluate it on the test set.
Then we will apply the K-fold cross-validation technique to evaluate our model more reliably.
First of all, we need to import some essential libraries.
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now, we will import the dataset and make the feature matrix X and the dependent variable vector y.
# Importing the dataset
dataset = pd.read_csv('Social_Network_Ads.csv')
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Now, we will split the dataset into training and test sets.
# Splitting the dataset into Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
We need to feature-scale our training and test sets for better results.
# Feature Scaling
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Now, we will fit a kernel SVM to our training set and see how it performs on the test set.
# Fitting Kernel SVM to the Training set
from sklearn.svm import SVC
classifier = SVC(kernel = 'rbf', random_state = 0)
classifier.fit(X_train, y_train)

# Predicting the Test set results
y_pred = classifier.predict(X_test)
To calculate the accuracy of our kernel SVM model, we will build the confusion matrix.
# Making the Confusion Matrix
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
Let's see how accurate our model is.
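The notebook output is not reproduced here, but a small snippet like the following (a sketch; accuracy_score is used only for convenience and is not part of the original code) prints the confusion matrix and the corresponding test-set accuracy.

# Inspecting the confusion matrix and the test-set accuracy (illustrative output code)
from sklearn.metrics import accuracy_score
print(cm)                                         # rows: true classes, columns: predicted classes
print("Test accuracy:", accuracy_score(y_test, y_pred))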
From the confusion matrix, we can see that the accuracy of our kernel SVM model is 93%. Now, let's see how we can get a more reliable estimate of this metric using K-fold cross-validation with k = 10 folds.
# Applying k-Fold Cross Validation
from sklearn.model_selection import cross_val_score
accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = 10)
accuracies.mean()
accuracies.std()
Let's see the accuracies for all the folds.
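The original output is not shown here; as a sketch, you can print the array returned by cross_val_score together with its mean and standard deviation (the print statements are additions for illustration):

# Displaying the per-fold accuracies, their mean, and their standard deviation
print("Fold accuracies:", accuracies)
print("Mean accuracy:", accuracies.mean())
print("Standard deviation:", accuracies.std())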
The mean of the fold accuracies is 90% with a standard deviation of 6%. That means the accuracy of our model on unseen data can be expected to lie roughly between 84% and 96%.
Choosing the value of K partly depends on how much CPU power you have or are willing to spend. A lower value of K means less variance in the estimate but more bias, while a higher value means more variance and lower bias.
The computational cost of different values is the other primary concern when choosing K. A higher K requires more computation time and power, and vice versa. Too few folds will not help you find the best-performing model, while too many folds will take much longer to train and evaluate.
So, you need to find a spot where the cost and performance tradeoff reaches an equilibrium. This can be done during hyperparameter tuning.
Finally, the most important consideration is the size of your data: the smaller the dataset, the more the choice of K affects how much data is left for training in each fold, and the larger the dataset, the more the computational cost dominates the decision. Either way, it is worth putting some effort into choosing a suitable value of K.
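As a rough illustration of this tradeoff, you could compare a few candidate values of K on the training data from above; the candidate list below is arbitrary and the snippet is only a sketch, not part of the original tutorial code.

# Comparing a few candidate values of K (the candidate list is arbitrary)
from sklearn.model_selection import cross_val_score
for k in [3, 5, 10]:
    scores = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = k)
    print("k =", k, "mean accuracy =", round(scores.mean(), 3), "std =", round(scores.std(), 3))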
There are several variants of k-fold cross-validation, used for different purposes, and many of them are implemented in the Scikit-Learn library. Some of the most widely used ones are stratified k-fold, repeated k-fold, group k-fold, and leave-one-out cross-validation.
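For instance, stratified k-fold keeps the class proportions roughly equal in every fold, which is helpful for imbalanced classification problems. Below is a minimal sketch using the classifier and training set from above; the fold count and random state are arbitrary choices.

# Stratified k-fold preserves the class distribution in each fold
from sklearn.model_selection import StratifiedKFold, cross_val_score
skf = StratifiedKFold(n_splits = 10, shuffle = True, random_state = 0)
stratified_accuracies = cross_val_score(estimator = classifier, X = X_train, y = y_train, cv = skf)
print("Stratified mean accuracy:", stratified_accuracies.mean())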
K-fold cross-validation has some advantages over other validation techniques, but it has some disadvantages as well. Let's have a look at them.
There is an ongoing debate about which validation method is best for a model; most of the time the answer is k-fold. Let's compare k-fold with other validation methods.
In the holdout method, we split the dataset into train and test sets. The model is trained on the train set and then evaluated on the test set. The method is simple and easy to implement, but it is not the best indicator of your model's performance. Because cross-validation uses multiple train/test splits, it gives you a better indication of how your model will perform on unseen data.
The holdout method comes in handy when you are working with a large dataset or are short on time, since cross-validation incurs more computational cost. Still, you should apply cross-validation instead of the holdout method whenever possible.
In the bootstrapping method, the dataset is resampled at random with replacement to create several datasets, so that the model can be evaluated on a wide range of data samples. Each resampled set is used for training, and the samples that were not chosen (the 'out-of-bag' samples) are used as test cases. The average accuracy score is then taken as the estimate of the model's performance.
In essence, the bootstrap method can be seen more as a bias/variance estimation tool than a validation technique. It is useful in ensemble methods such as random forests because it can create multiple datasets from the original data. We can use each bootstrap dataset to build a single model (e.g. a decision tree), combine all of them into an ensemble, and take the majority vote of these single models as the final prediction.
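To make the idea concrete, here is a sketch of a simple out-of-bag bootstrap evaluation built on scikit-learn's resample and clone utilities. The number of iterations and the reuse of the classifier and training set from above are arbitrary choices for illustration, not the tutorial's own code.

# A simple bootstrap evaluation: train on a resampled set, test on the out-of-bag samples
import numpy as np
from sklearn.base import clone
from sklearn.utils import resample
from sklearn.metrics import accuracy_score

n_iterations = 100
indices = np.arange(len(X_train))
oob_scores = []
for i in range(n_iterations):
    # Sample indices with replacement; the samples that were not chosen form the out-of-bag test set
    boot_idx = resample(indices, replace = True, n_samples = len(indices), random_state = i)
    oob_idx = np.setdiff1d(indices, boot_idx)
    model = clone(classifier)
    model.fit(X_train[boot_idx], y_train[boot_idx])
    oob_scores.append(accuracy_score(y_train[oob_idx], model.predict(X_train[oob_idx])))
print("Bootstrap (out-of-bag) mean accuracy:", np.mean(oob_scores))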
On the other hand, k fold cross-validation splits the original data set to rigorously train and test the model.
In this sense, bootstrapping is not an evaluation technique in the same way that k-fold cross-validation is. We still need k-fold to evaluate the model's performance; used alone, the bootstrap method is a weaker evaluation technique.
In this tutorial, I tried to explain all the important aspects of k-fold cross-validation. In summary, the key takeaways are that k-fold cross-validation gives a more reliable performance estimate than a single train/test split, that the choice of K is a tradeoff between bias, variance, and computational cost, and that scikit-learn's cross_val_score makes the technique straightforward to apply.
I hope the tutorial has helped you understand these concepts well. Do you have any questions about the tutorial? Please let me know in the comments. You can also give feedback to improve the tutorial; I will gladly accept any new idea to make things better.
Happy Machine Learning!