In this tutorial, we will learn the Support Vector Machine algorithm and implement it in Python.
Support Vector Machine: A Support Vector Machine (SVM) is a discriminative classifier that finds the optimal hyperplane separating the data points in an N-dimensional space (N is the number of features). In a two-dimensional space, the hyperplane is a line that divides the data points into two classes.
How the Algorithm Works:
Let's say you need to classify two different classes of data points in a two-dimensional space. Look at the following illustration.
Here we see two classes of data points, one in red and the other in green. Now, what can we do to separate these classes? We can simply draw a line between them. Many different lines in the plane would do the job.
Here, any of these lines can separate the classes. Our task, however, is to find the optimal line, the one that classifies the data points most reliably. This is what the Support Vector Machine does: it finds the optimal line or hyperplane by choosing the one with the maximum margin (i.e. the largest distance between the line and the nearest data points of each class).
The support vectors are the data points that lie closest to the decision boundary and therefore determine its position (they sit on the edge of the margin). That's why this algorithm is called a support vector machine.
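To make the margin idea concrete, here is a minimal sketch on a tiny hand-made dataset. The six points and the very large C value are assumptions chosen for illustration; they are not part of the tutorial's dataset. The sketch fits a linear SVC, reads back the support vectors, and computes the margin width as 2 / ||w||, where w is the normal vector of the separating line.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical toy data: two well-separated groups of points in 2D.
X_toy = np.array([[0, 0], [0, 1], [1, 1], [3, 3], [3, 4], [4, 4]], dtype=float)
y_toy = np.array([0, 0, 0, 1, 1, 1])

# A very large C approximates a hard margin (assumption for this sketch).
toy_clf = SVC(kernel='linear', C=1e6)
toy_clf.fit(X_toy, y_toy)

w = toy_clf.coef_[0]                   # normal vector of the separating line
margin_width = 2 / np.linalg.norm(w)   # distance between the two margin lines

print('Support vectors:', toy_clf.support_vectors_)
print('Margin width:', margin_width)
```

On data like this, the points reported as support vectors should be the ones from each class that lie nearest to the other class, which is what the description above suggests.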
Note: In a higher-dimensional space (more than two dimensions), each data point is represented as a vector of feature values, and the decision boundary is a hyperplane rather than a line.
This is one of the simplest yet most powerful algorithms for solving classification problems.
SVM in Python: Now we will implement this algorithm in Python. For this task, we will use the dataset Social_Network_Ads.csv. Let's have a glimpse of that dataset.
This dataset contains the buying decisions of customers along with their gender, age, and salary. Using SVM, we will train a classifier on this dataset to predict the decision for unknown data points.
You can download the whole dataset from here.
First of all, we need to import the essential libraries to our program.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Now, let's import the dataset.
dataset = pd.read_csv('Social_Network_Ads.csv')
In the dataset, the Age and EstimatedSalary columns are independent and the Purchased column is dependent. So we will take both the Age and EstimatedSalary in our feature matrix and the Purchased column in the dependent variable vector.
X = dataset.iloc[:, [2, 3]].values
y = dataset.iloc[:, 4].values
Now, we will split our dataset into training and test sets.
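The positional indices above depend on the column order of the CSV file. As a hedged alternative, the sketch below selects the same columns by name. The four rows are made up for illustration, and the 'User ID' and 'Gender' column names are assumptions about the file layout; only Age, EstimatedSalary, and Purchased are named in this tutorial.

```python
import pandas as pd

# Hypothetical rows mimicking the layout of Social_Network_Ads.csv.
# The 'User ID' and 'Gender' column names are assumptions.
sample = pd.DataFrame({
    'User ID': [1, 2, 3, 4],
    'Gender': ['Male', 'Female', 'Female', 'Male'],
    'Age': [19, 35, 26, 47],
    'EstimatedSalary': [19000, 20000, 43000, 25000],
    'Purchased': [0, 0, 0, 1],
})

# Selecting by name does not depend on the column order.
X_named = sample[['Age', 'EstimatedSalary']].values
y_named = sample['Purchased'].values

# Positional selection, as used in the tutorial, for comparison.
X_pos = sample.iloc[:, [2, 3]].values
y_pos = sample.iloc[:, 4].values

print(X_named.shape, y_named.shape)
```

Both approaches give the same arrays here; selecting by name is simply more robust if the file gains or loses a column.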
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.25, random_state = 0)
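To see what test_size and random_state do, here is a small sketch on made-up data (20 samples, an assumption for illustration). With test_size = 0.25, a quarter of the rows go to the test set, and fixing random_state makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 20 samples with 2 features each, alternating labels.
X_demo = np.arange(40).reshape(20, 2)
y_demo = np.array([0, 1] * 10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

# Repeating the call with the same seed returns the same split.
X_tr2, X_te2, y_tr2, y_te2 = train_test_split(
    X_demo, y_demo, test_size=0.25, random_state=0)

print(X_tr.shape, X_te.shape)
print(np.array_equal(X_te, X_te2))
```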
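Next, we scale the features. Age and EstimatedSalary are measured on very different scales, and SVM relies on distances between data points, so without scaling the salary column would dominate. The scaler is fitted on the training set only and then applied to the test set.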
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
X_train = sc.fit_transform(X_train)
X_test = sc.transform(X_test)
Well, it's time to fit the SVM algorithm to our training set. For this, we use the SVC class from the scikit-learn library.
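The sketch below illustrates what the scaler does, using a few made-up age and salary values (assumptions for illustration). After fit_transform, each training column has mean 0 and standard deviation 1, and transform reuses the training statistics on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical age/salary values on very different scales.
train = np.array([[20, 20000], [30, 40000], [40, 60000], [50, 80000]], dtype=float)
test = np.array([[35, 50000]], dtype=float)

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train only
test_scaled = scaler.transform(test)        # reuses the training mean and std

print(train_scaled.mean(axis=0))  # per-column mean of the scaled training data
print(train_scaled.std(axis=0))   # per-column standard deviation
print(test_scaled)
```

The test row here was chosen to equal the training mean, so it should map to the origin after scaling.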
from sklearn.svm import SVC
classifier = SVC(kernel = 'linear', random_state = 0)
classifier.fit(X_train, y_train)
Note: Here kernel specifies the kernel function the SVM uses, which determines the shape of the decision boundary. You will learn about it in detail in our Kernel SVM tutorial. For simplicity, here we choose the linear kernel, which produces a straight-line boundary.
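With a linear kernel, the fitted model is just a weight vector w and an intercept b, and a point x is classified by the sign of w · x + b. The following sketch checks this on a few made-up, already-scaled points (the data is an assumption for illustration); coef_, intercept_, and decision_function are attributes and methods of scikit-learn's SVC.

```python
import numpy as np
from sklearn.svm import SVC

# Hypothetical, already-scaled 2D points from two classes.
X_lin = np.array([[-2.0, -1.0], [-1.0, -2.0], [-1.5, -0.5],
                  [1.0, 2.0], [2.0, 1.0], [0.5, 1.5]])
y_lin = np.array([0, 0, 0, 1, 1, 1])

lin_clf = SVC(kernel='linear', random_state=0).fit(X_lin, y_lin)

# For a linear kernel, the score of each point is w . x + b.
manual_scores = X_lin @ lin_clf.coef_[0] + lin_clf.intercept_[0]
library_scores = lin_clf.decision_function(X_lin)

# Positive scores map to the second class in classes_, negative to the first.
manual_pred = lin_clf.classes_[(manual_scores > 0).astype(int)]

print(np.allclose(manual_scores, library_scores))
print(manual_pred)
```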
Our model is ready. Now, let's see how it predicts for our test set.
y_pred = classifier.predict(X_test)
To see how good our SVM model is, let's compare its predictions with the true labels using the confusion matrix.
from sklearn.metrics import confusion_matrix
cm = confusion_matrix(y_test, y_pred)
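As a quick illustration of how to read a confusion matrix, the sketch below uses made-up labels (an assumption, not the tutorial's results). Rows correspond to the true classes and columns to the predicted classes, so the diagonal holds the correct predictions and accuracy is the diagonal sum divided by the total.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score

# Hypothetical true and predicted labels.
true_labels = np.array([0, 0, 0, 0, 1, 1, 1, 1])
pred_labels = np.array([0, 0, 0, 1, 1, 1, 0, 1])

demo_cm = confusion_matrix(true_labels, pred_labels)

# For two classes, ravel() yields: true neg, false pos, false neg, true pos.
tn, fp, fn, tp = demo_cm.ravel()

# Accuracy is the share of correct predictions (the diagonal of the matrix).
accuracy = (tn + tp) / demo_cm.sum()

print(demo_cm)
print(accuracy, accuracy_score(true_labels, pred_labels))
```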
The output of the confusion matrix will be
Now, let's visualize our test set result.
# Visualising the Test set results
from matplotlib.colors import ListedColormap
X_set, y_set = X_test, y_test
# Build a fine grid covering the range of both (scaled) features
X1, X2 = np.meshgrid(np.arange(start = X_set[:, 0].min() - 1, stop = X_set[:, 0].max() + 1, step = 0.01),
                     np.arange(start = X_set[:, 1].min() - 1, stop = X_set[:, 1].max() + 1, step = 0.01))
# Colour each grid point by the class the classifier predicts for it
plt.contourf(X1, X2, classifier.predict(np.array([X1.ravel(), X2.ravel()]).T).reshape(X1.shape),
             alpha = 0.75, cmap = ListedColormap(('red', 'green')))
# Plot the actual test points, one colour per class
for i, j in enumerate(np.unique(y_set)):
    plt.scatter(X_set[y_set == j, 0], X_set[y_set == j, 1],
                color = ListedColormap(('red', 'green'))(i), label = j)
plt.title('SVM (Test set)')
plt.xlabel('Age (scaled)')
plt.ylabel('Estimated Salary (scaled)')
plt.legend()
plt.show()
The graph will look like the following:
From the graph above, we can see that our model draws a straight line that tries to separate the two classes as accurately as possible.
This tutorial only explains SVM in two-dimensional space. In the next tutorial, we will see SVM in higher-dimensional spaces.
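The coloured regions in the plot come from predicting a class for every point of a dense grid and reshaping the predictions back into the grid's shape. The sketch below shows that mechanism on a very coarse grid; the simple threshold rule stands in for classifier.predict and is purely an assumption for illustration.

```python
import numpy as np

# A coarse grid: 3 values along the first axis, 2 along the second.
g1, g2 = np.meshgrid(np.arange(0, 3, 1), np.arange(0, 2, 1))

# Flatten both coordinate arrays and pair them up: one row per grid point.
grid_points = np.array([g1.ravel(), g2.ravel()]).T

# Stand-in for classifier.predict (hypothetical rule):
# label 1 where the two coordinates sum to 2 or more.
fake_labels = (grid_points.sum(axis=1) >= 2).astype(int)

# Reshape the flat predictions back to the grid so contourf can draw them.
label_grid = fake_labels.reshape(g1.shape)

print(grid_points.shape)
print(label_grid)
```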