In this tutorial, we are going to understand decision tree regression and implement it in Python.
This regression technique is built on the decision tree structure. A decision tree is a hierarchical structure that breaks a dataset down into smaller and smaller subsets while, at the same time, an associated tree is incrementally developed. The tree contains decision nodes and leaf nodes. A decision node (e.g. Outlook) tests the value of an input variable (x) and has two or more branches (e.g. Sunny, Overcast, Rainy). A leaf node (e.g. Hours Played) holds the decision, i.e. the output variable (y). The decision node that corresponds to the best predictor becomes the topmost node and is called the root node.
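To make the structure concrete, here is a minimal Python sketch of such a tree written as nested conditionals. The Humidity and Windy tests and the leaf values are hypothetical additions for illustration; only the shape of the decision-node/leaf-node structure matters:

# Hypothetical hand-written decision tree for the Outlook example.
# Decision nodes test input variables (x); leaf nodes return y.
def predict_hours_played(outlook, humidity, windy):
    if outlook == "Sunny":                       # root (decision node)
        return 30 if humidity == "High" else 45  # leaf nodes
    elif outlook == "Overcast":
        return 50                                # leaf node
    else:                                        # "Rainy" branch
        return 25 if windy else 40               # leaf nodes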
Decision Trees are used for both Classification and Regression tasks. In this tutorial, we will focus on Regression trees.
Let's consider a scatter plot of a certain dataset.
Here, we take a dataset that contains two independent variables, X1 and X2, and we are predicting a dependent variable y. You cannot see y on the scatter plot because the plot has only two dimensions. To visualize y, we need to add a third dimension, after which the plot would look like the following:
Well, for this our decision tree makes a series of splits on the dataset based on an impurity measure (classification trees commonly use information entropy, which tells how much information there is in an event; regression trees typically minimize the variance, i.e. the squared error, within each group). Each split divides the points into groups. The algorithm decides how many splits are worthwhile and splits the dataset accordingly. The figure below will make it clear.
Here we can see the decision tree made four splits and divided the data points into five groups.
Now, the algorithm takes the average value of y in each group, and based on those averages it builds the decision tree for this dataset. The tree would look like the following:
The decision tree above shows that whenever a new data point falls into one of the leaves, the tree returns the average value stored at that leaf as the prediction for y.
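Before moving to scikit-learn, here is a minimal sketch of that idea: a regression tree is just a piecewise-constant function. The split thresholds and group averages below are made up for illustration:

# Hypothetical regression tree with hand-picked splits and averages.
def tree_predict(x1, x2):
    if x1 < 20:            # first split, on X1
        return 300.5       # average of y over group 1
    elif x2 < 170:         # second split, on X2
        return 65.7        # average of y over group 2
    else:
        return 1023.0      # average of y over group 3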
Let's implement the above idea in Python. We will work on a dataset (Position_Salaries.csv) that contains the salaries of some employees according to their position levels. Our task is to predict the salary of an employee at a level that does not appear in the data, so we will build a regression model for it using a decision tree.
You can download the dataset from here.
First of all, we will import the essential libraries.
# Importing the Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
Then, we import our dataset.
# Importing the Dataset
dataset = pd.read_csv("Position_Salaries.csv")
After executing this code, the dataset will be imported into our program. Let's have a look at that dataset:
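One simple way to have that look from inside the program is to print the DataFrame; the file is small (10 levels, as we will see in the plot later), so printing it whole is fine:

# Display the entire dataset
print(dataset)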
Now, we need to determine the dependent and independent variables. Here we can see the Level is an independent variable while Salary is the dependent variable or target variable as we want to find out the salary of an employee according to his Level. So our feature matrix X will contain the Level column and the value of Salary is taken into the dependent variable vector, y.
# Creating Feature Matrix and Dependent Variable Vector
X = dataset.iloc[:, 1:2].values  # Level column (kept two-dimensional)
y = dataset.iloc[:, 2].values    # Salary column
Now, we use the DecisionTreeRegressor class from the scikit-learn library and create an object of this class. Then we fit the object to our dataset to build our model.
# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
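If you are curious which splits the model actually learned, scikit-learn can render the fitted tree as text. This is an optional inspection step, not part of the tutorial's main flow:

# Optional: print the learned tree structure as plain text
from sklearn.tree import export_text
print(export_text(regressor, feature_names=["Level"]))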
Well, our model is ready! Let's test the model to see how it predicts an unknown value.
# Predicting a new result
y_pred = regressor.predict([[6.5]])
We predict the salary for level 6.5. After executing the code, we get an output of $150k.
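To display the predicted value yourself, print the returned array; the figure it contains is the $150k reported above:

# Show the model's prediction for level 6.5
print(y_pred)   # [150000.]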
To see how closely the model predicted this value, let's visualize the model over the training set.
# Visualizing the Training Set
X_grid = np.arange(X.min(), X.max(), 0.01)
X_grid = X_grid.reshape((len(X_grid), 1))
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()
The plot would look like this:
After executing the code, we see a graph where the predicted salaries across the 10 levels are plotted. Decision tree regression is a nonlinear and non-continuous regression model, so this graph does not look like a linear regression fit. Because the decision tree takes the average value of each group and assigns that value to every input that falls in the group, the graph is not continuous; rather, it looks like a staircase.
From the graph, we see that the prediction for a 6.5 level is pretty close to the actual value (around $160k). So we can say it is a good regression model.
Decision trees for regression come with some advantages and disadvantages; let's have a look at them.

Advantages:
- Easy to understand and interpret; the learned splits can be visualized as a tree.
- Requires little data preparation: no feature scaling or normalization is needed.
- Can capture nonlinear relationships that a linear model would miss.

Disadvantages:
- Prone to overfitting, especially when the tree is grown deep without constraints.
- Small changes in the data can produce a very different tree (high variance).
- Predictions are piecewise constant, so the model cannot extrapolate or produce smooth outputs.
Here, I have answered some of the frequently asked questions about decision tree regression.
What is entropy in a decision tree?

By definition, entropy is the measure of the total disorder in a system. A decision tree uses a top-down approach to build a model by continuously splitting the data into smaller portions. Before each split, it calculates the entropy to understand the information gain it would get from that split. Entropy is the main input to the information gain equation: the model calculates the entropy of the parent node and of the child nodes, and then finds the information gain from these two measures. The formula for entropy is the following:

$$E(S) = \sum_{i=1}^{c} -p_i \log_2 p_i$$

Here, $S$ is the set of data points at the node, $c$ is the number of classes, and $p_i$ is the proportion of data points in $S$ that belong to class $i$.
This entropy is used in the information gain equation, which is the following:

$$Gain(S, A) = E(S) - \sum_{v \in values(A)} \frac{|S_v|}{|S|} \, E(S_v)$$

Here, $E(S)$ is the entropy of the parent node, $A$ is the attribute used for the split, $S_v$ is the subset of $S$ for which $A$ takes the value $v$, and the sum is the weighted average entropy of the child nodes.
The goal of a decision tree model is to decrease entropy and increase the information gain.
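To make the two formulas concrete, here is a small NumPy sketch (it is not part of the tutorial's model code; `labels` is assumed to be an array of class labels):

import numpy as np

def entropy(labels):
    # E(S) = sum over classes of -p_i * log2(p_i)
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Gain(S, A) = E(S) - weighted average entropy of the children
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

print(entropy(np.array([1, 1, 1, 1])))  # ~0: a pure node has the lowest entropy
print(entropy(np.array([0, 0, 1, 1])))  # 1.0: a 50/50 node has the highest entropy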
When is the entropy of a node at its lowest or highest?

It depends on the distribution of classes at the node. When a node is homogeneous, i.e. all its data points belong to the same class, the entropy is at its lowest (zero). But when the node contains data points equally from every class, in other words, when no class holds a majority, the node has the maximum entropy.
How does a decision tree find the best split?

The split is chosen by calculating the information gain of every candidate split. Information gain is higher when the weighted entropy of the resulting child nodes is lower. So, to find the best split, the algorithm looks for the split that most decreases the entropy of the child nodes, which maximizes the information gain.
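As an illustration of that search, here is a hedged sketch for a single feature, reusing the entropy and information_gain helpers sketched above; it scans every candidate threshold and keeps the one with the highest gain:

def best_split(x, labels):
    # Try each unique value of x (except the largest) as a threshold
    # and keep the threshold with the maximum information gain.
    best_gain, best_threshold = 0.0, None
    for threshold in np.unique(x)[:-1]:
        left = labels[x <= threshold]
        right = labels[x > threshold]
        gain = information_gain(labels, [left, right])
        if gain > best_gain:
            best_gain, best_threshold = gain, threshold
    return best_threshold, best_gain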
What is the difference between a classification tree and a regression tree?

Both classification and regression trees use the same underlying tree structure, so there are not many differences between them. Some of the key differences are:
- A classification tree predicts a discrete class label, while a regression tree predicts a continuous value.
- A classification tree typically splits using entropy (information gain) or the Gini index, while a regression tree typically splits by minimizing the variance (squared error) within the child nodes.
- The leaf of a classification tree holds the majority class of its data points, while the leaf of a regression tree holds their average value.
In this tutorial, I tried to explain all the aspects of decision tree regression. The key takeaways from the tutorial are:
- A regression tree splits the dataset into groups and predicts the average value of y in each group.
- The resulting model is nonlinear and non-continuous; its prediction curve looks like a staircase.
- In Python, such a model can be built in a few lines with scikit-learn's DecisionTreeRegressor.
Hope this tutorial has helped you to understand all the concepts clearly. If you have any questions about the tutorial, let me know in the comments.
Happy Machine Learning!