Decision Tree Regression | Machine Learning

Decision Tree Regression: This regression technique is based on the decision tree structure. A decision tree is a hierarchical structure that breaks a dataset down into smaller and smaller subsets while the corresponding tree is built up incrementally. The tree contains decision nodes and leaf nodes. A decision node (e.g. Outlook) tests the value of an input variable (x) and has two or more branches (e.g. Sunny, Overcast, Rainy). A leaf node (e.g. Hours Played) holds the decision, that is, the value of the output variable (y). The decision node that corresponds to the best predictor becomes the topmost node and is called the root node.


Decision Trees are used for both Classification and Regression tasks. In this tutorial, we will focus on Regression trees.

Let's consider a scatter plot of a certain dataset.



Here, we take a dataset that contains two independent variables, X1 and X2, and we are predicting a third, dependent variable y. You cannot see y on the scatter plot because the plot has only two dimensions. To visualize y, we need to add another dimension, after which the plot would look like the following:


How does a Decision Tree Work for Regression?

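Well, for this our decision tree makes some splits on the dataset. In classification trees, splits are usually chosen using information entropy (which tells how much information there is in an event); in regression trees, the analogous criterion is typically the reduction in variance (mean squared error) of y within the resulting groups. Either way, this basically divides the points into groups. The algorithm decides where to split, and how many times, and divides the dataset accordingly. The figure will make it clear.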



Here we can see the decision tree made four splits and divided the data points into five groups.
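The search for a single split can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the helper name `best_split`, the one-feature setup, and the toy values are all made up for the example, and the score used is the sum of squared errors around each group's mean, which is one common regression criterion rather than necessarily the one behind the figure above.

```python
import numpy as np

def best_split(x, y):
    """Return (threshold, sse) for the best single split of one feature.

    Hypothetical helper for illustration: it tries every midpoint between
    consecutive sorted feature values and keeps the one that minimises the
    total squared error around the two group means.
    """
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_sse = None, np.inf
    for i in range(1, len(x_sorted)):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # identical feature values cannot be separated
        left, right = y_sorted[:i], y_sorted[i:]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if sse < best_sse:
            best_sse = sse
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2
    return best_threshold, best_sse

# Made-up data with an obvious jump between x = 3 and x = 10
x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([5.0, 6.0, 7.0, 50.0, 51.0, 52.0])
threshold, sse = best_split(x, y)
print(threshold, sse)
```

On this toy data the chosen threshold lands between 3 and 10, where the jump in y is. A full tree repeats the same search inside each resulting group, over every feature, until a stopping rule is met.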

Now, the algorithm takes the average value of y in each group and, based on those values, builds the decision tree for this dataset. The tree would look like the following:


The decision tree above shows that whenever a new data point (X1, X2) falls in one of the leaves, the tree returns the value of that leaf, which is the average y of its group, as the prediction for that point.
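To check the "prediction equals the leaf average" idea, here is a minimal sketch using scikit-learn on synthetic data. The data, the variable names, and the choice of `max_leaf_nodes=5` (to mirror the five groups above) are assumptions made for illustration; they are not the dataset shown in the figures.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data: two features (X1, X2) and a noisy target y
rng = np.random.default_rng(0)
X_demo = rng.uniform(0, 100, size=(200, 2))
y_demo = np.where(X_demo[:, 0] < 50, 10.0, 40.0) + rng.normal(0, 1, size=200)

# Cap the tree at five leaves, i.e. five groups of points
tree = DecisionTreeRegressor(max_leaf_nodes=5, random_state=0).fit(X_demo, y_demo)

leaf_ids = tree.apply(X_demo)           # leaf index of every training point
new_point = np.array([[20.0, 70.0]])    # an arbitrary unseen point
leaf = tree.apply(new_point)[0]         # the leaf the new point falls into
prediction = tree.predict(new_point)[0]
leaf_mean = y_demo[leaf_ids == leaf].mean()

print(prediction, leaf_mean)
```

With the default squared-error criterion, the two printed numbers should agree: the prediction is simply the mean of the training targets that ended up in the same leaf.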

Decision Tree Regression in Python: In this tutorial, we will implement decision tree regression in Python. We will work on a dataset (Position_Salaries.csv) that contains the salaries of some employees according to their Position. Our task is to predict the salary of an employee at a level that does not appear in the dataset. So we will make a regression model using a decision tree for this task.

You can download the dataset from here.

First of all, we will import the essential libraries.

# Importing the Essential Libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

Then, we import our dataset.

# Importing the Dataset
dataset = pd.read_csv("Position_Salaries.csv")

After executing this code, the dataset will be imported into our program. Let's have a look at the dataset:


Now, we need to determine the dependent and independent variables. Here we can see that Level is the independent variable, while Salary is the dependent (target) variable, as we want to find out the salary of an employee according to his Level. So our feature matrix X will contain the Level column, and the Salary values go into the dependent variable vector, y.

# Creating Feature Matrix and Dependent Variable Vector
X = dataset.iloc[:, 1:2].values
y = dataset.iloc[:, 2].values
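The slice `1:2` (rather than a plain `1`) matters because scikit-learn expects the feature matrix to be two-dimensional. The sketch below shows the difference on a tiny stand-in DataFrame; the rows and values are invented for illustration and are not the contents of Position_Salaries.csv.

```python
import pandas as pd

# Hypothetical stand-in rows with the same column order as described above
toy = pd.DataFrame({
    "Position": ["A", "B", "C"],
    "Level": [1, 2, 3],
    "Salary": [100, 200, 300],
})

X_toy = toy.iloc[:, 1:2].values   # slice keeps the column axis: shape (3, 1)
y_toy = toy.iloc[:, 2].values     # single index drops it: shape (3,)
flat = toy.iloc[:, 1].values      # what X would be with a plain index: shape (3,)

print(X_toy.shape, y_toy.shape, flat.shape)
```

A one-dimensional array like `flat` would have to be reshaped before being passed as the feature matrix, which is why the slice form is used for X.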

Now, we use the DecisionTreeRegressor class from the Scikit-learn library and make an object of this class. Then we fit the object to our dataset to make our model.

# Fitting Decision Tree Regression to the dataset
from sklearn.tree import DecisionTreeRegressor
regressor = DecisionTreeRegressor(random_state = 0)
regressor.fit(X, y)
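Once a regressor has been fitted, its learned rules can be printed as text with scikit-learn's `export_text`. The sketch below is self-contained and uses made-up levels and salaries (`X_demo`, `y_demo`) as an assumption, so the thresholds it prints are not those of the tutorial's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor, export_text

# Made-up data: ten levels with invented, strictly increasing salaries
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 1000.0 * X_demo.ravel() ** 2

demo_regressor = DecisionTreeRegressor(random_state=0)
demo_regressor.fit(X_demo, y_demo)

# Text rendering of the decision nodes (thresholds) and leaf values
rules = export_text(demo_regressor, feature_names=["Level"])
print(rules)
print("Number of leaves:", demo_regressor.get_n_leaves())
```

Because every made-up salary is distinct and no depth limit is set, the fully grown tree ends with one leaf per training row. That is worth keeping in mind when reading predictions from an unconstrained tree on a small dataset.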

Well, our model is ready! Let's test our model to see how it predicts an unknown value.

# Predicting a new result
y_pred = regressor.predict([[6.5]])

We predict the salary for level 6.5. After executing the code, we get an output of $150k. To see how closely our model predicted the value, let's visualize the training set.

# Visualizing the Training Set
X_grid = np.arange(X.min(), X.max(), 0.01)  # fine grid of levels between the smallest and largest
X_grid = X_grid.reshape((len(X_grid), 1))   # 2-D array, as predict() expects
plt.scatter(X, y, color = 'red')
plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')
plt.title('Decision Tree Regression')
plt.xlabel('Position level')
plt.ylabel('Salary')
plt.show()

The plot would look like this:


After executing the code, we see a graph of the predicted salaries across the 10 levels. It is a nonlinear and non-continuous regression model, so the graph does not look like those of linear regression models. This is because decision tree regression takes the average value of each group and assigns this value to any point that falls in that group. So the graph is not continuous; rather, it looks like a staircase.
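The staircase shape can also be confirmed numerically: however fine the grid, the predictions take only as many distinct values as the tree has leaves. The sketch below assumes made-up levels and salaries (`X_demo`, `y_demo`) rather than the tutorial's dataset.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Made-up data: ten levels with invented salaries
X_demo = np.arange(1, 11, dtype=float).reshape(-1, 1)
y_demo = 1000.0 * X_demo.ravel() ** 2

tree = DecisionTreeRegressor(random_state=0).fit(X_demo, y_demo)

# Predict on a dense grid of levels
grid = np.arange(1.0, 10.0, 0.01).reshape(-1, 1)
grid_pred = tree.predict(grid)

n_steps = len(np.unique(grid_pred))   # number of distinct predicted values
print(len(grid), "grid points,", n_steps, "distinct predictions")
```

Hundreds of grid points collapse onto a handful of flat steps, one per leaf, which is exactly the staircase seen in the plot.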

From the graph, we see that the prediction for level 6.5 is pretty close to the actual value (around $160k). So we can say it is a reasonably good regression model.