Decision Tree Regression: This Regression is based on the decision tree structure. A decision tree is a form of a tree or hierarchical structure that breaks down a dataset into smaller and smaller subsets. At the same time, an associated decision tree is incrementally developed. The tree contains decision nodes and leaf nodes. The decision nodes(e.g. Outlook) are those nodes represent the value of the input variable(x). It has two or more than two branches(e.g. Sunny, Overcast, Rainy). The leaf nodes(e.g. Hours Played) contains the decision or the output variable(y). The decision node that corresponds to the best predictor becomes the topmost node and called the root node.
Decision Trees are used for both Classification and Regression tasks. In this tutorial, we will focus on Regression trees.
Lets consider a scatter plot of a certain dataset.
Here, we take a dataset that contains two independent variables X1, and X2 and we are predicting a third dependent variable y. You can not find it on the scatterplot as it has two dimensions. To visualize y, we need to add another dimension and after that, it would like the following:
How does a Decision Tree Work for Regression?
Well, for this our decision tree would make some splits on the dataset based on information entropy( information entropy tells how much information there is in an event). This is basically dividing the points into some groups. The algorithm decides the optimal number of splits and splits the dataset accordingly. The figure will make it clear
Here we can see the decision tree made four splits and divided the data points into five groups.
Now, this algorithm will take the average value of each group and based on that values it will build the decision tree for this dataset. The tree would look like the following:
The decision tree above shows that whenever a value of y falls in one of the leaves, it will return the value of that leaf as the prediction for that y value.
You can download the dataset from here.
First of all, we will import the essential libraries.
# Importing the Essential Librariesimport numpy as npimport pandas as pdimport matplotlib.pyplot as plt
Then, we import our dataset.
# Importing the Datasetdataset = pd.read_csv("Position_Salaries.csv")
After executing this code, the dataset will be imported to our program. Let's have a look on that dataset:
Now, we need to determine the dependent and independent variables. Here we can see the Level is an independent variable while Salary is dependent variable or target variable as we want to find out the salary of an employee according to his Level. So our feature matrix X will contain the Level column and the value of Salary is taken into the dependent variable vector, y.
# Creating Feature Matrix and Dependent Variable VectorX = dataset.iloc[:, 1].valuesy = dataset.iloc[:, 2].values
Now, we use DecisionTreeRegressor class from the Scikit-learn library and make an object of this class. Then we will fit the object to our dataset to make our model.
# Fitting Decision Tree Regression to the datasetfrom sklearn.tree import DecisionTreeRegressorregressor = DecisionTreeRegressor(random_state = 0)regressor.fit(X, y)
Well, our model is ready! Let's test our model to know how it predicts an unknown value.
# Predicting a new resulty_pred = regressor.predict([[6.5]])
We predict the result of 6.5 level salary. After executing the code, we get an output of $150k. To learn how closely our model predicted the value, let's visualize the training set.
# Visulizing the Training SetX_grid = np.arange(min(X), max(X), 0.01)X_grid = X_grid.reshape((len(X_grid), 1))plt.scatter(X, y, color = 'red')plt.plot(X_grid, regressor.predict(X_grid), color = 'blue')plt.title('Decision Tree Regression') plt.xlabel('Position level') plt.ylabel('Salary')plt.show()
The plot would look like this:
After executing the code, we can see a graph, where we plotting a prediction of 10 salary corresponding to 10 levels. It is nonlinear and non-continuous regression model. This graph does not look like the other linear regression models. Because the decision tree regression takes the average value of each group and assigns this value for any variable falls in that group. So the graph is not continuous rather it looks like a staircase.
From the graph, we see that the prediction for a 6.5 level is pretty close to the actual value(around $160k). So we can say it is a good regression model.