Regression Decision Trees: A Step-by-Step Guide for Beginners

Ishwarya S
5 min read · Aug 7, 2024


Introduction

Hello, data enthusiasts! In a previous blog, we dove deep into the world of decision trees and explored how decision tree classification works step by step; you can find how a Decision Tree Classifier is built in my blog here. Today, we’re diving into the world of Regression Decision Trees, powerful tools for predicting continuous values. We’ll explore how these trees are built from scratch, the role of the loss function, and how the best split is chosen at each node. Let’s get started!

What is a Regression Decision Tree?

A Regression Decision Tree is a type of decision tree that predicts continuous values. Unlike classification trees, which predict categories, regression trees predict a numerical value for a given set of features.

Key Concepts:

  • Nodes and Leaves: Each internal node represents a decision based on a feature, and each leaf node represents a predicted continuous value.
  • Splitting: The process of dividing a node into two or more sub-nodes based on certain conditions.
  • Loss Function: Measures the difference between the actual and predicted values. In regression trees, the most common loss function is the mean squared error (MSE).

Building Regression Trees from Scratch

Let’s break down the process step-by-step:

Step 1: Start with the Entire Dataset

Begin with the entire dataset and treat it as the root of the tree.

Step 2: Select the Best Split

For each node, consider splitting the data on every feature. The goal is to find the split that minimizes the loss function (e.g., MSE).

How to Find the Best Split:

  1. Calculate MSE for Each Split: For every possible split, calculate the mean squared error (MSE) for the resulting sub-nodes.
  • For a node t, the MSE is given by:

MSE_t = (1 / N_t) · Σ_{i ∈ t} (y_i − ȳ_t)²

where N_t is the number of observations in node t, y_i is the actual value of observation i, and ȳ_t is the mean value of node t.

2. Choose the Split with the Lowest MSE: Select the feature and threshold that result in the lowest MSE for the child nodes (a code sketch of this search follows this list).
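To make this search concrete, here is a minimal sketch in Python. It assumes a single NumPy feature array x and a target array y; node_mse and find_best_split are illustrative names of my own, not functions from any library.

```python
import numpy as np

def node_mse(y):
    """MSE of a node: mean squared difference from the node's mean value."""
    return float(np.mean((y - y.mean()) ** 2)) if len(y) else 0.0

def find_best_split(x, y):
    """Try every midpoint between consecutive sorted feature values and
    return the threshold whose children have the lowest weighted MSE."""
    order = np.argsort(x)
    x_sorted, y_sorted = x[order], y[order]
    best_threshold, best_score = None, float("inf")
    for i in range(1, len(x_sorted)):
        threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
        left, right = y_sorted[:i], y_sorted[i:]
        # Weight each child's MSE by its share of the observations.
        score = (len(left) * node_mse(left) + len(right) * node_mse(right)) / len(y)
        if score < best_score:
            best_threshold, best_score = threshold, score
    return best_threshold, best_score
```

Note that summing each child’s squared residuals, as the worked example below does, ranks candidate splits exactly the same way: the weighted MSE above is just that sum divided by the constant N.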

Step 3: Split the Node

Divide the node into two sub-nodes based on the selected feature and threshold: observations that satisfy the condition go to one child, and the rest go to the other.

Step 4: Repeat the Process

Recursively apply steps 2 and 3 to each sub-node until a stopping criterion is met (e.g., maximum depth of the tree, minimum number of samples per leaf).

Step 5: Assign Values to Leaves

Once the stopping criteria are met, assign each leaf node a value, typically the mean value of the observations in that node.
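Putting steps 1 through 5 together, a compact recursive builder might look like the sketch below. It reuses the hypothetical find_best_split from the earlier sketch, and max_depth and min_samples_leaf play the role of the stopping criteria from step 4.

```python
import numpy as np  # find_best_split comes from the previous sketch

def build_tree(x, y, depth=0, max_depth=3, min_samples_leaf=2):
    """Recursively grow a one-feature regression tree. Internal nodes store
    a threshold and two children; leaves store the mean target value (step 5)."""
    # Stopping criteria: depth limit, too few samples, or a pure node.
    if depth >= max_depth or len(y) < 2 * min_samples_leaf or np.all(y == y[0]):
        return {"leaf": True, "value": float(np.mean(y))}

    threshold, _ = find_best_split(x, y)
    mask = x < threshold
    # Reject splits that would leave a child with too few samples.
    if mask.sum() < min_samples_leaf or (~mask).sum() < min_samples_leaf:
        return {"leaf": True, "value": float(np.mean(y))}

    return {
        "leaf": False,
        "threshold": float(threshold),
        "left": build_tree(x[mask], y[mask], depth + 1, max_depth, min_samples_leaf),
        "right": build_tree(x[~mask], y[~mask], depth + 1, max_depth, min_samples_leaf),
    }

def predict_one(tree, xi):
    """Route a single feature value down to a leaf and return its mean."""
    while not tree["leaf"]:
        tree = tree["left"] if xi < tree["threshold"] else tree["right"]
    return tree["value"]
```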

Regression Tree from Scratch

Let’s consider the dataset below, which records the age of individuals and the effectiveness of a drug. For simplicity, we have only one feature.

Fig 1: Example Data
Fig 1.1: Plotting Age against Drug Effectiveness

Step 1:

For now, let’s focus on the two observations with the smallest ages. By calculating the average of their ages and treating it as the first candidate threshold, we can build a simple tree that splits the observations into two groups, Age<4.5 and Age≥4.5 (a small snippet after the figures below shows this in code).

Fig 2: The vertical line splits data with Age<4.5 and Age>4.5
Fig 2.1: Decision tree for the above split of Age<4.5
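In code, that first candidate threshold is just the average of the two smallest ages. Since the original table isn’t reproduced here, the ages below are illustrative stand-ins:

```python
import numpy as np

# Illustrative ages only -- not the exact values from Fig 1.
age = np.sort(np.array([3, 6, 10, 12, 15, 20, 26, 30, 35, 40]))

first_threshold = (age[0] + age[1]) / 2.0  # average of the two smallest ages
left = age[age < first_threshold]          # the Age < 4.5 group
right = age[age >= first_threshold]        # the Age >= 4.5 group
print(first_threshold, left, right)        # 4.5 [3] [ 6 10 12 15 20 26 30 35 40]
```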

Step 2:

1. Calculate the distance from each point to the average line: for each observation in a group, take the difference between its drug effectiveness and the group’s average, then square it.

Fig 3: Calculate distance from each point to the average line in the plot
Fig 3.1: Step-by-step MSE calculation

Final MSE calculation: Final MSE = ∑(squared differences) = 27421.4286. (Strictly speaking, this total is a sum of squared residuals rather than a mean, but since every candidate split is scored on the same set of observations, both quantities rank splits identically.)
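In code, the score of a candidate split is the total of those squared differences across both groups. The helper below is an illustrative sketch, and the effectiveness values are made up, so it won’t reproduce the 27421.4286 total from the original table:

```python
import numpy as np

# Illustrative data, not the blog's exact table.
age = np.array([3, 6, 10, 12, 15, 20, 26, 30, 35, 40])
effectiveness = np.array([5, 7, 100, 98, 60, 55, 38, 29, 6, 4])

def split_ssr(x, y, threshold):
    """Sum of squared differences from each group's mean -- the quantity
    the walkthrough totals up for a candidate split."""
    total = 0.0
    for group in (y[x < threshold], y[x >= threshold]):
        if len(group):
            total += float(np.sum((group - group.mean()) ** 2))
    return total

print(split_ssr(age, effectiveness, 4.5))  # score of the Age < 4.5 split
```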

2. To find the split with the lowest MSE, repeat the procedure: compute the average of the second and third points, use it as the next candidate threshold, and calculate the MSE. Keep taking the average of each consecutive pair of points and calculating the MSE for that split.

Fig 3.2: Plotting the MSE for all the age threshold values

Since there is only one feature, we can choose the first split simply by reading the graph of MSE values. The MSE is lowest at Age<14, so we choose this as our root node.
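Reproducing the spirit of Fig 3.2 in code means scoring every midpoint between consecutive ages and picking the minimum. This continues the illustrative age, effectiveness, and split_ssr from the previous sketch:

```python
# Continues age, effectiveness, and split_ssr from the previous sketch.
age_sorted = np.sort(age)
thresholds = (age_sorted[:-1] + age_sorted[1:]) / 2.0  # consecutive-pair midpoints
scores = [split_ssr(age, effectiveness, t) for t in thresholds]

best = thresholds[int(np.argmin(scores))]
print(best, min(scores))  # the root threshold (Age < 14 on the blog's own data)
```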

Step 3:

Divide the Node: Use Age<14 as the root node.

Fig 4: Split using Age<14
Fig 4.1 Graphical representation of split using Age<14

Step 4:

To determine the next split, treat the five points to the left of the root threshold as their own dataset: compute the consecutive averages of age and the MSE for each candidate, just as before, and keep the best one. Repeat the same steps for the ten points to the right of the root threshold.

Step 5:

The recursion stops once a stopping criterion is met (for example, a maximum depth or a minimum number of samples per leaf). Each leaf is then assigned the mean drug effectiveness of the observations that fall into it, and that mean is the tree’s prediction for any new individual routed to that leaf.
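As a sanity check, the same greedy procedure is available off the shelf in scikit-learn, where recent versions call the MSE criterion "squared_error". The data below is still the illustrative array, not the blog’s table:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

age = np.array([3, 6, 10, 12, 15, 20, 26, 30, 35, 40]).reshape(-1, 1)
effectiveness = np.array([5, 7, 100, 98, 60, 55, 38, 29, 6, 4])

# criterion="squared_error" grows the tree by minimizing MSE at each split,
# mirroring the manual walkthrough above.
model = DecisionTreeRegressor(criterion="squared_error", max_depth=2, min_samples_leaf=2)
model.fit(age, effectiveness)

print(model.predict([[13]]))  # mean effectiveness of the leaf that Age=13 falls into
```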

Conclusion

In this blog, we’ve explored Regression Decision Trees and how they are built from scratch. Remember, this is just the beginning. Regression Decision Trees are a powerful tool in the machine learning toolbox, and there’s much more to explore. Happy learning and predicting!

