Regression Decision Trees: A Step-by-Step Guide for Beginners
Introduction
Hello, data enthusiasts! In one of the previous blogs, we dove deep into the world of decision trees and explored, step by step, how decision tree classification works. You can find how a Decision Tree Classifier is built in my blog here. Today, we’re diving into the world of Regression Decision Trees: powerful tools for predicting continuous values. We’ll explore how these trees are built from scratch, the role of the loss function, and how the best split is chosen at every node. Let’s get started!
What is a Regression Decision Tree?
A Regression Decision Tree is a type of decision tree that predicts continuous values. Unlike classification trees, which predict categories, regression trees predict a numerical value for a given set of features.
Key Concepts:
- Nodes and Leaves: Each internal node represents a decision based on a feature, and each leaf node represents a predicted continuous value.
- Splitting: The process of dividing a node into two or more sub-nodes based on certain conditions.
- Loss Function: Measures the difference between the actual and predicted values. In regression trees, the most common loss function is the mean squared error (MSE).
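To make the MSE concrete before we use it, here is a minimal Python sketch; the function name and the numbers are illustrative, not taken from the dataset used later.

```python
import numpy as np

def node_mse(y):
    """Mean squared error of a node: the average squared distance
    of each target value from the node's mean."""
    y = np.asarray(y, dtype=float)
    return np.mean((y - y.mean()) ** 2)

# Hypothetical target values falling into one node
print(node_mse([10.0, 12.0, 15.0]))  # ≈ 4.22
```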
Building Regression Trees from Scratch
Let’s break down the process step-by-step:
Step 1: Start with the Entire Dataset
Begin with the entire dataset and treat it as the root of the tree.
Step 2: Select the Best Split
For each node, consider splitting the data on every feature. The goal is to find the split that minimizes the loss function (e.g., MSE).
How to Find the Best Split:
1. Calculate MSE for Each Split: For every possible split, calculate the mean squared error (MSE) of the resulting sub-nodes.
For a node $t$, the MSE is given by:

$$\mathrm{MSE}_t = \frac{1}{N_t}\sum_{i \in t}\left(y_i - \bar{y}_t\right)^2$$

where $N_t$ is the number of observations in node $t$, $y_i$ is the actual value of observation $i$, and $\bar{y}_t$ is the mean of the target values in node $t$.
2. Choose the Split with the Lowest MSE: Select the feature and threshold that minimize the total squared error of the resulting child nodes (equivalently, the sample-weighted MSE of the children).
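Putting the two steps together, here is a minimal Python sketch of the split search for a single numeric feature. It is a sketch under two assumptions: candidate thresholds are the midpoints between consecutive sorted feature values (as in the worked example below), and splits are compared by the total squared error of the two children, which ranks splits exactly as the sample-weighted MSE does.

```python
import numpy as np

def best_split(x, y):
    """Scan midpoint thresholds between consecutive sorted feature values
    and return the one with the lowest combined children error."""
    order = np.argsort(x)
    x = np.asarray(x, dtype=float)[order]
    y = np.asarray(y, dtype=float)[order]
    best_thr, best_err = None, np.inf
    for i in range(len(x) - 1):
        thr = (x[i] + x[i + 1]) / 2.0          # midpoint candidate threshold
        left, right = y[x < thr], y[x >= thr]
        if len(left) == 0 or len(right) == 0:  # skip degenerate splits
            continue
        # Sum of squared residuals in each child; comparing this total
        # across splits is equivalent to comparing weighted child MSEs.
        err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if err < best_err:
            best_thr, best_err = thr, err
    return best_thr, best_err
```

With more than one feature, the same scan simply runs once per feature, and the overall minimum wins.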
Step 3: Split the Node
Divide the node into two or more sub-nodes based on the selected feature and value.
Step 4: Repeat the Process
Recursively apply steps 2 and 3 to each sub-node until a stopping criterion is met (e.g., maximum depth of the tree, minimum number of samples per leaf).
Step 5: Assign Values to Leaves
Once a stopping criterion is met, assign each leaf node a value, typically the mean of the target values of the observations in that node.
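The five steps above fit into a short recursive sketch. This is a minimal illustration, not a production implementation: it reuses the `best_split` function from the previous snippet, stores the mean target value in each leaf, and uses `max_depth` and `min_samples` (names of my choosing) as the stopping criteria.

```python
import numpy as np  # best_split from the earlier sketch is assumed in scope

def build_tree(x, y, depth=0, max_depth=3, min_samples=2):
    """Recursively grow a one-feature regression tree.
    Each leaf stores the mean target value of its observations."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    if depth >= max_depth or len(y) < min_samples:
        return {"leaf": True, "value": y.mean()}   # Step 5: leaf = node mean
    thr, _ = best_split(x, y)                      # Step 2: best split
    if thr is None:                                # no valid split possible
        return {"leaf": True, "value": y.mean()}
    mask = x < thr                                 # Step 3: divide the node
    return {                                       # Step 4: recurse
        "leaf": False,
        "threshold": thr,
        "left": build_tree(x[mask], y[mask], depth + 1, max_depth, min_samples),
        "right": build_tree(x[~mask], y[~mask], depth + 1, max_depth, min_samples),
    }

def predict(tree, xi):
    """Follow the thresholds down to a leaf and return its stored mean."""
    while not tree["leaf"]:
        tree = tree["left"] if xi < tree["threshold"] else tree["right"]
    return tree["value"]
```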
Regression Tree from Scratch
Let’s consider the dataset below, which records the age of individuals and the effectiveness of a drug. For simplicity, we use only this one feature.
Step 1:
For now, let’s focus on the two observations with the smallest ages. Taking the average of their ages, 4.5, as the first candidate threshold, we can build a simple tree that splits the observations into two groups: Age < 4.5 and Age ≥ 4.5.
Step 2:
Calculate the distance from each point to its group’s average line, square these distances, and add them up across both groups:

Total squared error for this split = ∑(squared differences) = 27421.4286

(Strictly speaking this is a sum of squared residuals rather than a mean; since we only use it to compare candidate splits, it ranks them the same way the MSE does.)
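The value 27421.4286 comes from the blog’s dataset shown in the figure, which is not reproduced here. To make the computation itself concrete, here is the same calculation on a small hypothetical dataset whose first two ages also average to 4.5; the resulting total will of course differ from the one above.

```python
import numpy as np

# Hypothetical (age, effectiveness) pairs -- NOT the blog's dataset.
ages = np.array([3, 6, 10, 14, 20], dtype=float)
eff  = np.array([0, 0, 5, 80, 100], dtype=float)

thr = 4.5                                  # average of the two smallest ages
left, right = eff[ages < thr], eff[ages >= thr]
ssr = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
print(ssr)  # total squared error for the Age < 4.5 split on this toy data
```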
Next, to find the split with the lowest error, repeat the same procedure: take the average of the second and third ages as the next candidate threshold and compute its error. Keep taking the average of each consecutive pair of ages and computing the error for every candidate split.
Since there is only one feature, we can choose the first split by looking at the plot of the error values. The error is lowest at Age < 14, so we make this the root node.
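On the hypothetical numbers above, this scan is just a call to the `best_split` sketch from earlier; on the blog’s actual dataset, the same scan is what singles out Age < 14.

```python
# Scan every consecutive-midpoint threshold and keep the best one.
thr, err = best_split(ages, eff)
print(thr, err)  # the lowest-error threshold for the toy data
```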
Step 3:
Divide the Node: Use Age < 14 as the root split.
Step 4:
To determine the next split, treat the five points on the left of the root threshold (Age < 14) as their own dataset: compute the average of each pair of consecutive ages and the corresponding error, just as before. Then repeat the same steps for the ten points on the right of the root threshold.
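Growing the sub-trees for the points on each side of the root threshold is exactly what the recursion in the `build_tree` sketch does. A quick end-to-end run on the hypothetical data:

```python
# Grow a shallow tree on the toy data and predict for a new age.
tree = build_tree(ages, eff, max_depth=2)
print(predict(tree, 12.0))  # mean effectiveness of the leaf Age 12 falls into
```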
Step 5:
Once the stopping criterion is met (for example, a minimum number of samples per leaf), assign each leaf node a value: the average of the target values of the observations that end up in it. These leaf averages are the predictions the tree returns.
Conclusion
In this blog, we’ve explored Regression Decision Trees and how they are built from scratch. Remember, this is just the beginning. Regression Decision Trees are a powerful tool in the machine learning toolbox, and there’s much more to explore. Happy learning and predicting!