Regularization in Machine Learning: A Beginner’s Guide

Ishwarya S
Sep 26, 2024

Introduction

Hello, fellow data enthusiasts! 🌟 Machine learning models are powerful tools that help us solve complex problems by learning patterns from data. But there’s a catch: sometimes, these models can get a little too good at memorizing the training data, leading to a problem called overfitting. Overfitting happens when a model captures not just the true patterns in the data but also the noise and random fluctuations. As a result, it performs poorly on unseen data.

So, how do we combat this? The answer is regularization! In this blog, we’ll explore regularization in detail, why it’s needed, and dive into different types of regularization methods like L1 (Lasso), L2 (Ridge), and ElasticNet. We’ll break down the mathematics behind each method and explain when to use them.

Let’s get started!

Why Do We Need Regularization?

Imagine you have a complex machine learning model with many features (variables). If you allow the model to fit the data without any restrictions, it will likely fit even the smallest fluctuations (noise) in the training data. This means the model is overfitting — great on training data but terrible on new data.

Regularization helps to prevent overfitting by introducing a penalty term into the model’s cost function (for more about cost functions, see my earlier blog post on the topic). This penalty discourages the model from fitting the data too closely by shrinking the coefficients associated with less important features.

In simpler terms: regularization keeps your model in check by reducing complexity, forcing it to focus on the important patterns instead of memorizing everything.

Types of Regularization

There are several regularization techniques, but we’ll focus on three popular ones: L1 Regularization (Lasso), L2 Regularization (Ridge), and ElasticNet. Let’s break them down one by one.

1. L1 Regularization (Lasso)

L1 regularization, also known as Lasso (Least Absolute Shrinkage and Selection Operator), adds a penalty equal to the sum of the absolute values of the coefficients to the cost function.

Mathematical Equation:

The cost function with L1 regularization looks like this:

Cost = MSE + λ Σ |wᵢ|

where MSE is the mean squared error, the wᵢ are the model coefficients, and λ controls the strength of the penalty.

Explanation:

  • The penalty is proportional to the sum of the absolute values of the coefficients.
  • L1 regularization encourages some coefficients to become exactly zero, effectively performing feature selection. This means Lasso not only prevents overfitting but also helps you identify which features are most important.

Mathematics Behind L1 Regularization (Lasso)

As described above, the L1 (Lasso) penalty is the sum of the absolute values of the coefficients, scaled by λ. This penalty causes some of the coefficients to shrink all the way to zero, effectively removing certain features from the model.

Let’s dive into the math behind how L1 regularization reduces feature coefficients step-by-step, using an example with numbers to make it clearer.

Example: L1 Regularization in Action

Let’s take a simplified example of a linear regression model with two features x1​ and x2​.

Step 1: Without Regularization

Assume that we fit the model and get the following coefficients:

w1=3 and w2=5

The original cost function (without regularization) is just the mean squared error (MSE).
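In symbols, for n training examples with true values yᵢ and predictions ŷᵢ:

MSE = (1/n) Σ (yᵢ − ŷᵢ)²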

Now, suppose we observe overfitting and decide to apply L1 regularization to reduce the feature coefficients.

Step 2: Applying L1 Regularization

Let’s assume we set the regularization parameter λ = 0.5. The cost function with L1 regularization becomes:

Cost = MSE + 0.5 × (|w1| + |w2|)

Substitute the values of the coefficients:

Cost = MSE + 0.5 × (|3| + |5|) = MSE + 0.5 × 8 = MSE + 4

So, our total cost function now includes a penalty of 4. The optimizer (such as gradient descent) will try to minimize this total cost by reducing the values of the coefficients w1​ and w2​.
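As a quick sanity check, here is the same penalty computed in plain Python (using the illustrative weights and λ from this example):

```python
# L1 penalty term for the example above: lambda * (|w1| + |w2|)
weights = [3, 5]   # w1, w2
lam = 0.5          # regularization strength (lambda)

l1_penalty = lam * sum(abs(w) for w in weights)
print(l1_penalty)  # 0.5 * (3 + 5) = 4.0
```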

Step 3: Shrinking Coefficients

During optimization, L1 regularization reduces the magnitudes of the coefficients by applying a penalty on large weights. Let’s assume the optimization process results in the following updated coefficients:

w1=2.2, w2=4.3

Notice that both coefficients w1 and w2​ have been reduced (shrinkage). If we increased λ, this shrinkage would be even stronger, and for high enough λ, one or both coefficients might become exactly zero.

Step 4: Feature Elimination (Sparsity)

One of the key features of L1 regularization is that it can force coefficients to become exactly zero. Let’s increase λ to a larger value, say λ=1.

Now, the cost function becomes:

Cost = MSE + 1 × (|w1| + |w2|)

The optimizer tries to minimize the total cost again by shrinking the coefficients further. This time, let’s say the updated coefficients become:

w1 = 0, with w2 shrunk further but still nonzero

Now that w1 has been shrunk to zero, the first feature x1 is effectively removed from the model. This is how L1 regularization performs automatic feature selection by zeroing out irrelevant or less important features.
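You can watch this behavior with scikit-learn’s Lasso. Here’s a minimal sketch on synthetic data (the data and the alpha values are made up purely for illustration; in scikit-learn, the regularization strength λ is called alpha):

```python
import numpy as np
from sklearn.linear_model import Lasso

# Synthetic data: y depends strongly on feature 0, weakly on feature 1, not at all on the rest
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Increasing alpha (i.e. lambda) drives more coefficients to exactly zero
for alpha in [0.01, 0.1, 1.0]:
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 2))
```

With the largest alpha, the weak and irrelevant coefficients are typically zeroed out first, which is exactly the sparsity effect described above.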

When to Use L1 Regularization:

  • When you suspect that many of the features are irrelevant.
  • When you want to automatically perform feature selection as part of the learning process.

2. L2 Regularization (Ridge)

L2 regularization, also known as Ridge, adds a penalty equal to the sum of the squared coefficients to the cost function.

Mathematical Equation:

The cost function with L2 regularization looks like this:

Cost = MSE + λ Σ wᵢ²

Explanation:

  • The penalty is proportional to the sum of the squared coefficients.
  • Unlike L1 regularization, L2 regularization shrinks the coefficients but never forces them to be exactly zero. This means that Ridge keeps all features in the model but reduces their impact.
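To see this difference in practice, here’s a minimal sketch using scikit-learn’s Ridge on synthetic data (illustrative values only; alpha is scikit-learn’s name for λ):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 4 * X[:, 0] + 0.3 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Ridge shrinks the coefficients toward zero as alpha grows, but does not zero them out
for alpha in [0.1, 10.0, 1000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))
```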

When to Use L2 Regularization:

  • When you have many features, and you want to reduce the impact of less important ones without eliminating them.
  • When all features are thought to contribute to the prediction but need some regularization to prevent overfitting.

3. ElasticNet Regularization

ElasticNet is a combination of L1 and L2 regularization. It adds both penalties to the cost function.

Mathematical Equation:

The cost function for ElasticNet looks like this:

Cost = MSE + λ₁ Σ |wᵢ| + λ₂ Σ wᵢ²

where λ₁ controls the L1 (Lasso) part of the penalty and λ₂ controls the L2 (Ridge) part.

Explanation:

  • ElasticNet combines the benefits of both L1 and L2 regularization. It can shrink some coefficients to zero (like Lasso) while reducing the magnitude of others (like Ridge).
  • It’s especially useful when there are many correlated features, which can cause Lasso to randomly select one feature from a group of correlated ones.
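Here’s a hedged sketch of that behavior with scikit-learn, using two deliberately correlated features (the data and parameter values are illustrative; in scikit-learn, alpha sets the overall penalty strength and l1_ratio sets the L1/L2 mix):

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso

rng = np.random.default_rng(1)
x1 = rng.normal(size=300)
x2 = x1 + rng.normal(scale=0.05, size=300)      # x2 is nearly a copy of x1
X = np.column_stack([x1, x2, rng.normal(size=300)])
y = 3 * x1 + 3 * x2 + rng.normal(scale=0.5, size=300)

# Lasso tends to concentrate weight on one of the correlated pair;
# ElasticNet tends to spread it more evenly across both
print("Lasso:     ", np.round(Lasso(alpha=0.1).fit(X, y).coef_, 2))
print("ElasticNet:", np.round(ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y).coef_, 2))
```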

When to Use ElasticNet:

  • When you have many features and suspect some of them may be irrelevant.
  • When you have highly correlated features and want a more stable solution than Lasso.

How to Choose the Right Regularization Technique

  1. L1 (Lasso): Use when you want to automatically perform feature selection. It’s great when you have many irrelevant or unimportant features.
  2. L2 (Ridge): Use when you believe all features are relevant but need to control overfitting. Ridge will shrink the coefficients but keep them all in the model.
  3. ElasticNet: Use when you have a lot of features, some of which may be irrelevant, as well as highly correlated features. ElasticNet gives you the best of both worlds by combining L1 and L2.

The Role of the Regularization Parameter λ

The parameter λ plays a crucial role in regularization. It controls how much penalty is applied to the coefficients:

  • High λ: Forces coefficients to shrink more, leading to a simpler model with less variance (more bias).
  • Low λ: Allows the model to fit the data more closely, potentially leading to overfitting (less bias but more variance).
  • λ=0: No regularization, meaning the model is free to overfit the data.
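To get a feel for this trade-off, here’s a small sketch that sweeps alpha (scikit-learn’s name for λ) for Ridge on synthetic data with a train/test split (all values are illustrative):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(7)
X = rng.normal(size=(120, 30))                   # few samples, many features
y = X[:, 0] - 2 * X[:, 1] + rng.normal(size=120)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Tiny alpha ~ almost no regularization (overfitting risk);
# huge alpha ~ heavy shrinkage (underfitting risk)
for alpha in [1e-4, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    print(f"alpha={alpha:<8} train R^2={model.score(X_tr, y_tr):.2f}  test R^2={model.score(X_te, y_te):.2f}")
```

Typically the training score falls as alpha increases, while the test score peaks somewhere in the middle, mirroring the bias-variance trade-off described above.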

Why Is Regularization Important?

  1. Prevents Overfitting: Regularization helps reduce the complexity of the model, ensuring it doesn’t memorize the training data and generalizes better to unseen data.
  2. Improves Model Interpretability: L1 regularization can zero out coefficients of irrelevant features, making it easier to understand which features are important.
  3. Better Predictions: Regularized models often perform better on test data because they focus on the most important patterns rather than noise.

Conclusion

Regularization is a powerful tool to improve the performance and interpretability of your machine learning models. By controlling model complexity, regularization helps prevent overfitting and ensures your model generalizes well to new, unseen data. Whether you use Lasso, Ridge, or ElasticNet, each regularization technique offers unique advantages depending on your data and use case.

Next time you build a machine learning model, consider applying regularization to enhance its performance. It might just be the key to turning a good model into a great one!

Happy learning and predicting! 🚀
