Unlocking the Secrets of Linear Regression: Assumptions, Implementation, and Results Explained

Ishwarya S
6 min read · Jun 27, 2024


Demystifying linear regression

Introduction

Hello, data explorers! Are you ready to embark on an exciting adventure in the world of Linear Regression? In this blog, we’ll explore the key assumptions behind linear regression, walk through a basic implementation with a small simulated dataset, and learn how to check and interpret the results using OLS (Ordinary Least Squares) in Python. Whether you’re a seasoned data scientist or just starting, we’ll keep it fun and intuitive. Let’s dive in!

Linear Regression Assumptions

Before diving into the implementation, it’s crucial to understand the assumptions that linear regression relies on. These assumptions ensure our model’s accuracy and reliability. Here are the key assumptions:

  1. Linearity: The relationship between the independent and dependent variables should be linear.
  2. Independence: Observations should be independent of each other.
  3. Homoscedasticity: The residuals (errors) should have constant variance at every level of the independent variable.
  4. Normality: The residuals should be approximately normally distributed.
  5. No Multicollinearity: Independent variables should not be too highly correlated with each other.
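These assumptions are not just theory; once a model is fitted (we do this in Step 3 below), most of them can be spot-checked numerically. Here's a minimal sketch, assuming a fitted statsmodels OLS result named model:

# Minimal assumption checks, assuming an OLS result `model` (see Step 3)
from statsmodels.stats.diagnostic import het_breuschpagan
from statsmodels.stats.outliers_influence import variance_inflation_factor
# Homoscedasticity: Breusch-Pagan test on the residuals
# (a small p-value suggests non-constant variance)
bp_stat, bp_pvalue, _, _ = het_breuschpagan(model.resid, model.model.exog)
print("Breusch-Pagan p-value:", bp_pvalue)
# No multicollinearity: variance inflation factor per predictor
# (only meaningful when there is more than one independent variable)
for i in range(1, model.model.exog.shape[1]):  # column 0 is the constant
    print("VIF:", variance_inflation_factor(model.model.exog, i))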

Basic Implementation with Simulated dataset

Let’s consider a use case where we have the land area in square feet and we need to predict the price of the land.

Step 1: Import Libraries and Simulate Data

import pandas as pd
import numpy as np
import statsmodels.api as sm
import matplotlib.pyplot as plt
import seaborn as sns

# Creating a small simulated dataset: land area (sq. ft.) and price
land_area = [1000, 1100, 1200, 1300, 1400, 1500, 1600, 1700, 1800, 1900]
price = [11000000, 13000000, 12000000, 15000000, 17000000, 18000000,
         18000000, 19000000, 20000000, 22000000]
data = pd.DataFrame({"land_area": land_area, "price": price})

Step 2: Exploratory Data Analysis (EDA)

Let’s perform some simple EDA to understand the data better.

Summary Statistics

print(data.describe())
Data Summary

Visualizing the Data

# Creating scatter plots to check linearity between dependent and 
# independent variables
plt.scatter(data['land_area'],data['price'])
plt.xlabel("Land Area")
plt.ylabel("Price")
plt.show()
Independent and Dependent variables are Linearly related
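Besides the scatter plot, a quick numeric check of linearity is the correlation coefficient; here's a small sketch using the data frame we created above:

# Pearson correlation between land area and price;
# values close to 1 or -1 indicate a strong linear relationship
print(data["land_area"].corr(data["price"]))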

Step 3: Fit the Model

X = data["land_area"]
y = data["price"]
# Add a constant to the independent variables
X = sm.add_constant(X)
# Fit the OLS model
model = sm.OLS(y,X).fit()
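With the model fitted, we can already generate predictions. Here's a minimal sketch for a hypothetical 2000 sq. ft. plot (the value is purely illustrative):

# Predict the price for a new land area (2000 sq. ft. is a made-up example)
new_X = pd.DataFrame({"land_area": [2000]})
new_X = sm.add_constant(new_X, has_constant="add")  # keep the intercept column
print(model.predict(new_X))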

Step 4: Interpreting OLS Results

Let’s look at the summary of our model.

# Model summary
print(model.summary())
OLS Summary result

Explaining OLS Results

The OLS summary provides a lot of information. Here are the key components explained in simple terms:

R-squared:

R-squared measures the proportion of the variability in the dependent variable that the independent variables explain. Higher values mean a better fit.

Adjusted R-squared:

R-squared never decreases when a new feature is added to the model. Adjusted R-squared corrects for this by penalizing the number of predictors, which makes it more useful when comparing models with different numbers of predictors.
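Both values can also be read straight off the fitted result, without parsing the summary text. A small sketch, assuming the model from Step 3:

# R-squared and adjusted R-squared as plain attributes of the result
print("R-squared:", model.rsquared)
print("Adjusted R-squared:", model.rsquared_adj)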

F-Statistic and Prob(F-Statistic):

The F-statistic is used in linear regression analysis to determine whether the overall regression model is a good fit for the data. It tests the null hypothesis that all regression coefficients (except the intercept) are equal to zero, i.e. that none of the independent variables has a significant linear relationship with the dependent variable. Here's a breakdown of how the F-statistic works:

Basic Concept

The F-statistic compares the model with no predictors (just the mean of the dependent variable) to the model with predictors to see if the latter explains the variation in the dependent variable significantly better.

Components of the F-Statistic

The F-statistic is calculated using two key components:

  • Mean Sum of Squares of Regression (MSR): This measures the variance explained by the regression model.
  • Mean Sum of Squares of Residuals (MSE): This measures the variance within the residuals (error term).

Formula for F-Statistic

F = MSR / MSE = (SSR / k) / (SSE / (n - k - 1))

Where:

  • SSR (Sum of Squares for Regression): the total variation explained by the regression model, i.e. ∑(predicted value − mean of the dependent variable)²
  • SSE (Sum of Squares for Error): the total variation not explained by the regression model, i.e. ∑(actual value − predicted value)²
  • k: Number of independent variables.
  • n: Number of observations.
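As a sanity check, this formula can be reproduced from the fitted result's attributes; the hand-computed value should match the F-statistic in the summary. A small sketch, assuming the model from Step 3 (note that statsmodels calls the regression sum of squares ess and the error sum of squares ssr):

# Rebuild the F-statistic from its components
msr = model.ess / model.df_model   # mean sum of squares of regression (MSR)
mse = model.ssr / model.df_resid   # mean sum of squares of residuals (MSE)
print("F computed by hand:", msr / mse)
print("F from statsmodels:", model.fvalue, "p-value:", model.f_pvalue)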

Interpreting the F-Statistic

  • High F-Statistic: A high F-value indicates that the model explains a significant portion of the variance in the dependent variable, suggesting that the regression model is a good fit.
  • Low F-Statistic: A low F-value suggests that the model does not explain the variance in the dependent variable well, indicating a poor fit.

AIC and BIC

When building linear regression models, selecting the best model among several candidates is crucial. Two common criteria for model selection are the Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC). Both AIC and BIC help in balancing the goodness-of-fit of the model with the complexity of the model. Here’s a detailed explanation of both:

AIC: A measure used to compare different models and select the one that best balances fit and complexity. It is based on the concept of entropy, which quantifies the information lost when a model is used to represent the process that generated the data. A lower AIC indicates a better model, i.e. a good fit achieved with fewer parameters.

BIC: Like AIC, BIC is used for model selection. It also penalizes the number of parameters in the model, but more heavily than AIC, and it is derived from a Bayesian perspective. As with AIC, a lower BIC indicates a better model.
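Both criteria are exposed directly on the fitted result, which makes comparing candidate models straightforward. A small sketch, assuming the model from Step 3:

# AIC and BIC of the fitted model; lower values are preferred
print("AIC:", model.aic)
print("BIC:", model.bic)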

coef

Here the coefficients for const and land_area are -4.606e+05 and 1.17e+04 respectively. So if we write Price = b0 + b1 * land_area, the fitted equation becomes:

Price = (-4.606e+05) + (1.17e+04 * land_area)
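The same numbers are available programmatically via model.params, so the fitted equation can be applied in code. A small sketch (the 2000 sq. ft. input is just an illustration):

# Extract the fitted intercept (const) and slope (land_area)
b0, b1 = model.params["const"], model.params["land_area"]
# Apply the fitted equation Price = b0 + b1 * land_area
print(b0 + b1 * 2000)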

std err

The standard error measures the precision of each coefficient estimate (not of individual predictions). The lower the standard error, the more precise the estimate.

t and P>|t|

These columns show the t-statistic and the associated p-value for each coefficient. They answer a hypothesis test: is this variable useful, i.e. does it help explain the variability in the dependent variable? A p-value below 0.05 is conventionally taken to mean the variable is significant, and in our example "land_area" is a significant predictor of "price".
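The t-statistics and p-values can also be pulled out directly. A small sketch, assuming the model from Step 3:

# t-statistics and p-values for each coefficient
print(model.tvalues)
print(model.pvalues)
# Flag predictors that are significant at the 5% level
print(model.pvalues < 0.05)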

Statistical Tests for Model Assumptions

Omnibus Test

It checks whether the residuals (errors) of the model are normally distributed. Normally distributed residuals are a key assumption in linear regression.

Prob(Omnibus)

The p-value for the Omnibus Test. If this value is low (typically less than 0.05), it suggests that the residuals are not normally distributed.

Skewness

Measures the asymmetry of the distribution of residuals. A skewness close to 0 indicates a symmetric distribution, while a positive or negative skew indicates a distribution that is not symmetric.

  • Positive Skew: Tail on the right side is longer.
  • Negative Skew: Tail on the left side is longer.

Kurtosis

  • Kurtosis: Measures the “tailedness” of the residuals’ distribution.
  • High Kurtosis: More outliers than a normal distribution (heavy tails).
  • Low Kurtosis: Fewer outliers than a normal distribution (light tails).
  • Normal Kurtosis: Close to 3.

Durbin-Watson Statistic

  • Durbin-Watson: Tests for the presence of autocorrelation (correlation between residuals). Values range from 0 to 4.
  • Value of 2: No autocorrelation.
  • Values < 2: Positive autocorrelation.
  • Values > 2: Negative autocorrelation.

Jarque-Bera Test

Tests whether the residuals are normally distributed based on skewness and kurtosis.

Prob(JB)

The p-value for the JB Test. A low value (typically less than 0.05) suggests that the residuals are not normally distributed.
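All of the diagnostics above (omnibus test, skewness, kurtosis, Durbin-Watson, Jarque-Bera) are computed from the residuals and can be reproduced outside the summary table. A minimal sketch, assuming the model from Step 3:

from statsmodels.stats.stattools import omni_normtest, jarque_bera, durbin_watson
resid = model.resid
# Omnibus normality test: returns the statistic and its p-value
print("Omnibus:", omni_normtest(resid))
# Jarque-Bera test: returns the statistic, p-value, skewness and kurtosis
print("Jarque-Bera:", jarque_bera(resid))
# Durbin-Watson statistic (values near 2 suggest little autocorrelation)
print("Durbin-Watson:", durbin_watson(resid))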

Condition Number

Measures how sensitive the regression solution (the estimated coefficients) is to small changes in the input data.

  • Low Condition Number: Suggests a stable model.
  • High Condition Number: Indicates potential multicollinearity (high correlation between independent variables) and an unstable model.
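The condition number can be computed from the design matrix itself; with a single predictor, as here, a large value mostly reflects the scale of the inputs rather than multicollinearity. A small sketch, assuming the model from Step 3:

# Condition number of the design matrix (should be close to the "Cond. No."
# reported in the summary; rescaling the inputs often lowers it)
print(np.linalg.cond(model.model.exog))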

Conclusion

In this blog, we’ve covered the essential assumptions of linear regression, implemented a model, performed simple EDA, checked for assumptions using OLS in Python, and explained how to interpret the results. Remember, we’ve just scratched the surface. There’s a lot more to learn about linear regression, and we’ll delve deeper into each aspect in upcoming blogs. Stay tuned to uncover more mysteries of statistics and machine learning!

Happy exploring and happy predicting!

References

  1. https://www.youtube.com/watch?v=NF5_btOaCig&list=PLblh5JKOoLUK0FLuzwntyYI10UQFUhsY9&index=43
  2. https://medium.com/analytics-vidhya/how-to-interpret-result-from-linear-regression-3f7ae7679ef9
  3. https://realpython.com/linear-regression-in-python/
