Linear Regression is a statistical method for modeling the relationship between a numerical target variable and one or more input features by fitting a linear equation to the observed data. The fitted line then serves as a predictive model for forecasting outcomes from new inputs.
- Dependent Variable: Denoted as $Y$, the variable being predicted.
- Independent Variable(s): Denoted as $X$ or $\vec{x}$, the predictor variable(s).
- Coefficients: Denoted as $\beta_0$ (intercept) and $\beta_i$ (slopes), the parameters estimated by the model.
- Residuals: The gaps between predicted and observed values.
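Putting these terms together, the general form of the model (a standard formulation, stated here for reference) is:

$$Y = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \ldots + \beta_p X_p + \varepsilon$$

where $\varepsilon$ is the error term; the residuals are its observed counterparts once the model has been fitted.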
- Linearity: The relationship between $X$ and $Y$ is linear.
- Independence: The input features $X_i$ are independent of one another.
- Homoscedasticity: The variability of the residuals is consistent along the entire range of predictors.
- Sensitivity to Outliers: The model is sensitive to outliers during training, which can degrade performance.
- Ordinary Least Squares (OLS): Minimizes the sum of squared differences between observed and predicted values.
- Coordinate Descent: Iteratively adjusts coefficients to minimize a specified cost function.
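As a minimal illustration of the OLS idea (the toy data and the use of `np.linalg.lstsq` here are illustrative assumptions, not part of the original text), the normal-equation solution $\hat{\beta} = (X^\top X)^{-1} X^\top y$ can be computed directly with NumPy:

```python
import numpy as np

# Toy data: y is roughly 2 + 3x plus noise (illustrative assumption)
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2 + 3 * x + rng.normal(0, 1, size=x.shape)

# Design matrix with an intercept column of ones
X = np.column_stack([np.ones_like(x), x])

# OLS solution of the normal equations; lstsq is used instead of an explicit inverse
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Intercept:", beta[0], "Slope:", beta[1])
```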
- Coefficient of Determination ($R^2$): Measures the proportion of the variance in the dependent variable that is predictable from the independent variables.
- Root Mean Squared Error (RMSE): The standard deviation of the residuals, effectively measuring the average size of the error in the forecasted values.
- Sales Forecasting: To predict future sales based on advertising expenditures.
- Risk Assessment: For evaluating the potential for financial events, such as loan defaults.
- Resource Allocation: To determine the optimal usage of resources based on predicted output.
While both simple linear regression (SLR) and multiple linear regression (MLR) predict a dependent variable using one or more independent variables, their methodologies and outcomes diverge in crucial ways.
- SLR: Uses a single independent variable to predict the dependent variable.
  - Equation: $y = \beta_0 + \beta_1 x$
  - Example: Home price prediction based on area, e.g. $\text{Price} = \beta_0 + \beta_1 \cdot \text{Area}$
- MLR: Involves multiple independent variables, each with its own coefficient, in a single linear equation.
  - Equation: $y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$
  - Example: Salary prediction based on experience, education, and job type, e.g. $\text{Salary} = \beta_0 + \beta_1 \cdot \text{Experience} + \beta_2 \cdot \text{Education} + \beta_3 \cdot \text{JobType}$
- SLR: Conceptually simpler, with a straightforward relationship between the dependent and independent variable.
  - Visual: Points form a line on a scatter plot.
- MLR: Offers more nuanced predictions, since the dependent variable is influenced by multiple factors.
  - Visual: Points can be mapped in higher-dimensional spaces; not all factors may be visually represented.
- SLR: Generally easier to optimize, since only two coefficients are involved.
  - Objective: Minimize the sum of squared residuals.
- MLR: May require more sophisticated techniques to optimize and to manage the risk of overfitting.
  - Objective: Minimize the sum of squared residuals, possibly with additional regularization terms (see the objectives written out below).
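For concreteness, the two objectives can be written out explicitly; the ridge ($L_2$) penalty shown here is just one common choice of regularization term:

$$\text{OLS:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 \qquad \text{Ridge:}\quad \min_{\beta}\ \sum_{i=1}^{n}\Bigl(y_i - \beta_0 - \sum_{j=1}^{p}\beta_j x_{ij}\Bigr)^2 + \lambda \sum_{j=1}^{p}\beta_j^2$$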
- SLR: Tends to be more resistant to overfitting, as it is less prone to including unnecessary variables.
- MLR: Can be more susceptible to overfitting, stemming from the inclusion of potentially redundant or unrelated independent variables.
- SLR: Ideal when the relationship is truly linear and a single independent variable is sufficient.
  - Example: Examining the impact of study time on test scores.
- MLR: Suited to scenarios where multiple independent variables are believed to contribute collectively to the dependent variable.
  - Example: Predicting house prices using features such as area, bedrooms, and bathrooms.
Here is the Python code:
```python
from sklearn.linear_model import LinearRegression

# Dataset
X = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
y = [10, 11, 12]

# Fit the model
regressor = LinearRegression()
regressor.fit(X, y)

# Predict
new_data = [[1, 2, 3]]
prediction = regressor.predict(new_data)

# Output
print(f"Prediction: {prediction}")
```
Linear regression makes several key assumptions about the nature of the relationship between the independent and dependent variables for the model to be valid.
- Linear Relationship: There is a linear relationship between the independent variables $X$ and the dependent variable $Y$.
- Multivariate Normality: The error term $\varepsilon$ is normally distributed.
- Homoscedasticity: The variance of the residuals ($y - \hat{y}$) is the same for any value of the independent variables $X$.
- Independence of Errors: The observations are independent of each other. This is commonly checked by ensuring there is no pattern in the way the data was collected over time.
- No or Little Multicollinearity: The features should be linearly independent. High multicollinearity, i.e., linear relationships between the features, can skew the interpretation of the model coefficients.
- Homogeneity of Variance: Another name for homoscedasticity; the variance of the residuals is constant across all levels of the independent variables.
- No Autocorrelation: When dealing with time series data, the current observation should not be correlated with previous observations.
- Feature Normalization: Not strictly an assumption, but generally recommended for better convergence during optimization.
Here is the Python code:
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from sklearn.datasets import fetch_california_housing
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Getting the data (the Boston housing loader was removed from scikit-learn,
# so the California housing dataset is used here instead)
housing = fetch_california_housing()
housing_df = pd.DataFrame(housing.data, columns=housing.feature_names)

# Adding the target variable
housing_df['PRICE'] = housing.target

# Visualizing the correlation matrix
correlation_matrix = housing_df.corr().round(2)
plt.figure(figsize=(12, 8))
sns.heatmap(data=correlation_matrix, annot=True, cmap='coolwarm')
plt.show()

# Calculating multicollinearity using VIF
X = housing_df.drop('PRICE', axis=1)
variables = X.values
vif = pd.DataFrame()
vif["VIF"] = [variance_inflation_factor(variables, i) for i in range(variables.shape[1])]
vif["Features"] = X.columns

# Displaying the results
print(vif)

# Performing multiple linear regression and diagnostics
X = sm.add_constant(X)
model = sm.OLS(housing_df['PRICE'], X).fit()
print(model.summary())

# Plotting the residuals
plt.figure(figsize=(12, 8))
plt.scatter(model.fittedvalues, model.resid)
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.title('Fitted Values vs Residuals')
plt.grid()
plt.show()
```
The coefficients in a linear regression model provide insights into the strength and direction of the relationship between the input features and the target variable.
In multiple linear regression, the $j$-th coefficient $\beta_j$ represents the expected change in the target $Y$ for a one-unit increase in $X_j$, holding all other predictors constant.
- Continuous Predictor: For an increase of one unit in study hours, the model predicts a 0.45-unit increase in exam score.
- Dichotomous Predictor: If "Male" is coded as 1, the coefficient gives the predicted difference in exam score between males and females; for females (coded as 0), the prediction is given by the intercept and the remaining terms alone.
- Ordinal Predictor: If the levels are coded as consecutive integers, each one-level increase changes the predicted score by the coefficient value (e.g., 2.1 units per additional level); this implicitly assumes the levels are equally spaced.
- Nominal Predictor with K Categories: The predictor is encoded as $K-1$ dummy variables, and each category-specific coefficient represents the change in predicted $Y$ when that category is present, compared to the reference category. With "City" as the reference category, the coefficient of each other category gives the predicted difference from "City".
- Interaction Terms: For two interacting predictors such as "StudyHours" and "InternetAccess," the interaction coefficient captures how the effect of one predictor on $Y$ changes with the level of the other. With an interaction coefficient of 0.15, each additional study hour is predicted to add an extra 0.15 units of score when internet access is present.
Here is the Python code:
```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Create sample data
data = {
    'StudyHours': [5, 6, 3, 7, 4],
    'InternetAccess': [1, 1, 0, 1, 0],
    'Score': [80, 85, 60, 90, 70]
}
df = pd.DataFrame(data)

# Build the interaction term explicitly as a product feature
df['StudyHours_x_Internet'] = df['StudyHours'] * df['InternetAccess']

# Fit the linear regression model with both main effects and the interaction
features = ['StudyHours', 'InternetAccess', 'StudyHours_x_Internet']
model = LinearRegression().fit(df[features], df['Score'])

# Get the coefficients (one per feature, in the order of `features`)
coeff_study, coeff_internet, coeff_interaction = model.coef_

# Print the coefficients
print("Study Hours Coefficient:", coeff_study)
print("Internet Access Coefficient:", coeff_internet)
print("Interaction Coefficient (StudyHours * InternetAccess):", coeff_interaction)
```
The intercept term in linear regression, often denoted as $\beta_0$, plays several important roles.
- Mathematical Function Completion: The intercept ensures the linear equation is not constrained to pass through the origin. Without it, the equation reduces to $y = \beta_1 x_1 + \beta_2 x_2 + \ldots + \beta_n x_n$.
- Elimination of Multicollinearity Issues: If all predictors are zero, the intercept uniquely represents the mean or baseline of the response variable, which helps keep the predictor effects identifiable. Without an intercept in such scenarios, it would be challenging to discern the individual effects of the predictors.
- Positivity Constraints: The intercept accommodates scenarios where the response variable cannot be negative even when all predictors are zero.
- Statistical Interpretation: The intercept enables statistical interpretation of the coefficients by providing a clear reference point or base value.
In many cases, including an intercept is necessary; however, there are situations where it can be omitted.
- Rationale for Exclusion: If there is concrete domain knowledge indicating that the regression line should pass through the origin, the intercept can be removed.
- Libraries Generally Include the Intercept: Most machine learning libraries and algorithms include the intercept automatically unless specifically instructed otherwise.
Here is the code:
```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Illustrative data where the line is expected to pass through the origin
X = np.array([[1], [2], [3], [4]])
y = np.array([2, 4, 6, 8])

# Creating and fitting a linear regression model without the intercept
model_no_intercept = LinearRegression(fit_intercept=False)
model_no_intercept.fit(X, y)
```
When evaluating a Linear Regression model, it's vital to use the right metrics to assess its performance.
- Mean Absolute Error (MAE): Gives the average absolute deviation from the predicted values. It's easy to interpret but might not effectively penalize large errors.
- Mean Squared Error (MSE): Squares the deviations, giving higher weight to larger errors. It's useful for indicating the presence of outliers.
A drawback is that the error is expressed in the squared units of the target.
- Root Mean Squared Error (RMSE): Simply the square root of the MSE. It's in the same unit as the target, making it more interpretable.
- Coefficient of Determination (R-squared): Provides a measure of how well the independent variables explain the variability of the dependent variable. It ranges from 0 to 1, with 1 indicating a perfect fit:

$$R^2 = 1 - \frac{SS_{\text{res}}}{SS_{\text{tot}}}$$

where $SS_{\text{res}}$ is the sum of squared residuals and $SS_{\text{tot}}$ is the total sum of squares of the dependent variable around its mean.
- Adjusted R-squared: Accounts for the number of predictors in the model and provides more reliable results, especially with several predictors:

$$R^2_{\text{adj}} = 1 - \frac{(1 - R^2)(n - 1)}{n - p - 1}$$

where $n$ is the number of observations and $p$ is the number of predictors.
- Mean Percentage Error (MPE) and Symmetric Mean Absolute Percentage Error (SMAPE): Useful for models involving percentages or those needing to account for under- and over-predictions.
- Mean Absolute Percentage Error (MAPE): Provides relative error rates, making it useful for understanding the accuracy in practical terms.
- Akaike Information Criterion (AIC) and Bayesian Information Criterion (BIC): These help select among competing models by trading off goodness-of-fit against the number of model parameters:

$$\text{AIC} = 2k - 2\ln(\hat{L}), \qquad \text{BIC} = k\ln(n) - 2\ln(\hat{L})$$

where $k$ is the number of estimated parameters, $n$ is the sample size, and $\hat{L}$ is the maximized likelihood of the model.
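A minimal sketch of computing the core metrics with scikit-learn; the `y_true` and `y_pred` arrays here are assumed example values, and in practice they would come from a fitted model:

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Assumed example values for illustration
y_true = np.array([3.0, 5.0, 7.5, 9.0])
y_pred = np.array([2.8, 5.4, 7.0, 9.3])

mae = mean_absolute_error(y_true, y_pred)
mse = mean_squared_error(y_true, y_pred)
rmse = np.sqrt(mse)          # same units as the target
r2 = r2_score(y_true, y_pred)

print(f"MAE={mae:.3f}  MSE={mse:.3f}  RMSE={rmse:.3f}  R^2={r2:.3f}")
```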
Homoscedasticity is an essential assumption in several statistical techniques, including linear regression. It requires that the variance of the residuals remains constant across all levels of the independent variables.
Without this assumption, certain issues can arise:
- Inflated or Deflated Standard Errors: If the residual variance is not constant, the uncertainty in the estimated coefficients cannot be assessed accurately.
- Inefficient Estimates: The coefficient estimates remain unbiased but are no longer the minimum-variance (most efficient) estimates.
- Invalid Hypothesis Testing: Unreliable standard errors distort the p-values associated with the coefficients, clouding judgements about their significance.
By checking for homoscedasticity, researchers can ensure their models meet this critical assumption.
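One common way to check this assumption is to plot the residuals against the fitted values and run a Breusch-Pagan test. The sketch below assumes a fitted statsmodels OLS results object named `model`, such as the one produced in the earlier diagnostics code:

```python
import matplotlib.pyplot as plt
from statsmodels.stats.diagnostic import het_breuschpagan

# Residuals vs fitted values: a "funnel" shape suggests heteroscedasticity
plt.scatter(model.fittedvalues, model.resid, alpha=0.5)
plt.axhline(0, color='red', linestyle='--')
plt.xlabel('Fitted values')
plt.ylabel('Residuals')
plt.show()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(model.resid, model.model.exog)
print(f"Breusch-Pagan LM p-value: {lm_pvalue:.4f}")
```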
Multicollinearity refers to high intercorrelations among independent variables in a regression model. This can cause a range of problems affecting the model's interpretability and reliability.
- Inflated Standard Errors: Multicollinearity leads to inaccurate standard errors for the coefficient estimates, making it more challenging to identify true effects.
- Unstable Coefficients: Small changes in the data can lead to large changes in the coefficients, which complicates interpretation of the model.
- Biased Estimates: Coefficients can be pushed towards opposite signs, which may lead to incorrect inferences.
- Reduced Statistical Power: The model's ability to detect true effects is compromised, potentially leading to Type II errors.
- Contradictory Results: The signs of the coefficient estimates may be inconsistent with prior expectations or even exhibit "sign flipping" after minor data changes.
Several methods can be used to detect multicollinearity:
- Correlation Matrix: Visual inspection of pairwise correlations, where values close to 1 or -1 indicate high multicollinearity.
- Variance Inflation Factor (VIF): A VIF score above 5 or 10 is often taken as indicative of multicollinearity.
- Tolerance: A rule of thumb is that a tolerance value below 0.1 is problematic.
- Eigenvalues: Eigenvalues of the correlation matrix close to zero are a sign of multicollinearity.
Once detected, multicollinearity can be addressed in several ways:
- Variable Selection: Remove one of the correlated variables. Techniques such as stepwise regression or LASSO can help with this.
- Combine Variables: If appropriate, two or more correlated variables can be merged into a single factor using methods like principal component analysis or factor analysis.
- Increase the Sample Size: This can sometimes reduce the effects of multicollinearity.
- Regularization: Techniques like ridge regression explicitly handle multicollinearity (see the sketch after this list).
- Resampling Methods: Bootstrapping and cross-validation can sometimes mitigate the problems arising from multicollinearity.
- Domain Knowledge: Understanding the context of the problem and the relationships between variables can help in identifying multicollinearity.
- Balance: While multicollinearity can be problematic, its complete absence might suggest that one or more important variables are missing from the model. An ideal solution is therefore to manage its effects rather than eliminate it altogether.
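As a minimal sketch of the regularization option mentioned above (the toy data and the alpha value are illustrative assumptions, not prescriptions):

```python
import numpy as np
from sklearn.linear_model import Ridge, LinearRegression

# Toy data with two nearly identical (highly collinear) features
rng = np.random.default_rng(42)
x1 = rng.normal(size=100)
x2 = x1 + rng.normal(scale=0.01, size=100)   # almost a copy of x1
X = np.column_stack([x1, x2])
y = 3 * x1 + rng.normal(scale=0.1, size=100)

# Plain OLS coefficients become unstable under collinearity...
print("OLS coefficients:  ", LinearRegression().fit(X, y).coef_)

# ...while the ridge penalty shrinks them towards stable values
print("Ridge coefficients:", Ridge(alpha=1.0).fit(X, y).coef_)
```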
Hypothesis testing validates the significance of individual predictor (independent) variables and the model as a whole in the context of linear regression.
For an individual coefficient (tested with a t-test):
- Null Hypothesis: The coefficient of the variable is zero, i.e., the variable has no relationship with the response variable.
- Alternative Hypothesis: The coefficient of the variable is nonzero, indicating the variable contributes to the model.
Positive or negative coefficients reflect the direction of the expected relationship. Statisticians typically use a p-value threshold (e.g., 0.05) to decide whether to reject the null hypothesis.
For the model as a whole (tested with an F-test):
- Null Hypothesis: All slope coefficients are zero, indicating that none of the variables contribute to the model.
- Alternative Hypothesis: At least one slope coefficient is nonzero.
The test uses an F-statistic and its associated p-value to determine the model's overall significance; a p-value threshold of 0.05 is commonly employed.
Certain tests, such as the Goldfeld-Quandt Test and Durbin-Watson Test, provide insights into the regression model's validity.
- The Goldfeld-Quandt Test evaluates the homoscedasticity assumption.
- The Durbin-Watson Test identifies potential autocorrelation issues in the model.
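A brief sketch of pulling these statistics from a fitted statsmodels OLS results object (here assumed to be named `model`, as in the earlier diagnostics code):

```python
from statsmodels.stats.stattools import durbin_watson

# Per-coefficient t-statistics and p-values (individual significance)
print(model.tvalues)
print(model.pvalues)

# Overall F-statistic and its p-value (joint significance of all slopes)
print(model.fvalue, model.f_pvalue)

# Durbin-Watson statistic: values near 2 suggest no autocorrelation
print(durbin_watson(model.resid))
```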
Normality of residuals is a fundamental assumption of linear regression. It ensures that your model accurately reflects the relationship between variables and that results are reliable.
- Residuals: The differences between observed values and the values predicted by the model. They serve as a yardstick for assessing model performance.
- Normal Distribution: A symmetrical, bell-shaped curve characterizes a normal distribution, with the mean, median, and mode sharing the same value.
- Importance in Linear Regression: Normally distributed residuals underpin the validity of confidence intervals and significance tests on the coefficients.
- Histogram: Visualizes the distribution of the residuals; a bell-shaped curve indicates normality.
- Quantile-Quantile (Q-Q) Plot: Compares the quantiles of the residuals to those of a normal distribution. Ideally, the points align with the 45-degree line, indicating normality.
- Kolmogorov-Smirnov (K-S) Test: A statistical test comparing the distribution of the sample data to a theoretical normal distribution.
- Shapiro-Wilk Test: Another statistical test to assess whether a dataset comes from a normal distribution.
Here is the Python code:
```python
import numpy as np
import matplotlib.pyplot as plt
import scipy.stats as stats

# Generate example data
np.random.seed(0)
observed = np.random.normal(0, 1, 100)
predicted = np.random.normal(0, 1, 100)

# Calculate residuals
residuals = observed - predicted

# Visualize residuals with a histogram
plt.hist(residuals, bins=10)
plt.show()

# Q-Q plot against a normal distribution
stats.probplot(residuals, dist="norm", plot=plt)
plt.show()

# Shapiro-Wilk test
shapiro_stat, shapiro_p = stats.shapiro(residuals)
is_normal = shapiro_p > 0.05  # Assuming a 5% significance level
print(f"Shapiro-Wilk p-value: {shapiro_p:.3f}, residuals normal: {is_normal}")
```
Data preprocessing is essential for successful linear regression. This process involves steps such as loading the data, handling missing values, and feature scaling.
Ensure data correctness and completeness:
- For small datasets, manual checks are viable.
- For large datasets, algorithms can help identify inconsistencies or outliers.
Strategies to Handle Missing Data:
- Drop records or fields: Useful when missing data is minimal.
- Mean/Median/Mode Imputation: Substitute missing values with the mean, median, or mode of the feature.
- Predictive models: Use advanced ML models to predict and fill missing values.
Why Normalize or Standardize:
- It can make the learning process faster.
- It ensures that no single feature dominates the others.
Scale Features Based on Data Distribution:
- For normal distributions: StandardScaler
- For non-normal distributions: MinMaxScaler
Reason for Encoding Categorical Variables:
- Most ML algorithms are designed for numerical inputs.
- In linear regression, a categorical variable with more than two categories often becomes multiple dummy variables.
Typical feature-engineering steps include:
- Removing irrelevant features.
- Turning qualitative data into quantitative form.
- Creating new features by combining existing ones.
The primary purpose of splitting data is to assess model performance. Data is generally split into training and testing sets, often in an 80-20 or 70-30 ratio.
- VIF: A measure to identify multicollinearity; it assesses how much the variance of an estimated regression coefficient increases when the predictors are correlated.
- Correlation Matrix: Review pairwise correlations between variables, typically using the Pearson correlation coefficient.
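A minimal end-to-end preprocessing sketch with scikit-learn, assuming a DataFrame `df` with the hypothetical numeric columns 'Area' and 'Rooms', a categorical column 'City', and a target column 'Price':

```python
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

numeric_features = ['Area', 'Rooms']    # assumed column names
categorical_features = ['City']         # assumed column name

# Impute then scale numeric columns; impute then one-hot encode categoricals
preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('impute', SimpleImputer(strategy='most_frequent')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]), categorical_features),
])

model = Pipeline([('preprocess', preprocess), ('regressor', LinearRegression())])

# Split, fit, and evaluate
X = df[numeric_features + categorical_features]
y = df['Price']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model.fit(X_train, y_train)
print("R^2 on the test set:", model.score(X_test, y_test))
```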
Handling missing values in datasets is crucial for accurate and reliable modeling. Adverse consequences, such as biased coefficient estimates, reduced statistical power, and poor model generalization, can result from leaving them unaddressed in linear regression. Here are different techniques to deal with such instances.
- Complete-Case Analysis
  - Approach: Discard records with any missing values.
  - Considerations: Simple, but may lead to information loss and potential bias.
- Pairwise Deletion
  - Approach: Use the available data for each pair of variables in an analysis, ignoring missing values in other variables.
  - Considerations: Can introduce computational complexities and potentially biased results in regression.
- Mean/Median/Mode Imputation
  - Approach: Replace missing values with the mean, median, or mode of the observed data for that feature before fitting the model.
  - Considerations: Can distort statistical measures and relationships between variables; in linear regression it can especially affect the coefficients and their interpretation.
- Regression Imputation
  - Approach: Predict missing values using a regression model for the variable with missing data, using other variables as predictors.
  - Considerations: Can be computationally intensive and can lead to biased or inflated R-squared values if the relationships used in the imputation model mimic those between the independent and dependent variables of the final regression model.
- K-Nearest Neighbors (KNN) Imputation
  - Approach: For each missing value, impute the average of the K nearest data points (see the imputation sketch after this list).
  - Considerations: Parameter tuning may be necessary, and the method may not perform well in high-dimensional spaces.
- Multiple Imputation
  - Approach: Impute missing values multiple times to generate multiple complete datasets, then combine the results for robust estimates and confidence intervals.
  - Considerations: Requires statistical software and expertise, but can provide more accurate and less biased results.
- Dedicated Missing-Value Algorithms
  - Approach: Use algorithms specifically designed to handle missing data, such as missForest, MICE (Multivariate Imputation by Chained Equations), or GAIN (Generative Adversarial Imputation Networks).
  - Considerations: These often offer advanced imputation strategies and can lead to improved model performance.
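A minimal sketch of two of these strategies with scikit-learn, using a small made-up DataFrame whose column names and values are assumptions for illustration:

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer, KNNImputer

# Toy data with missing entries
df = pd.DataFrame({
    'Area': [50, 60, np.nan, 80, 90],
    'Rooms': [2, np.nan, 3, 3, 4],
})

# Mean imputation
mean_imputed = pd.DataFrame(
    SimpleImputer(strategy='mean').fit_transform(df), columns=df.columns)

# KNN imputation using the 2 nearest neighbours
knn_imputed = pd.DataFrame(
    KNNImputer(n_neighbors=2).fit_transform(df), columns=df.columns)

print(mean_imputed)
print(knn_imputed)
```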
Before training a regression model, it's important to select the most relevant features. This process improves the model's interpretability and generalization.
Use Pearson's correlation coefficient to determine linear relationships between features and the target. Generally, features with high absolute correlation values (often above 0.5 or 0.7) are selected.
```python
import pandas as pd

# Assuming 'df' is your DataFrame and 'target' is the name of the target column
corr = df.corr()
corr_with_target = corr[target].abs().sort_values(ascending=False)
```
This iterative technique adds or removes features one at a time based on a chosen criterion, such as the Bayesian Information Criterion (BIC) or Akaike Information Criterion (AIC). Scikit-learn's RFE (recursive feature elimination), shown below, is a closely related approach that repeatedly drops the least important feature according to the model's coefficients.
```python
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

linreg = LinearRegression()
selector = RFE(linreg, n_features_to_select=2)  # Choose the desired number of features
selector.fit(X, y)
selected_features = X.columns[selector.support_]
```
L1 regularization introduces a penalty term calculated as the sum of the absolute values of the feature coefficients. This encourages sparsity, effectively selecting features by setting some of their coefficients to zero.
```python
from sklearn.linear_model import Lasso

lasso = Lasso(alpha=0.1)  # Adjust alpha for the desired regularization strength
lasso.fit(X, y)
selected_features = X.columns[lasso.coef_ != 0]
```
Algorithms like Decision Trees and their ensembles, such as Random Forest and Gradient Boosting, naturally quantify feature importance during the model building process.
```python
from sklearn.ensemble import RandomForestRegressor

forest = RandomForestRegressor()
forest.fit(X_train, y_train)
feature_importance = forest.feature_importances_
```
This non-parametric method gauges the mutual dependence between two variables without necessarily assuming a linear relationship.
```python
from sklearn.feature_selection import mutual_info_regression

mi = mutual_info_regression(X, y)
```
- Multicollinearity: Avoid selecting highly correlated features, as they can introduce multicollinearity and distort coefficient interpretations.
- Overfitting: Selecting too many features can lead to overfitting, particularly with smaller sample sizes.
- Feature Relevance: In some cases it can be worthwhile to retain less important features for the sake of prediction accuracy or interpretability.
Feature scaling is a crucial step in many machine learning pipelines, including linear regression. It ensures that all input features share a consistent scale, which can be achieved through techniques such as min-max scaling, mean normalization, or standardization.
- Gradient Descent: The optimization algorithm may converge to the minimum cost faster with scaled features.
- Regularization: When features are on different scales, the regularization term can unfairly penalize certain features.
- Evaluating Coefficient Importance: Without scaling, the magnitudes of the coefficients can be misleading.
- K-Nearest Neighbors (KNN) and Support Vector Machines (SVM): These algorithms are sensitive to feature scales.
- Principal Component Analysis (PCA): Feature scaling is necessary before applying PCA so that components are not dominated by features with larger variances.
Here is the Python code:
```python
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load the diabetes dataset from sklearn
diabetes = datasets.load_diabetes()
X, y = diabetes.data, diabetes.target

# Split data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Standardize the features
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Train the model with unscaled and scaled data for comparison
reg_unscaled = LinearRegression().fit(X_train, y_train)
reg_scaled = LinearRegression().fit(X_train_scaled, y_train)

# Make predictions
y_pred_unscaled = reg_unscaled.predict(X_test)
y_pred_scaled = reg_scaled.predict(X_test_scaled)

# Evaluate the models (for plain OLS the two MSE values are essentially identical,
# since ordinary least squares predictions are unaffected by linear rescaling of the features)
mse_unscaled = mean_squared_error(y_test, y_pred_unscaled)
mse_scaled = mean_squared_error(y_test, y_pred_scaled)

print("MSE (Unscaled):", mse_unscaled)
print("MSE (Scaled):", mse_scaled)
```
To accurately evaluate the performance of a machine learning model, it's important to select independent datasets for training and testing. This is where data splitting comes in.
- Performance Check: Testing accuracy indicates how well a model will predict on future data.
- Model Generalization Check: The test set evaluates if the model can accurately predict on data not seen during training.
Without distinct training and testing datasets, models can overfit, providing poor predictions on new data.
- 80/20: Common for larger datasets.
- 70/30: For datasets with 1000 or fewer instances.
- 60/40, 90/10: Specific uses like highly imbalanced classes.
Hold-Out Split
- Process: Divide the dataset into two parts: a training set to fit the model and a test set to evaluate it.
- Use: The most common approach for simple models on larger datasets.

k-Fold Cross-Validation
- Process: Divides the data into k subsets, known as folds. Iteratively, each fold is used as the test set while the remaining folds train the model (a short sketch follows below).
- Benefits: Uses all data points for both training and testing, providing a more comprehensive evaluation.
- Drawbacks: Can be computationally expensive.

Leave-One-Out Cross-Validation (LOOCV)
- Process: The extreme case of k-fold, where each data point constitutes its own fold.
- Use Case: Suitable for small datasets.

Stratified Splitting
- Need: For datasets with class imbalance.
- Process: Ensures that each fold maintains the same proportion of classes as the full dataset.
- Benefit: Guarantees that each class is represented in every fold.

Repeated Random Subsampling (Monte Carlo Cross-Validation)
- Process: Replaces fixed cross-validation folds with random subsets drawn multiple times.
- Benefits: Useful for very large datasets.
- Drawback: Variability in results due to randomness, which can affect model assessments.
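A minimal sketch of a hold-out split and k-fold cross-validation with scikit-learn, reusing the diabetes dataset from the earlier scaling example as an assumed stand-in for real data:

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)

# Hold-out split: 80% train / 20% test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression().fit(X_train, y_train)
print("Hold-out R^2:", model.score(X_test, y_test))

# 5-fold cross-validation: every sample is used for both training and testing
cv = KFold(n_splits=5, shuffle=True, random_state=42)
scores = cross_val_score(LinearRegression(), X, y, cv=cv, scoring='r2')
print("Cross-validated R^2: %.3f +/- %.3f" % (scores.mean(), scores.std()))
```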