Predicting credit limits is crucial for financial institutions to assess risk and make informed lending decisions. This project focuses on using regression models to predict credit limits based on various customer attributes.
Before training our model, we preprocess the data to ensure its quality and suitability for analysis. This includes removing irrelevant columns, dropping duplicate rows, encoding categorical features, imputing missing values, detecting outliers, scaling, and splitting the data into training and testing sets.
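
As an illustration of these steps, here is a minimal sketch using pandas and scikit-learn. The toy data, the `CLIENTNUM` ID column, the `Credit_Limit` target name, the 80/20 split, and the choice of `StandardScaler` are assumptions made for the example, not details taken from this project. Imputation and outlier handling are left out here because they are covered separately.

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy stand-in for the customer table; all values are made up.
df = pd.DataFrame({
    "CLIENTNUM": [1, 2, 3, 4, 5, 6],  # hypothetical ID column
    "Gender": ["M", "F", "F", "M", "F", "M"],
    "Customer_Age": [45, 38, 38, 51, 29, 60],
    "Credit_Limit": [12000.0, 3500.0, 3500.0, 8000.0, 2200.0, 15000.0],
})

df = df.drop(columns=["CLIENTNUM"])                 # remove an irrelevant column
df = df.drop_duplicates()                           # drop exact duplicate rows
df["Gender"] = df["Gender"].map({"M": 0, "F": 1})   # simple categorical encoding

X = df.drop(columns=["Credit_Limit"])
y = df["Credit_Limit"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Fit the scaler on the training split only, then reuse it on the test split,
# so that no test-set statistics leak into training.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)
```

Splitting before scaling is a deliberate choice in this sketch: the scaler's mean and standard deviation come from the training rows alone.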
Missing data is a common issue in datasets and can significantly impact model performance. We employ strategies such as imputation based on mean or median values and leveraging relationships between features to fill missing data appropriately.
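
The two strategies mentioned above could look roughly like the following sketch. The data and the category labels (`"Low"`, `"High"`) are invented for illustration, and pairing `Customer_Age` with `Income_Category` is only an assumed example of a feature relationship, not necessarily the one used in this project.

```python
import numpy as np
import pandas as pd

# Toy data with gaps; values are made up.
df = pd.DataFrame({
    "Income_Category": ["Low", "Low", "High", "High", "High"],
    "Customer_Age": [30.0, np.nan, 50.0, 54.0, np.nan],
    "Total_Trans_Amt": [1000.0, 1200.0, np.nan, 5000.0, 4000.0],
})

# Strategy 1: fill a numeric column with its overall median.
overall_median = df["Total_Trans_Amt"].median()
df["Total_Trans_Amt"] = df["Total_Trans_Amt"].fillna(overall_median)

# Strategy 2: use a related feature. Each missing age is filled with the
# median age of customers in the same income category.
group_median = df.groupby("Income_Category")["Customer_Age"].transform("median")
df["Customer_Age"] = df["Customer_Age"].fillna(group_median)
```

The median is less sensitive to extreme values than the mean, which is one reason to prefer it for skewed columns such as transaction amounts.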
Outliers can skew model predictions and affect the overall accuracy. We identify and remove outliers using techniques like Local Outlier Factor (LOF) to ensure robust model training.
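
A minimal sketch of LOF-based filtering with scikit-learn's `LocalOutlierFactor` follows. The synthetic data and the values of `n_neighbors` and `contamination` are assumptions chosen for the example; the settings used in this project are not stated here.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data: one dense cluster plus two planted, far-away points.
rng = np.random.default_rng(0)
X = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
X = np.vstack([X, [[15.0, 15.0], [-12.0, 14.0]]])

# LOF compares each point's local density with that of its neighbors.
# contamination is the assumed share of outliers in the data.
lof = LocalOutlierFactor(n_neighbors=20, contamination=0.01)
labels = lof.fit_predict(X)  # 1 = inlier, -1 = outlier

X_clean = X[labels == 1]     # keep only the rows labeled as inliers
```

Because `contamination` fixes the fraction of rows that get flagged, a few borderline points from the dense cluster may be removed along with the planted ones.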
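Selecting the most relevant features is crucial for model efficiency and interpretability. We use techniques like correlation analysis and feature importance scores to select the most informative features for our regression model. The table below lists each feature's importance score alongside its variance inflation factor (VIF), a measure of how strongly the feature is collinear with the others.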
Index | Feature | Importance | VIF |
---|---|---|---|
0 | Avg_Utilization_Ratio | 0.420781 | 6.062172 |
1 | Income_Category | 0.239581 | 8.630447 |
2 | Total_Revolving_Bal | 0.163871 | 7.174995 |
3 | Card_Category | 0.061275 | 1.296993 |
4 | Total_Trans_Amt | 0.016861 | 8.821232 |
5 | Total_Amt_Chng_Q4_Q1 | 0.015222 | 16.236073 |
6 | Total_Ct_Chng_Q4_Q1 | 0.014307 | 15.424124 |
7 | Total_Trans_Ct | 0.012174 | 25.031854 |
8 | Customer_Age | 0.010757 | 76.824692 |
9 | Months_on_book | 0.010366 | 57.213571 |
10 | Total_Relationship_Count | 0.007021 | 7.831607 |
11 | Education_Level | 0.006195 | 3.251686 |
12 | Contacts_Count_12_mon | 0.005856 | 5.587047 |
13 | Dependent_count | 0.004754 | 4.209314 |
14 | Months_Inactive_12_mon | 0.004551 | 6.314962 |
15 | Marital_Status | 0.004440 | 8.258021 |
16 | Gender | 0.001987 | 4.977622 |
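
A table of this shape could be produced along the lines of the sketch below. It assumes the importance scores come from a fitted `RandomForestRegressor` and computes VIF as 1 / (1 - R²), where R² is obtained by regressing each feature on the others with an intercept. Both are assumptions: the exact method behind the values above is not stated, and a different VIF convention (for example, one without an intercept) gives different numbers. The data is synthetic.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic features: Months_on_book is built to track Customer_Age closely.
rng = np.random.default_rng(0)
n = 300
age = rng.normal(45, 8, n)
X = pd.DataFrame({
    "Customer_Age": age,
    "Months_on_book": age - 10 + rng.normal(0, 2, n),
    "Total_Trans_Ct": rng.normal(60, 20, n),
})
y = 200 * X["Total_Trans_Ct"] + rng.normal(0, 500, n)  # synthetic target

# Impurity-based importances from a fitted forest (they sum to 1).
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
importance = pd.Series(forest.feature_importances_, index=X.columns)

def vif(frame, column):
    """VIF = 1 / (1 - R^2) from regressing `column` on the other features."""
    others = frame.drop(columns=[column])
    r2 = LinearRegression().fit(others, frame[column]).score(others, frame[column])
    return 1.0 / (1.0 - r2)

vif_scores = pd.Series({col: vif(X, col) for col in X.columns})

report = pd.DataFrame({"Importance": importance, "VIF": vif_scores})
report = report.sort_values("Importance", ascending=False)
```

In this toy setup the two correlated columns receive high VIF values while contributing little importance, the same pattern that `Customer_Age` and `Months_on_book` show in the table.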
We train regression models, such as RandomForestRegressor, on the preprocessed data to predict credit limits. We tune hyperparameters and evaluate each model on both the training and testing sets.
Various regression models are trained and evaluated, including:
- Random Forest Regression
- Linear Regression
- Ridge Regression
- Polynomial Regression
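
The four model families listed above can be trained and compared with a loop like the one in this sketch. The synthetic data, the hyperparameters (`n_estimators=100`, `alpha=1.0`, `degree=2`), and the scaling step inside the ridge pipeline are assumptions for illustration, not this project's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures, StandardScaler

# Synthetic regression problem with one quadratic term.
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(400, 3))
y = 3 * X[:, 0] + X[:, 1] ** 2 + rng.normal(0, 0.1, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

models = {
    "Random Forest": RandomForestRegressor(n_estimators=100, random_state=0),
    "Linear Regression": LinearRegression(),
    "Ridge Regression": make_pipeline(StandardScaler(), Ridge(alpha=1.0)),
    # Polynomial regression = polynomial feature expansion + linear model.
    "Polynomial Regression": make_pipeline(
        PolynomialFeatures(degree=2), LinearRegression()
    ),
}

results = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    results[name] = {
        "Test MSE": mean_squared_error(y_test, pred),
        "Test R2": r2_score(y_test, pred),
    }
```

Keeping every model behind the same `fit`/`predict` interface makes it straightforward to add or swap candidates without changing the evaluation loop.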
We evaluate model performance using metrics such as Mean Squared Error (MSE) and R-squared.
- Mean Squared Error (MSE): Measures the average squared difference between the actual and predicted values.
- R-squared (R2) Score: Represents the proportion of the variance in the dependent variable that is predictable from the independent variables.
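
To make the two definitions concrete, here is a small worked example in plain Python. The function names and the four data points are made up for illustration; in practice, scikit-learn's `mean_squared_error` and `r2_score` compute the same quantities.

```python
def mse(actual, predicted):
    """Average of the squared differences between actual and predicted."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def r_squared(actual, predicted):
    """1 - (residual sum of squares / total sum of squares)."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1.0 - ss_res / ss_tot

# Hypothetical credit limits and predictions, each off by 500.
actual = [3000.0, 5000.0, 7000.0, 9000.0]
predicted = [3500.0, 4500.0, 7500.0, 8500.0]

error = mse(actual, predicted)        # 500^2 = 250000.0
score = r_squared(actual, predicted)  # 1 - 1e6 / 2e7, approximately 0.95
```

Because MSE is expressed in squared units of the target, large values are expected for credit limits. Taking the square root gives an error in the original units: a test MSE of 1.266337e+07, for example, corresponds to a root mean squared error of about 3,560.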
Model | Train MSE | Train R2 | Test MSE | Test R2 |
---|---|---|---|---|
Random Forest | 1.253357e+06 | 0.982717 | 1.266337e+07 | 0.851246 |
Polynomial Regression | 1.012446e+07 | 0.860389 | 3.010598e+07 | 0.646350 |
Linear Regression | 2.948030e+07 | 0.593483 | 3.622185e+07 | 0.574508 |
Ridge Regression | 2.948209e+07 | 0.593458 | 3.627357e+07 | 0.573901 |
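In conclusion, this project demonstrates the application of regression models to credit limit prediction. After preprocessing the data, handling missing values and outliers, and selecting informative features, Random Forest performed best, reaching a test R2 of 0.851 compared with 0.574 to 0.646 for the linear, ridge, and polynomial models. The gap between its train R2 (0.983) and test R2 suggests some overfitting, so there is still room for tuning. Models of this kind can assist financial institutions in making informed lending decisions.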
Contributions to this project are welcome! If you have any suggestions, improvements, or bug fixes, feel free to open an issue or submit a pull request.