This project develops a pricing model for auto insurance using statistical and machine learning techniques. It covers exploratory data analysis, feature engineering, claim frequency and severity modeling, and risk-adjusted premium calculation.
- Exploratory Data Analysis (EDA) of claim frequency and severity
- Feature engineering to derive new risk factors
- Frequency modeling using Poisson, Negative Binomial, and Zero-Inflated Poisson models
- Severity modeling using log-normal regression
- Risk-adjusted premium calculation system
- Visualization of pricing impacts and risk factor analysis
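As a rough illustration of how the frequency, severity, and premium pieces listed above can fit together, here is a minimal sketch on simulated data. It is not the code from the project scripts: the column names (`young_driver`, `vehicle_age`, `exposure`, `claim_count`, `claim_amount`), the simulated coefficients, and the 25% `loading` are all assumptions made for the example. The zero-inflated Poisson fit runs only if `pscl` is installed.

```r
library(MASS)  # ships with R; provides glm.nb()

set.seed(42)
n <- 5000

# Simulated policy data; every column name here is hypothetical.
policies <- data.frame(
  young_driver = rbinom(n, 1, 0.2),               # 1 if the driver is aged 18-25
  vehicle_age  = sample(0:20, n, replace = TRUE),
  exposure     = runif(n, 0.25, 1)                # policy-years in force
)
mu <- exp(-1 + 0.8 * policies$young_driver) * policies$exposure
policies$claim_count <- rnbinom(n, size = 1, mu = mu)  # overdispersed counts

# Frequency: claim counts with log(exposure) as an offset.
freq_formula <- claim_count ~ young_driver + vehicle_age + offset(log(exposure))
freq_pois <- glm(freq_formula, family = poisson(link = "log"), data = policies)
freq_nb   <- glm.nb(freq_formula, data = policies)

# Zero-inflated Poisson with a constant zero-inflation part (needs pscl).
if (requireNamespace("pscl", quietly = TRUE)) {
  freq_zip <- pscl::zeroinfl(
    claim_count ~ young_driver + vehicle_age + offset(log(exposure)) | 1,
    data = policies, dist = "poisson"
  )
}

# Compare the Poisson and Negative Binomial fits by AIC (lower is better).
aic_table <- AIC(freq_pois, freq_nb)

# Severity: log-normal regression on policies with at least one claim.
# claim_amount is a simulated average cost per claim.
claims <- policies[policies$claim_count > 0, ]
claims$claim_amount <- rlnorm(
  nrow(claims), meanlog = 7 + 0.02 * claims$vehicle_age, sdlog = 0.8
)
sev_fit <- lm(log(claim_amount) ~ young_driver + vehicle_age, data = claims)
sigma2  <- summary(sev_fit)$sigma^2

# Price a hypothetical new policy for one full year of exposure.
new_policy    <- data.frame(young_driver = 1, vehicle_age = 8, exposure = 1)
expected_freq <- predict(freq_nb, newdata = new_policy, type = "response")
# Mean of a log-normal is exp(mu + sigma^2 / 2), not exp(mu).
expected_sev  <- exp(predict(sev_fit, newdata = new_policy) + sigma2 / 2)

pure_premium <- as.numeric(expected_freq * expected_sev)
loading      <- 0.25  # hypothetical margin for expenses and risk
premium      <- pure_premium * (1 + loading)
```

The offset keeps the frequency models on a per-policy-year basis, so partial-year policies contribute in proportion to their time in force. The `sigma2 / 2` term matters because exponentiating the fitted log-scale mean alone would give the median claim size rather than the mean.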
- R
- Libraries: data.table, dplyr, ggplot2, MASS, pscl, caret
- EDA.R: Exploratory Data Analysis
- Feature_engineering.R: Creation of new risk factors
- Frequency_model.R: Claim frequency modeling
- Severity_model.R: Claim severity modeling
- Risk_analysis.R: Risk factor analysis and premium calculation
This visualization demonstrates the relationship between different risk factors and claim frequency. Key insights:
- Cars aged 6-10 years show the highest claim frequency
- Young drivers (18-25) have markedly higher claim rates than older drivers
- Higher population density correlates with increased claim frequency
- Full-year policies show different risk patterns from partial-year policies
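The comparisons above rest on an observed claim frequency per risk group. One common way to compute it, sketched here on simulated data, is total claims divided by total exposure within each group. The helper `claim_frequency_by`, the column names, and the band boundaries are assumptions for illustration, and the random data will not reproduce the patterns described above.

```r
set.seed(1)
n <- 2000

# Simulated policies; column names are hypothetical.
policies <- data.frame(
  driver_age  = sample(18:80, n, replace = TRUE),
  vehicle_age = sample(0:20, n, replace = TRUE),
  exposure    = runif(n, 0.25, 1),   # policy-years in force
  claim_count = rpois(n, 0.1)
)

# Band the continuous risk factors (boundaries chosen for the example).
policies$driver_age_band <- cut(
  policies$driver_age,
  breaks = c(17, 25, 44, Inf), labels = c("18-25", "26-44", "45+")
)
policies$vehicle_age_band <- cut(
  policies$vehicle_age,
  breaks = c(-1, 5, 10, Inf), labels = c("0-5", "6-10", "11+")
)

# Exposure-weighted claim frequency for one grouping column.
claim_frequency_by <- function(data, factor_col) {
  totals <- aggregate(
    data[c("claim_count", "exposure")],
    by = data[factor_col], FUN = sum
  )
  totals$frequency <- totals$claim_count / totals$exposure
  totals
}

freq_by_driver  <- claim_frequency_by(policies, "driver_age_band")
freq_by_vehicle <- claim_frequency_by(policies, "vehicle_age_band")
```

Dividing by summed exposure rather than by policy count avoids understating frequency for groups with many partial-year policies.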
The heat map reveals the interaction between driver age and vehicle age groups:
- Highest risk concentration (red) appears in young drivers (18-25) with vehicles aged 6-10 years
- More experienced drivers (45+) with newer vehicles show lower claim rates (blue)
- Clear pattern of risk reduction as driver age increases
- Vehicle age has a non-linear impact on risk across different driver age groups
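A heat map like the one described above can be built from a two-way table of claim frequency. The sketch below is one possible construction on simulated data, not the project's plotting code: the column names and band boundaries are assumptions, and the `ggplot2` step runs only if that package is installed.

```r
set.seed(2)
n <- 3000

# Simulated policies; column names are hypothetical.
policies <- data.frame(
  driver_age  = sample(18:80, n, replace = TRUE),
  vehicle_age = sample(0:20, n, replace = TRUE),
  exposure    = runif(n, 0.25, 1),
  claim_count = rpois(n, 0.1)
)
policies$driver_age_band <- cut(
  policies$driver_age,
  breaks = c(17, 25, 44, Inf), labels = c("18-25", "26-44", "45+")
)
policies$vehicle_age_band <- cut(
  policies$vehicle_age,
  breaks = c(-1, 5, 10, Inf), labels = c("0-5", "6-10", "11+")
)

# Claims and exposure summed over each driver-age x vehicle-age cell.
claims_grid   <- xtabs(claim_count ~ driver_age_band + vehicle_age_band,
                       data = policies)
exposure_grid <- xtabs(exposure ~ driver_age_band + vehicle_age_band,
                       data = policies)
freq_grid <- claims_grid / exposure_grid  # claim frequency per cell

# Long format: one row per cell, ready for plotting.
grid_df <- as.data.frame(as.table(freq_grid), responseName = "frequency")

# Blue-to-red tile plot, built only when ggplot2 is available.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  heat_map <- ggplot2::ggplot(
    grid_df,
    ggplot2::aes(x = vehicle_age_band, y = driver_age_band, fill = frequency)
  ) +
    ggplot2::geom_tile() +
    ggplot2::scale_fill_gradient(low = "blue", high = "red")
}
```

Cells with little exposure produce noisy frequencies, so it can help to check `exposure_grid` before reading much into any single tile.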
Analysis of risk patterns across population density groups shows:
- Medium-High density areas have the highest risk score (2.45)
- Clear correlation between population density and risk
- Risk scores range from 1.80 to 2.45 across density groups
- The relationship is not perfectly linear, suggesting other factors influence risk in urban vs. rural areas
- 15% improvement in premium accuracy compared to the baseline model
- Improved predictive power from the engineered risk factors
- Implement Generalized Additive Models (GAMs) for non-linear relationships
- Explore machine learning approaches like Random Forests or Gradient Boosting Machines
- Conduct competitive analysis and market basket analysis
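For the GAM item above, one possible starting point is a Poisson GAM with a smooth term for driver age, fitted with `mgcv`. This is a sketch under assumptions, not a planned implementation: the data are simulated with a toy U-shaped age effect, and the column names are hypothetical.

```r
library(mgcv)  # recommended package included with most R installations

set.seed(7)
n <- 4000

# Simulated policies with a U-shaped age effect (toy assumption).
policies <- data.frame(
  driver_age = sample(18:80, n, replace = TRUE),
  exposure   = runif(n, 0.25, 1)
)
rate <- exp(-2 + ((policies$driver_age - 50) / 25)^2)
policies$claim_count <- rpois(n, rate * policies$exposure)

# s(driver_age) lets the data choose the shape of the age effect
# instead of forcing a straight line on the log scale.
gam_fit <- gam(
  claim_count ~ s(driver_age) + offset(log(exposure)),
  family = poisson(link = "log"), data = policies, method = "REML"
)

# Predicted annual claim frequency at a few ages.
age_grid <- data.frame(driver_age = c(20, 50, 80), exposure = 1)
age_grid$predicted_frequency <- as.numeric(
  predict(gam_fit, newdata = age_grid, type = "response")
)
```

Calling `plot(gam_fit)` afterwards shows the estimated smooth, which is a quick check on whether banded age groups in a GLM are hiding a non-linear effect.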