The HDB Resale Price System is a Python-based application that predicts resale prices in real time for a specified location. Using machine learning models such as Linear Regression and Random Forest, it forecasts future public housing prices. The data pipeline incorporates macroeconomic factors such as the Consumer Price Index (CPI), giving users a broader view of what drives property price fluctuations. Hyperparameter tuning improves the performance and generalisation of the models, and predictions are validated through testing. Together with transparent performance evaluation metrics, this helps individuals, real estate professionals, and other stakeholders make well-informed decisions about property prices. The application is hosted on Streamlit, which provides an interactive web interface.
- pandas: For data manipulation and analysis.
- numpy: For numerical computations.
- matplotlib.pyplot: For creating static, animated, and interactive visualizations.
- seaborn: For data visualization based on matplotlib.
- pywaffle: For creating waffle charts.
- joypy: For visualizing distributions of variables using Joy plots.
- statsmodels: For estimating and interpreting models for statistical analysis.
- scipy.stats: For statistical functions including spearmanr and pearsonr.
- scikit-learn: For implementing machine learning algorithms such as Linear Regression and Random Forest Regressor.
- GridSearchCV: For hyperparameter tuning of machine learning models.
- sklearn.metrics: For model evaluation metrics such as R² score and mean absolute error.
- yellowbrick.regressor: For visualization of model diagnostics.
- CooksDistance, ResidualsPlot: For identifying influential observations and plotting residuals of models.
- StandardScaler: For feature scaling.
- train_test_split: For splitting the data into training and test sets.
- joblib: For saving and loading machine learning models.
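As a rough illustration of how several of these libraries fit together, the sketch below splits a dataset, scales the features, fits a Linear Regression, and saves and reloads the model with joblib. The data is synthetic and the two features (floor area and remaining lease) are hypothetical stand-ins, not the project's actual columns. Wrapping the steps in `make_pipeline` is also an assumption made for brevity; the project may organise them differently.

```python
import tempfile
from pathlib import Path

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the HDB data (hypothetical features:
# floor area in sqm and remaining lease in years).
rng = np.random.default_rng(0)
X = rng.uniform([40, 40], [150, 99], size=(200, 2))
y = 3000 * X[:, 0] + 2000 * X[:, 1] + rng.normal(0, 10000, size=200)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Scale the features, then fit a linear model on the training split only.
model = make_pipeline(StandardScaler(), LinearRegression())
model.fit(X_train, y_train)

# Persist the fitted pipeline and load it back, as a deployed app might.
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "model.joblib"
    joblib.dump(model, path)
    restored = joblib.load(path)

# R² of the reloaded model on the held-out rows.
test_r2 = restored.score(X_test, y_test)
```

Keeping the scaler inside the saved pipeline means the reloaded model applies the same scaling at prediction time.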
In the development of our HDB Resale Price Predictor, various evaluation metrics were employed to assess the performance of the house pricing prediction models:
- R² Score: Used to measure the proportion of variance in the target variable explained by the predictors. This allowed comparison of the predictive power of different models:
  - Linear Regression (with outliers): R² Score = 0.90
  - Linear Regression (without outliers): R² Score = 0.87
  - Random Forest (Out-of-bag): R² Score = 0.966
  - Random Forest (K-fold Cross Validation): R² Score = 0.967
- Mean Absolute Error (MAE): Calculated for the Random Forest models to quantify the average magnitude of errors, providing a straightforward interpretation of the average prediction error.
- Correlation Coefficients (Spearman and Pearson): Employed to assess the relationship between predicted and actual resale prices, ensuring a thorough evaluation of model effectiveness.
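The metrics above can be computed with `sklearn.metrics` and `scipy.stats`. The following is a minimal sketch: the `evaluate` helper and the four sample prices are hypothetical and only show the calls involved, not the project's actual evaluation code or results.

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr
from sklearn.metrics import mean_absolute_error, r2_score


def evaluate(y_true, y_pred):
    """Return R², MAE, and Pearson/Spearman correlations for a set of predictions."""
    pearson_r, _ = pearsonr(y_true, y_pred)      # linear correlation
    spearman_rho, _ = spearmanr(y_true, y_pred)  # rank correlation
    return {
        "r2": r2_score(y_true, y_pred),
        "mae": mean_absolute_error(y_true, y_pred),
        "pearson": float(pearson_r),
        "spearman": float(spearman_rho),
    }


# Made-up resale prices, for illustration only.
y_true = np.array([300000.0, 420000.0, 510000.0, 650000.0])
y_pred = np.array([310000.0, 400000.0, 530000.0, 640000.0])

scores = evaluate(y_true, y_pred)
```

MAE is expressed in the same unit as the target (here, dollars), which makes it easier to interpret than R² alone.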
Hyperparameter tuning was conducted, especially for the Random Forest model, to identify the optimal parameters, such as the number of trees in the forest and the maximum depth of each tree. This tuning aimed to maximize the model's predictive performance while avoiding overfitting or underfitting.
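A tuning run of this kind could look like the sketch below, which uses `GridSearchCV` to search over the number of trees and the maximum depth. The grid values, the 3-fold setting, and the synthetic data are assumptions chosen to keep the example fast; they are not the values used in this project.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Small synthetic regression problem (illustrative only).
rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(120, 3))
y = 5 * X[:, 0] + 2 * X[:, 1] ** 2 + rng.normal(0, 0.1, size=120)

# Candidate values for the number of trees and the maximum tree depth.
param_grid = {
    "n_estimators": [20, 50],
    "max_depth": [3, None],  # None lets trees grow until leaves are pure
}

# Score every combination with 3-fold cross-validation on R².
search = GridSearchCV(
    RandomForestRegressor(random_state=42),
    param_grid,
    cv=3,
    scoring="r2",
)
search.fit(X, y)

best_params = search.best_params_  # best combination found in the grid
best_score = search.best_score_    # mean cross-validated R² for that combination
```

Because every combination is scored on held-out folds, a setting that merely overfits the training data is less likely to be selected.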
The final model chosen was the Random Forest with K-fold Cross-Validation, due to its superior predictive performance and robust evaluation methodology. This model's high R² score and strong correlation with true prices indicate its reliability and strong explanatory power for predicting HDB resale prices.
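To illustrate the two Random Forest evaluation approaches compared above, the sketch below computes an out-of-bag R² and a K-fold cross-validated R² and MAE. The data is synthetic, and the choice of 5 folds and 100 trees is an assumption, so the resulting scores are unrelated to the figures reported in this README.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic data standing in for the prepared HDB features.
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(200, 4))
y = 4 * X[:, 0] + 3 * X[:, 1] * X[:, 2] + rng.normal(0, 0.1, size=200)

# Out-of-bag estimate: each tree is scored on the rows left out of its bootstrap sample.
oob_forest = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=42)
oob_forest.fit(X, y)
oob_r2 = oob_forest.oob_score_

# K-fold estimate: refit on K-1 folds and score on the remaining fold, K times.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
forest = RandomForestRegressor(n_estimators=100, random_state=42)
cv_r2 = cross_val_score(forest, X, y, cv=cv, scoring="r2")

# scikit-learn reports MAE as a negative score, so flip the sign.
cv_mae = -cross_val_score(forest, X, y, cv=cv, scoring="neg_mean_absolute_error")

mean_cv_r2 = cv_r2.mean()
mean_cv_mae = cv_mae.mean()
```

The two estimates usually land close to each other, which is consistent with the near-identical out-of-bag and K-fold scores listed above.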
To set up this project locally:
- Clone the repository to your local machine.
- Navigate to the project directory.
- Install the required dependencies:
    pip install -r requirements.txt
- Run the Streamlit application:
    streamlit run streamlit_app.py
- Dataset source: HDB Resale Dataset
- Streamlit: Streamlit website