- Standard libraries:
  - `pickle` - implements binary protocols for serializing and de-serializing Python object structures
  - `argparse` - makes it easy to write user-friendly command-line interfaces
  - `time` - provides various time-related functions
  - `os` - provides a portable way of using operating-system-dependent functionality
- Third-party libraries:
  - `mlflow` - open-source platform for managing the end-to-end machine learning lifecycle; used in this project for experiment tracking
  - `pandas` - high-performance, easy-to-use Python data structures and data analysis tools
  - `numpy` - fundamental Python package for scientific computing
  - `matplotlib` - popular Python data visualization library
  - `seaborn` - Python data visualization library based on matplotlib
  - `scipy` - open-source Python library for scientific computing
  - `scikit-learn` - popular Python machine learning framework
  - `xgboost` - an optimized distributed gradient boosting library designed to be highly efficient, flexible and portable
  - `lightgbm` - fast distributed gradient boosting framework that uses tree-based learning algorithms
  - `hyperopt` - distributed asynchronous hyperparameter optimization library based on Bayesian optimization algorithms; for this project we use TPE (Tree-structured Parzen Estimators)
- `.venv/` - virtual environment folder
- `deployment/` - contains deployment code and artefacts
  - `data/` - folder for validation data (`.csv`); you can also provide a file path to the holdout validation dataset
  - `formats/` - contains two files:
    - `load_format.csv` - file format used to check that a user-provided file has the same CSV headers as the dataset
    - `train_format.csv` - contains the filtered columns of the dataset used for training, i.e. after dropping some features
  - `models/` - contains the pickle files for both the `lightgbm` and `xgboost` models
  - `predictions/` - where the prediction CSV files are stored after running predictions; predictions are in the format `predictions_{model_type}_{timestamp}`, e.g. `prediction_xgb_10303030393040.csv`
- `training/` - contains training code and artefacts
  - `data/` - contains the training data in CSV
  - `mlruns/` - MLflow logs from experiment runs
  - `models/` - training model artefacts folder
  - `mlflow.db` - SQLite DB for the MLflow tracking backend
  - `wallet-hub-assignment` - Jupyter notebook used for training and experimentation
- `.gitignore` - git ignore file
- `README.md` - readme file
- `requirements.txt` - third-party requirements file
Possible key improvements
- Would have used `Dask` or `cuDF` (distributed/GPU-based data structures and analysis tools) due to the large dataset; decided to stick to `pandas` since compute resources were limited
Key decisions
- Extreme-null column removal - dropped columns that have more than 55% of their values as nulls
- Extreme-zero column removal - dropped columns that have more than 79% zero values
- Fill null values - used the mean strategy to fill null values
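The three cleaning steps above can be sketched as a small `pandas` helper; the function and column names here are illustrative, not the project's actual code, and the thresholds mirror the ones stated (0.55 nulls, 0.79 zeros):

```python
import pandas as pd

def preprocess(df: pd.DataFrame, null_thresh: float = 0.55, zero_thresh: float = 0.79) -> pd.DataFrame:
    """Drop extreme-null and extreme-zero columns, then mean-fill remaining nulls."""
    # Drop columns where more than 55% of values are null
    null_frac = df.isna().mean()
    df = df.loc[:, null_frac <= null_thresh]
    # Drop columns where more than 79% of values are zero
    zero_frac = (df == 0).mean()
    df = df.loc[:, zero_frac <= zero_thresh]
    # Fill remaining nulls with the column mean (numeric columns only)
    return df.fillna(df.mean(numeric_only=True))

demo = pd.DataFrame({
    "mostly_null": [1.0, None, None, None, None],  # 80% null  -> dropped
    "mostly_zero": [0, 0, 0, 0, 1],                # 80% zeros -> dropped
    "kept":        [1.0, 2.0, None, 4.0, 5.0],     # null mean-filled with 3.0
})
clean = preprocess(demo)
```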
Possible key improvement(s)
- Experiment with other fill-null strategies, e.g. `median` or `KNN` imputation, and see if they improve the model(s)' `accuracy` or `rmse`
- Increase the threshold for null-column removal, though 0.55 (55%) seems to work fine
Key decision
- Used random forest to calculate feature importances
- Computed as the average impurity decrease across all decision trees in the forest, without making any assumption about whether the data is linearly separable
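A minimal sketch of this ranking with scikit-learn's `RandomForestRegressor` (the synthetic data and feature names are illustrative, not the project's dataset); `feature_importances_` is exactly the mean impurity decrease across the trees:

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(42)
X = pd.DataFrame(rng.normal(size=(500, 4)), columns=["f0", "f1", "f2", "noise"])
# Target depends on f0..f2 only, so "noise" should rank last
y = 3 * X["f0"] + 2 * X["f1"] + X["f2"] + rng.normal(scale=0.1, size=500)

rf = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
# feature_importances_ = mean impurity decrease per feature, normalized to sum to 1
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
```

Sorting the resulting series gives the feature ranking used to decide which columns to keep.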
Improvements
- Would have experimented with other:
  - unsupervised feature compression algorithms such as Principal Component Analysis (PCA), to see if they improve model performance - largely skipped this approach since I was using gradient-boosted tree-based models
  - supervised compression algorithms such as Linear Discriminant Analysis (LDA), and nonlinear embeddings such as t-distributed stochastic neighbor embedding (t-SNE)
Limitations of RandomForest feature importances
When two or more features are highly correlated, one feature may be ranked very highly while the information in the others is not fully captured. However, since our use case is concerned with accuracy rather than interpretability, this is not a major concern here.
Key decision - hyperparameter tuning
- Used Hyperopt for hyperparameter tuning with the Tree-structured Parzen Estimators (TPE) method. TPE is a Bayesian optimization method based on a probabilistic model that is continuously updated from past hyperparameter evaluations and their associated performance scores, instead of treating these evaluations as independent events. For more on TPE, see "Algorithms for Hyper-Parameter Optimization", Bergstra J., Bardenet R., Bengio Y., Kégl B., NeurIPS 2011, pp. 2546–2554.
Improvements
- Would have used `k-fold` instead of `hold-out` cross-validation with `hyperopt` on distributed GPUs for hyperparameter optimization
Used `mlflow` to track experiments; included as part of my submission is `experiments.csv`, exported from `mlflow`
Algorithm used and resulting metric after tuning
Accuracy - if the absolute error of a prediction is greater than 3.0, we regard the prediction as wrong, otherwise correct
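This thresholded accuracy metric can be written directly from its definition; the function name and example values are illustrative:

```python
import numpy as np

def prediction_accuracy(y_true, y_pred, tol: float = 3.0) -> float:
    """Fraction of predictions whose absolute error is within `tol`."""
    err = np.abs(np.asarray(y_true, dtype=float) - np.asarray(y_pred, dtype=float))
    return float(np.mean(err <= tol))

# errors: 1.0, 5.0, 0.5, 4.0 -> 2 of 4 within 3.0
acc = prediction_accuracy([10.0, 20.0, 30.0, 40.0], [11.0, 25.0, 30.5, 36.0])
```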
- Lasso regression - (best params) `r2_score=0.826`, `rmse=49.48`, `validation_accuracy=0.062`
- Ridge regression - (best params) `r2_score=0.826`, `rmse=49.48`, `validation_accuracy=0.061`
- XGBoost - (best params) `validation_r2_score=0.949`, `validation_rmse=26.56`, `validation_accuracy=0.147`, `test_r2_score=0.948`, `test_rmse=27.094`, `test_accuracy=0.152`
- LightGBM - (best params) `validation_r2_score=0.950`, `validation_rmse=26.37`, `validation_accuracy=0.143`, `test_r2_score=0.949`, `test_rmse=26.801`, `test_accuracy=0.142`
Decision table - XGBoost, LightGBM

| Parameters | XGBoost | LightGBM |
|---|---|---|
| Model size | 8.94 MB | 17.01 MB |
| Prediction time (100k datapoints) | 0.84 s | 3.53 s |
| Prediction accuracy (100k datapoints) | 0.22 | 0.20 |
| Validation accuracy (9k datapoints) | 0.147 | 0.143 |
| Test accuracy (10k datapoints) | 0.152 | 0.142 |
| Prediction RMSE (100k datapoints) | 17.55 | 17.82 |
| Validation RMSE (9k datapoints) | 26.56 | 26.37 |
| Test RMSE (10k datapoints) | 27.094 | 26.801 |
Key decisions
- Chose `xgboost` as the default model; on average it is a better model than `lightgbm` based on the table above
- Activate the virtual environment (`.venv`):
  - On Mac: run `source .venv/bin/activate`
  - On Windows: run `.\.venv\Scripts\activate`
- Install libraries - just in case:
  - First update pip: run `pip install --upgrade pip`
  - Then install third-party libraries: run `pip install -r requirements.txt`
- Install lightgbm:
  - On Mac: run `brew install lightgbm`
- Finally run `python deployment/predict.py --file-path {validation_file_path} --model {lgb|default=xgb}`
- Check the `deployment/predictions` folder for the resulting CSV file
  - Prediction file format: `prediction_{model_type}_{timestamp}`
- Deploy the scoring script using `prefect`
- Obtain more compute resources (`GPU`) for faster experimentation and more in-depth hyperparameter tuning
- Since accuracy is the main focus over interpretability, could increase accuracy by `Stacking` XGBoost, LightGBM and a neural network
- Could integrate automated tests using `Pytest`
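The stacking idea above could be sketched with scikit-learn's `StackingRegressor`; to keep the example self-contained, two built-in scikit-learn ensembles stand in for the XGBoost, LightGBM and neural-network base learners, and the dataset is synthetic:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor, StackingRegressor
from sklearn.linear_model import RidgeCV
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=400, n_features=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Stand-ins for the XGBoost / LightGBM / neural-network base learners;
# the final Ridge model learns how to blend their out-of-fold predictions
stack = StackingRegressor(
    estimators=[
        ("gbt", GradientBoostingRegressor(random_state=0)),
        ("rf", RandomForestRegressor(n_estimators=50, random_state=0)),
    ],
    final_estimator=RidgeCV(),
)
score = stack.fit(X_tr, y_tr).score(X_te, y_te)  # held-out R^2
```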