Wish Sales Projection: Project Overview

Created models to predict the sales of items listed on the e-commerce site Wish from listing attributes and other sales-related factors
Performed feature engineering on several columns to improve model performance
Optimized K Nearest Neighbor, Support Vector Machine, Decision Tree, Random Forest, and XGBoost models using GridSearchCV to find the best model
Utilized an image recognition API to process listing images and extract additional information

Code and Resources Used

Python Version: 3.7.6

Packages: pandas, numpy, scikit-learn, matplotlib, seaborn

Kaggle Dataset: https://www.kaggle.com/jmmvutu/summer-products-and-sales-in-ecommerce-wish

Image Categorization API: https://imagga.com/auto-categorization-demo

Data Cleaning

After acquiring the data, I needed to clean it so that it was usable for the model. I made the following changes:

Added column with the number of other listings the merchant has in the data
Cleaned various NaN and null entries within the dataset
Reduced dimensionality of color and size columns
Removed interdependent variables
Processed the tag variable using one-hot encoding
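A few of the cleaning steps above can be sketched with pandas. This is only an illustration on a toy DataFrame, not the code in data_cleaning.py: the column names (`merchant_id`, `product_color`, `tags`), the new column name `merchant_other_listings`, and the rare-value threshold of 2 are all assumptions made for the example.

```python
import pandas as pd

# Toy stand-in for the Wish listings; column names are assumptions.
df = pd.DataFrame({
    "merchant_id": ["m1", "m1", "m2", "m3", "m1"],
    "product_color": ["black", "Black", "navy blue", None, "red"],
    "tags": ["summer,dress", "summer,top", "dress", "top,casual", "summer"],
})

# Number of other listings by the same merchant (group size minus this row).
df["merchant_other_listings"] = (
    df.groupby("merchant_id")["merchant_id"].transform("count") - 1
)

# Fill missing colors, normalize case, and bucket rare values into "other"
# to reduce the number of distinct categories (threshold is hypothetical).
color = df["product_color"].fillna("unknown").str.lower()
counts = color.value_counts()
common = counts[counts >= 2].index
df["product_color"] = color.where(color.isin(common), "other")

# One-hot encode the comma-separated tags into one indicator column per tag.
tag_dummies = df["tags"].str.get_dummies(sep=",").add_prefix("tag_")
df = pd.concat([df.drop(columns="tags"), tag_dummies], axis=1)

print(df)
```

On the real data, the threshold for "rare" colors or sizes would need to be chosen from the value counts seen during EDA.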

Exploratory Data Analysis

Model Building

The dataset was split into training and test sets, with 20% held out for testing. In addition, k-fold cross-validation with k = 10 was used to estimate in-sample accuracy.
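The split and cross-validation setup could look roughly like the sketch below. It uses synthetic data from `make_classification` in place of the cleaned Wish features, and the choice of estimator and `random_state` values are assumptions for the example rather than the settings in model_building.py.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic stand-in for the cleaned feature matrix and sales classes.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hold out 20% of the rows as the test set.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# 10-fold cross-validation on the training portion, scored by accuracy.
model = RandomForestClassifier(random_state=0)
scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")
print(f"Mean CV accuracy: {scores.mean():.3f}")
```

With an integer `cv` and a classifier, scikit-learn uses stratified folds, so each fold keeps roughly the same class balance as the training set.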

At this stage, performance was evaluated based on accuracy.

Five machine learning algorithms were considered for this data:

K Nearest Neighbor - Utilized as the dataset was not extremely large, so the algorithm was not computationally expensive
Support Vector Machine - Considered due to needing a classifier
Decision Trees - Used due to the categorical nature of many independent and dependent variables
Random Forest - Ensemble method for decision trees
XGBoost - Ensemble method more optimized for performance

Model Performance

Initial Model Accuracy

Using default hyperparameters, I tested the various models considered.

K Nearest Neighbor: Train Accuracy = 46%
SVM with Linear Kernel: Train Accuracy = 40.86%
Decision Trees: Train Accuracy = 48.25%
Random Forest: Train Accuracy = 49.87%
XGBoost: Train Accuracy = 46.98%

The Random Forest performed the best of all the models tested.
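A comparison of default-hyperparameter models like this one can be sketched as a loop over estimators. The example below runs on synthetic data, so its numbers will not match the accuracies listed above. XGBoost is left out to keep the sketch limited to scikit-learn; assuming the xgboost package is installed, an `xgboost.XGBClassifier()` entry could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the cleaned training data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Candidate models with default hyperparameters (random_state fixed for repeatability).
models = {
    "K Nearest Neighbor": KNeighborsClassifier(),
    "SVM with Linear Kernel": SVC(kernel="linear"),
    "Decision Trees": DecisionTreeClassifier(random_state=0),
    "Random Forest": RandomForestClassifier(random_state=0),
}

# Mean 10-fold cross-validated accuracy for each model.
results = {
    name: cross_val_score(model, X, y, cv=10, scoring="accuracy").mean()
    for name, model in models.items()
}

# Print the models from best to worst.
for name, acc in sorted(results.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.2%}")
```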

Hyperparameter Optimization

After determining the best model to be the Random Forest classifier, I used GridSearchCV to tune the hyperparameters of the model.

I considered max depth, max features, min samples leaf, min samples split, and n estimators as the hyperparameters to be optimized.

Following GridSearchCV, the model had an accuracy of 52.70%.
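For reference, a grid search over those five hyperparameters might look like the sketch below. The candidate values in `param_grid` are placeholders chosen to keep the example fast, not the grid actually searched in this project, and the data is synthetic. The example uses `cv=3` for speed; `cv=10` would match the cross-validation setup described earlier.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the cleaned training data.
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Hypothetical candidate values for the five tuned hyperparameters.
param_grid = {
    "max_depth": [5, None],
    "max_features": ["sqrt", "log2"],
    "min_samples_leaf": [1, 3],
    "min_samples_split": [2, 5],
    "n_estimators": [25, 50],
}

# Exhaustive search over the grid, scored by cross-validated accuracy.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid,
    cv=3,
    scoring="accuracy",
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print(f"Best CV accuracy: {search.best_score_:.2%}")
```

After fitting, `search.best_estimator_` holds the model refit on all of the data passed to `fit`, which can then be scored on the held-out test set.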

