ml-clustering-proj

wholesale-clustering

Purpose

Classify wholesale sales data using the 8 given features, and suggest suitable sales ideas based on the results.
Use unsupervised learning: Hierarchical Clustering, DBSCAN, and K-Means Clustering.

File Directory

wholesale-clustering
├── data
│   └── Wholesale customers data.csv               # annual customer sales data (CSV)
├── wholesale-kmeans.ipynb                         # 반소희, 이서린 : K-Means Clustering
├── wholesale_hierarchical.ipynb                   # 장아연 :  Hierarchical Clustering
└── wholesale_DBScan.ipynb                         # 이희원, 최지민 : DBSCAN

Instructions

Open Colab by clicking the links below.
Set the hardware accelerator to GPU (Runtime > Change runtime type > Hardware accelerator: GPU).
Run all cells (shortcut: Cmd/Ctrl + F9).

Methods

Load data
Check missing values
- There were no missing values, so no imputation was needed.
Preprocessing
- Numeric features
  - Normalization
    - Two well-known scaling methods for numeric features are standardization and normalization.
    - Our numeric features needed to be normalized, so we applied min-max normalization with a min-max scaler.
  - Outliers
    - After normalization, we checked a boxplot of each feature and found outliers based on the IQR.
    - We replaced those outliers with the Q3 value (the 75th percentile) of each feature.
- Categorical features
  - Use one-hot encoding so the features can be used to train the machine learning models.
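
The preprocessing steps above can be sketched without any dependencies. This is a minimal illustration, not the notebooks' code: the function names are hypothetical, and the 1.5 x IQR fence is an assumption, since the text only says the outliers were found "based on IQR". `min_max_scale` computes the same per-column formula a min-max scaler applies, (x - min) / (max - min).

```python
from statistics import quantiles

def min_max_scale(values):
    """Rescale a numeric column to [0, 1] with (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]  # constant column: avoid dividing by zero
    return [(v - lo) / (hi - lo) for v in values]

def replace_outliers_with_q3(values, fence=1.5):
    """Replace values outside [Q1 - fence*IQR, Q3 + fence*IQR] with Q3.

    The fence of 1.5 is the usual boxplot convention and is an assumption here.
    """
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - fence * iqr, q3 + fence * iqr
    return [q3 if (v < lower or v > upper) else v for v in values]

def one_hot(values):
    """One-hot encode a categorical column; categories are ordered by sorting."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Toy column (made-up numbers, not taken from the dataset).
column = [120, 150, 180, 210, 9000]
scaled = min_max_scale(column)               # normalize first, as in the text
cleaned = replace_outliers_with_q3(scaled)   # then cap the outlier at Q3
encoded = one_hot([1, 2, 1, 3, 2])           # e.g. a nominal code column
```

Note that min-max scaling is itself sensitive to outliers, because a single extreme value stretches the range; the order shown (scale, then replace) follows the steps as described above.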
Find Optimal K
- Elbow Method
- Silhouette Coefficient
  - We used the two methods above to find the optimal K.
  - For k from 2 to 15, we plotted each metric to check it visually.
  - Both methods agreed on an optimal K of 6.
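
As a rough sketch of how the two criteria above are computed, the following dependency-free example runs a small K-Means for several values of k and records the inertia (the quantity plotted for the Elbow Method) and the mean Silhouette Coefficient. Everything here is illustrative: the function names are hypothetical, the farthest-first initialization is an assumption chosen to keep the run deterministic, and the toy points stand in for the real data. The notebooks scan k from 2 to 15; the toy scan is shorter.

```python
import math

def kmeans(points, k, max_iter=100):
    """Lloyd's algorithm with a deterministic farthest-first start (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    labels = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        labels = [min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points]
        # Update step: each centroid moves to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # keep the centroid of an empty cluster
        if updated == centroids:
            break
        centroids = updated
    return labels, centroids

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (the Elbow Method's y-axis)."""
    return sum(math.dist(p, centroids[lab]) ** 2 for p, lab in zip(points, labels))

def silhouette(points, labels):
    """Mean of (b - a) / max(a, b) over all points; needs at least 2 clusters."""
    scores = []
    for i, p in enumerate(points):
        sums, counts = {}, {}
        for j, q in enumerate(points):
            if j != i:
                sums[labels[j]] = sums.get(labels[j], 0.0) + math.dist(p, q)
                counts[labels[j]] = counts.get(labels[j], 0) + 1
        if labels[i] not in counts:
            scores.append(0.0)  # singleton cluster scores 0 by convention
            continue
        a = sums[labels[i]] / counts[labels[i]]
        b = min(sums[c] / counts[c] for c in counts if c != labels[i])
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

def evaluate_k(points, k_values):
    """Return {k: (inertia, silhouette)} for each candidate k."""
    results = {}
    for k in k_values:
        labels, centroids = kmeans(points, k)
        results[k] = (inertia(points, labels, centroids), silhouette(points, labels))
    return results

# Toy data: three well-separated groups of four points each.
corners = [(0, 0), (0, 1), (1, 0), (1, 1)]
toy = [(x + dx, y + dy) for dx, dy in [(0, 0), (10, 0), (0, 10)] for x, y in corners]
results = evaluate_k(toy, range(2, 6))
best_k = max(results, key=lambda k: results[k][1])  # highest silhouette wins
for k, (sse, sil) in results.items():
    print(k, round(sse, 3), round(sil, 3))
```

Inertia always shrinks as k grows, so the Elbow Method looks for the point where the drop flattens, while the Silhouette Coefficient can simply be maximized. Using both, as described above, guards against reading too much into either curve alone.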
Clustering with 6 clusters
- Use KMeans, DBSCAN, and Hierarchical clustering to cluster the data.
- Proceed with k = 6.
- There are 2 nominal features, Region and Channel, which give 6 combinations.
- These 6 combinations correspond one-to-one to the 6 cluster labels.
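
One way to check the one-to-one correspondence described above is to cross-tabulate the cluster labels against the (Channel, Region) pairs. The sketch below is only an illustration: the helper names are hypothetical, and the category codes assume Channel has 2 values and Region has 3, which is what a count of 6 combinations implies.

```python
from collections import Counter
from itertools import product

def crosstab(cluster_labels, combos):
    """Count the rows that fall into each (cluster label, combination) pair."""
    return Counter(zip(cluster_labels, combos))

def is_one_to_one(cluster_labels, combos):
    """True if every cluster holds exactly one combination and vice versa."""
    cluster_to_combo, combo_to_cluster = {}, {}
    for label, combo in zip(cluster_labels, combos):
        if cluster_to_combo.setdefault(label, combo) != combo:
            return False  # this cluster already holds a different combination
        if combo_to_cluster.setdefault(combo, label) != label:
            return False  # this combination is already split across clusters
    return True

# Hypothetical codes: 2 Channel values x 3 Region values, each pair seen twice.
combos = list(product([1, 2], [1, 2, 3])) * 2
perfect = [combos.index(c) for c in combos]  # one label per combination
mixed = perfect[:]
mixed[0] = 5                                 # move one row into another cluster

print(is_one_to_one(perfect, combos))
print(is_one_to_one(mixed, combos))
```

When the mapping is one-to-one, the cross-tabulation has exactly one non-zero cell per cluster, so `len(crosstab(...))` equals the number of clusters.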

Cluster Analysis

Six clusters were created.

[Figures: cluster plots for KMeans and Hierarchical clustering; result of each cluster; results of wholesale clustering]

Reasons for DBSCAN failure

DBSCAN is sensitive to the density of the data.
In each column, the data are concentrated in a narrow range of sales amounts (7000-12000).

Density plot
The data are packed at a high and even density.
This makes it hard to find an appropriate epsilon and MinPts.
- Because the data are so densely packed, a slight change in either hyperparameter causes a huge change in the clustering result.
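
The sensitivity described above can be reproduced on toy data. The following is a minimal, dependency-free DBSCAN sketch (not the notebook's code; the function and variable names are hypothetical) run on an evenly spaced grid, a simplified stand-in for data with high and even density. MinPts counts the point itself, which is an assumption about the convention.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # All points within eps of point i, including i itself.
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may be relabeled later as a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels

# A 5 x 5 grid with spacing 1: every point is equally close to its neighbors.
grid = [(x, y) for x in range(5) for y in range(5)]
below = dbscan(grid, eps=0.99, min_pts=3)  # eps just under the spacing
at = dbscan(grid, eps=1.0, min_pts=3)      # eps exactly at the spacing
print(set(below), set(at))
```

With eps just below the grid spacing no point has any neighbor, so everything is noise; at the spacing, every point joins a single cluster. On evenly dense data there is little middle ground between these two outcomes, which matches the difficulty reported above.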
