- Classify wholesale sales data using the 8 given features and propose appropriate sales ideas based on the results.
- Use unsupervised learning – Hierarchical Clustering, DBSCAN, K-Means Clustering.
├── data
│ └── Wholesale customers data.csv # annual customer sales data (CSV)
├── wholesale-kmeans.ipynb # 반소희, 이서린 : K-Means Clustering
├── wholesale_hierarchical.ipynb # 장아연 : Hierarchical Clustering
└── wholesale_DBScan.ipynb # 이희원, 최지민 : DBSCAN
- Open Colab by clicking the links below.
- Set Hardware accelerator to GPU (Runtime - Change runtime type - Hardware accelerator: GPU).
- Run all cells (shortcut: cmd/ctrl + F9).
load data
check missing values
- There were no missing values, so no handling was needed for this step.
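The loading and missing-value check above can be sketched as follows. This is a minimal, dependency-free illustration using the standard `csv` module. The helper names `load_rows` and `count_missing` are hypothetical, the column names are assumed to match the CSV header, and the three data rows are made-up values rather than rows from the real file.

```python
import csv
import io

# Made-up rows for illustration; the column names are an assumption about
# the header of "Wholesale customers data.csv".
SAMPLE_CSV = """Channel,Region,Fresh,Milk,Grocery,Frozen,Detergents_Paper,Delicassen
1,3,100,200,300,400,500,600
2,1,110,210,310,410,510,610
1,2,120,220,320,420,520,620
"""

def load_rows(handle):
    """Read a CSV file object into a list of dicts keyed by column name."""
    return list(csv.DictReader(handle))

def count_missing(rows):
    """Count empty or absent cells per column."""
    counts = {column: 0 for column in rows[0]}
    for row in rows:
        for column in counts:
            value = row.get(column)
            if value is None or value.strip() == "":
                counts[column] += 1
    return counts

rows = load_rows(io.StringIO(SAMPLE_CSV))
missing = count_missing(rows)
print(missing)
```

To run it on the real file, `load_rows` could be given `open("data/Wholesale customers data.csv", newline="")` instead of the in-memory sample.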
- numeric feature
- normalization
- For numeric features, two well-known preprocessing techniques are standardization and normalization.
- Our numeric features need to be rescaled to a common range, so we applied min-max normalization with a min-max scaler.
- outliers
- After normalization, we checked the boxplot of each feature and found some outliers based on the IQR.
- So we replaced the outliers with the Q3 value (the 75th percentile) of each feature.
- normalization
- categorical feature
- Apply one-hot encoding so that the categorical features can be used to train the machine learning model.
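The preprocessing steps above can be sketched in plain Python as below. This shows the arithmetic only and is not the notebooks' code. The function names are hypothetical, the 1.5 x IQR fences are the conventional boxplot rule and are an assumption here, and the sample values are made up.

```python
import statistics

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]

def replace_outliers_with_q3(values):
    """Replace values outside the 1.5 * IQR fences (assumed rule) with Q3."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return [q3 if (v < lower or v > upper) else v for v in values]

def one_hot(values):
    """Encode a categorical column as one 0/1 column per category."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

# Made-up numeric column with one obvious outlier (100).
fresh = [1, 2, 3, 4, 5, 6, 7, 8, 100]
normalized = min_max_normalize(fresh)
cleaned = replace_outliers_with_q3(normalized)

# Made-up categorical column with two categories.
channel = [1, 2, 2, 1]
encoded = one_hot(channel)

print(cleaned)
print(encoded)
```

The order follows the text: normalize first, then detect and replace outliers on the normalized values. Only the value above the upper fence is replaced, and all other values keep their normalized value.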
Find Optimal K
- Elbow Method
- Silhouette Coefficient
- To find the optimal K, we used the two methods listed above.
- For each k from 2 to 15, we plotted the result of each method to check it visually.
- Both methods agreed on an optimal K of 6.
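The two criteria above can be illustrated with a small, dependency-free sketch. It is not the notebooks' implementation: the k-means here is a plain Lloyd's loop with a deterministic farthest-point seeding (an assumption made for reproducibility), the function names are hypothetical, and the data are twelve made-up 2-D points in three groups, scanned over k = 2 to 5 rather than 2 to 15.

```python
import math

def farthest_point_init(points, k):
    """Deterministic seeding (assumption): start at points[0], then keep
    adding the point farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids

def kmeans(points, k, iterations=20):
    """Plain Lloyd's algorithm; returns (labels, centroids)."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(iterations):
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

def inertia(points, labels, centroids):
    """Within-cluster sum of squared distances (y-axis of the elbow plot)."""
    return sum(math.dist(p, centroids[l]) ** 2 for p, l in zip(points, labels))

def silhouette(points, labels):
    """Mean silhouette coefficient over all points."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    scores = []
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for singleton clusters
            continue
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items() if key != l
        )
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)

# Made-up data: three well-separated groups of four points each.
points = [(x + dx, y + dy)
          for x, y in [(0, 0), (10, 0), (5, 9)]
          for dx in (0, 1) for dy in (0, 1)]

results = {}
for k in range(2, 6):
    labels, centroids = kmeans(points, k)
    results[k] = (inertia(points, labels, centroids), silhouette(points, labels))
    print(k, round(results[k][0], 3), round(results[k][1], 3))

best_k = max(results, key=lambda k: results[k][1])
print("best k by silhouette:", best_k)
```

On this toy data the silhouette coefficient peaks at k = 3, the number of groups built into the points, while the inertia keeps shrinking as k grows. That is why the elbow plot is read for a bend rather than a minimum.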
Clustering with 6 clusters
- Use KMeans, DBSCAN, and Hierarchical clustering to cluster the data.
- Proceed with clustering using k = 6.
- There are 2 nominal features, Region and Channel, and together they give 6 combinations.
- We can see that those 6 combinations correspond one-to-one to the 6 cluster labels.
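One way to check the one-to-one correspondence described above is to collect the cluster labels seen for each (Channel, Region) pair. The sketch below assumes three parallel lists. The helper names and the sample values are hypothetical, not output from the notebooks.

```python
from collections import defaultdict

def combinations_to_labels(channels, regions, labels):
    """Map each (Channel, Region) pair to the set of cluster labels it received."""
    mapping = defaultdict(set)
    for channel, region, label in zip(channels, regions, labels):
        mapping[(channel, region)].add(label)
    return dict(mapping)

def is_one_to_one(mapping):
    """True if every pair has exactly one label and no label is shared."""
    if any(len(found) != 1 for found in mapping.values()):
        return False
    used = [next(iter(found)) for found in mapping.values()]
    return len(used) == len(set(used))

# Made-up example: 2 channels x 3 regions, each pair always in the same cluster.
channels = [1, 1, 1, 2, 2, 2, 1, 2]
regions = [1, 2, 3, 1, 2, 3, 1, 3]
labels = [0, 1, 2, 3, 4, 5, 0, 5]

mapping = combinations_to_labels(channels, regions, labels)
print(len(mapping), is_one_to_one(mapping))
```

If a pair were spread over two clusters, or two pairs shared a cluster, `is_one_to_one` would return `False`.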
6 clusters are created.
Figure: clusters from KMeans and Hierarchical clustering.
Result of each cluster.
Results of wholesale clustering
- DBSCAN is sensitive to density of data.
- The values in each column are concentrated in a narrow range of sales amounts (7000-12000).
Density plot: the data are concentrated at a high and even density.
- It is hard to find an appropriate epsilon and MinPts.
- Because the data are so densely packed, a slight change in a hyperparameter causes a large change in the clustering result.
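The sensitivity described above can be reproduced with a minimal DBSCAN written from scratch. This is a sketch for illustration, not the notebook's code: the `dbscan` and `summarize` helpers are hypothetical, and the data are made-up, evenly spaced one-dimensional amounts between 7000 and 12000 chosen to mimic a high, even density.

```python
import math

NOISE = -1

def dbscan(points, eps, min_pts):
    """Minimal DBSCAN; returns one label per point, -1 meaning noise."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # border point reached from a core point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

def summarize(labels):
    """Return (number of clusters, number of noise points)."""
    return len(set(labels) - {NOISE}), labels.count(NOISE)

# Made-up data: 51 amounts spaced exactly 100 apart.
points = [(float(v),) for v in range(7000, 12001, 100)]

for eps in (99, 100):
    print(eps, summarize(dbscan(points, eps=eps, min_pts=3)))
```

With `min_pts=3`, `eps=99` marks all 51 points as noise, while `eps=100` puts every point into a single cluster. When points are evenly and densely packed, there is little room between "nothing is connected" and "everything is connected".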