Clustering algorithm for mixed categorical and numerical (ordinal and non-ordinal) data using LSH (Locality-Sensitive Hashing).
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_clustering_categorical_data.ipynb
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_clustering_mixed_data_type.ipynb
https://github.com/nmtoan91/lshkrepresentatives/blob/main/notebook_sample_LSHkRepresentatives_Full.ipynb
Note 1: Unlike the k-Modes algorithm, LSH-k-Representatives defines "representatives" that keep the frequencies of all categorical values of each cluster. There are three variants of the algorithm.
Note 2: The dataset is automatically normalized if strings, disjoint (non-contiguous) values, or NaN values are detected.
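To make Note 1 concrete, here is a minimal sketch (my own illustration, not code from the library) of a frequency-based representative: for each attribute of a toy cluster, the representative keeps the relative frequency of every observed categorical value instead of a single mode.

import numpy as np
from collections import Counter

def toy_representative(cluster):
    # Illustration only: per-attribute value-frequency tables for one cluster
    cluster = np.asarray(cluster, dtype=object)
    rep = []
    for j in range(cluster.shape[1]):
        counts = Counter(cluster[:, j])
        total = sum(counts.values())
        rep.append({value: c / total for value, c in counts.items()})
    return rep

# Two objects assigned to the same cluster
print(toy_representative([['red', 'circle'], ['red', 'square']]))
# [{'red': 1.0}, {'circle': 0.5, 'square': 0.5}]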
pip install lshkrepresentatives numpy scikit-learn pandas networkx termcolor
import numpy as np
from LSHkRepresentatives.LSHkRepresentatives import LSHkRepresentatives
X = np.array([['red',0,np.nan],['green',1,1],['blue',0,0],[1,5111,1],[2,2,2],[2,6513,'rectangle'],[2,3,6565]])
#Init instance of LSHkRepresentatives
kreps = LSHkRepresentatives(n_clusters=2,n_init=5)
#Do clustering for dataset X
labels = kreps.fit(X)
# Print the labels for dataset X
print('Labels:',labels)
# Predict the label of a new instance x
x = np.array(['red',5111,0])
label = kreps.predict(x)
print(f'Cluster of object {x} is: {label}')
Example output:
SKIP LOADING distMatrix because: False bd=None
Generating disMatrix for DILCA
Saving DILCA to: saved_dist_matrices/json/DILCA_None.json
Generating LSH hash table: hbits: 2(4) k 1 d 3 n= 7
LSH time: 0.006518099999993865 Score: 6.333333333333334 Time: 0.0003226400000130525
Labels: [1 1 1 1 0 0 0]
Cluster of object [1 2 0] is: 1
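The log above mentions building an LSH hash table (hbits, k, d, n). As a rough conceptual sketch only (not the library's actual hashing scheme), the snippet below buckets categorical objects by the values of a randomly sampled attribute subset, so objects that agree on those attributes collide in the same bucket:

import numpy as np

rng = np.random.default_rng(0)

def lsh_buckets(X, n_attrs=2):
    # Toy LSH: group objects by the values of a random attribute subset
    X = np.asarray(X, dtype=object)
    attrs = rng.choice(X.shape[1], size=n_attrs, replace=False)
    buckets = {}
    for i, row in enumerate(X):
        key = tuple(row[a] for a in attrs)
        buckets.setdefault(key, []).append(i)
    return buckets

X_cat = [['red', 0, 0], ['red', 0, 1], ['blue', 1, 0], ['blue', 1, 1]]
print(lsh_buckets(X_cat))  # objects sharing the sampled attribute values share a bucket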
# Evaluate the clustering against the ground-truth labels y
y = np.array([0,0,0,0,1,1,1])
kreps.CalcScore(y)
Purity: 1.00 NMI: 1.00 ARI: 1.00 Sil: 0.59 Acc: 1.00 Recall: 1.00 Precision: 1.00
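To cross-check these scores independently, you can use scikit-learn (installed above); this is just a sketch assuming labels and y from the snippets above are still in scope:

from sklearn.metrics import normalized_mutual_info_score, adjusted_rand_score

# labels: predicted cluster labels from kreps.fit(X); y: ground-truth labels
print('NMI:', normalized_mutual_info_score(y, labels))
print('ARI:', adjusted_rand_score(y, labels))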
For example, suppose we have a dataset with 5 attributes (3 categorical and 2 numerical):
from LSHkRepresentatives.LSHkPrototypes import LSHkPrototypes
kprototypes = LSHkPrototypes(n_clusters=2,n_init=5)
X = np.array([['red',0,np.nan,1,1],
['green',1,1,0,0],
['blue',0,0,3,4],
[1,5111,1,1.1,1.2],
[2,2,2,29.0,38.9],
[2,6513,'rectangle',40,41.1],
['red',0,np.nan,30.4,30.1]])
attributeMasks = [0,0,0,1,1]
# attributeMasks = [0,0,0,1,1] means the attributes are
# [categorical, categorical, categorical, numerical, numerical]
# Run clustering with the attribute type mask and per-type weights
a = kprototypes.fit(X, attributeMasks, numerical_weight=2, categorical_weight=1)
print(a)
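For intuition on how numerical_weight and categorical_weight trade off the two attribute types, here is a simplified, hand-rolled mixed distance (an assumption for illustration, not the library's internal formula): a Hamming term over the categorical attributes plus a squared-difference term over the numerical ones, each scaled by its weight.

def mixed_distance(a, b, masks, w_cat=1.0, w_num=2.0):
    # Weighted sum of categorical mismatches and squared numerical differences
    cat = [i for i, m in enumerate(masks) if m == 0]
    num = [i for i, m in enumerate(masks) if m == 1]
    d_cat = sum(a[i] != b[i] for i in cat)                      # Hamming part
    d_num = sum((float(a[i]) - float(b[i])) ** 2 for i in num)  # squared-difference part
    return w_cat * d_cat + w_num * d_num

masks = [0, 0, 0, 1, 1]
print(mixed_distance(['red', 0, 0, 1.0, 1.0], ['red', 1, 0, 1.1, 1.2], masks))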
T. N. Mau and V.-N. Huynh, "An LSH-based k-Representatives Clustering Method for Large Categorical Data," Neurocomputing, Volume 463, 2021, Pages 29-44, ISSN 0925-2312, https://doi.org/10.1016/j.neucom.2021.08.050.
@article{mau2021lsh,
title={An LSH-based k-representatives clustering method for large categorical data},
author={Mau, Toan Nguyen and Huynh, Van-Nam},
journal={Neurocomputing},
volume={463},
pages={29--44},
year={2021},
publisher={Elsevier}
}
https://pypi.org/project/lshkrepresentatives/
https://github.com/nmtoan91/lshkrepresentatives