Clusters-Features : a Python module to evaluate the quality of clustering

Developed during an internship at iCube in Strasbourg, France. The full report (in French) is available here: https://drive.google.com/file/d/1kkyCruqZeHsG30pGX1YvkyuHXZfx-BOf/

This package is made for unsupervised learning. All criteria are internal validation criteria and make no use of ground-truth labels.

Official documentation : simon-bertrand.github.io/Clusters-Features/

Clusters-Features is a package that computes many operations using only the dataset and the target vector.

Data
The package provides useful data such as pairwise distances and the distances between every element and the centroid of a given cluster. You can also check the maximum and minimum distances between two elements of different clusters, or the distances between centroids. Several radii are available for each centroid and can be analysed to get a first understanding of the shape of the clusters. The distribution of the radii of each cluster is also available. All informative data is contained in the subclass Data.

Score
Approximately 40 different internal indices have been implemented in Python: Ball-Hall Index, Dunn Index, Generalized Dunn Indexes (18 indexes), C Index, Banfeld-Raftery Index, Davies-Bouldin Index, Calinski-Harabasz Index, Ray-Turi Index, Xie-Beni Index, Ratkowsky Lance Index, SD Index, Mclain Rao Index, Scott-Symons Index, PBM Index, Point biserial Index, Det Ratio Index, Log SumSquare Ratio Index, Wemmert-Gançarski Index. Two systems are used to generate these indices: the first caches each computed index and the second computes them directly. All these scores are defined in the main reference, which is given in the Scores section.

Confusion Hypersphere
Clusters-Features also provides a deep analysis of the multidimensional space. Confusion Hypersphere counts the number of elements contained in several hyperspheres centered on different positions and with different radii. This feature allows users to understand which clusters are confused (in the sense of the Euclidean norm) with other clusters. These indicators make it possible to determine which clusters are the most separated from the others. The approach is best suited to convex clusters, since a hypersphere is convex.

Info
Info provides two kinds of boards: the clusters board, which gives information about each cluster, and the general board, which gives information at the scale of the whole dataset.

Density
This section uses a meshgrid to estimate a density by summing n-dimensional (n=2 or n=3) Gaussian distributions centred on each point of the dataset. The minimum contour is set to a given percentile of the density: if the percentile is 99%, only the highest 1% of density values are retained. This works on a 2D or 3D grid, but it quickly becomes limited by the large number of combinations needed to generate an n-dimensional grid.
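
The idea can be sketched in plain Python for the 2D case. This is a minimal, dependency-free illustration and not the package's implementation: the function names gaussian_density_grid and threshold_at_percentile, the bandwidth sigma, the toy points and the nearest-rank percentile rule are all assumptions made for the example.

```python
import math

def gaussian_density_grid(points, xs, ys, sigma=1.0):
    """Sum an isotropic 2D Gaussian centred on each point over a grid.

    Returns grid[i][j], the density at (xs[j], ys[i]).
    """
    norm = 1.0 / (2.0 * math.pi * sigma ** 2)
    grid = []
    for y in ys:
        row = []
        for x in xs:
            total = 0.0
            for px, py in points:
                sq_dist = (x - px) ** 2 + (y - py) ** 2
                total += norm * math.exp(-sq_dist / (2.0 * sigma ** 2))
            row.append(total)
        grid.append(row)
    return grid

def threshold_at_percentile(grid, percentile):
    """Zero out every cell whose density is below the given percentile."""
    values = sorted(v for row in grid for v in row)
    # Nearest-rank percentile (an assumption; other conventions exist).
    rank = max(0, math.ceil(percentile / 100.0 * len(values)) - 1)
    cutoff = values[rank]
    return [[v if v >= cutoff else 0.0 for v in row] for row in grid]

# Toy data: two tight groups of points (hypothetical values).
points = [(0.0, 0.0), (0.2, 0.1), (3.0, 3.0), (3.1, 2.9)]
axis = [i * 0.5 for i in range(-2, 9)]  # 11 values from -1.0 to 4.0
density = gaussian_density_grid(points, axis, axis, sigma=0.5)
kept = threshold_at_percentile(density, 90)  # keeps roughly the top 10% of cells
```

The grid has len(xs) * len(ys) cells in 2D and one more factor per extra dimension, which is why the cost grows so quickly with n.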

Utils
Applies external packages and utilities to the current dataset.

Graph (Optional)
Graph allows users to plot a few kinds of data generated by Clusters-Features. As Plotly is used for plotting, this section is optional for users who only need the data and matrices in order to plot them with their own module. To disable this section, go to settings.py, set the variable "Activated_Graph" to False and then rebuild the package using setuptools. The requirements in requirements.txt are generated according to these settings.

Dependencies

Native dependencies :

Optional dependencies (errors may occur if a method relying on one of these dependencies is called without the corresponding library installed):

Graph : Plotly
Utils : umap-learn, Numba, statsmodels

Graph depends on Utils to work correctly, but the converse is not true.
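
Before calling a method from an optional section, it can help to check which optional libraries are importable. The sketch below uses only the standard library; the helper name missing_optional_modules and the OPTIONAL_MODULES mapping are hypothetical and not part of Clusters-Features. The import names are those of the libraries listed above (umap-learn is imported as umap).

```python
from importlib.util import find_spec

# Import names of the optional dependencies listed above
# (umap-learn is imported as "umap").
OPTIONAL_MODULES = {
    "Graph": ["plotly"],
    "Utils": ["umap", "numba", "statsmodels"],
}

def missing_optional_modules():
    """Return, per section, the optional modules that cannot be imported."""
    return {
        section: [name for name in names if find_spec(name) is None]
        for section, names in OPTIONAL_MODULES.items()
    }

missing = missing_optional_modules()
for section, names in missing.items():
    if names:
        print(f"{section}: missing {', '.join(names)}")
```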

Command Line Interface

This package provides a command line interface, available by running this command:

python3 ./clustersfeatures-cli.py -h

The documentation for the CLI is contained in the script. Use the --help argument to see what it does.

Import the module

from ClustersFeatures import *

Load an example data set

Here we choose the scikit-learn digits data set because it is high-dimensional (64 features) and has a large number of observations.

from sklearn.datasets import load_digits
import pandas as pd
digits = load_digits()
pd_df=pd.DataFrame(digits.data)
pd_df['target'] = digits.target
pd_df

	0	1	2	3	4	5	6	7	8	9	...	55	56	57	58	59	60	61	62	63	target
0	0.0	0.0	5.0	13.0	9.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	6.0	13.0	10.0	0.0	0.0	0.0	0
1	0.0	0.0	0.0	12.0	13.0	5.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	11.0	16.0	10.0	0.0	0.0	1
2	0.0	0.0	0.0	4.0	15.0	12.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	3.0	11.0	16.0	9.0	0.0	2
3	0.0	0.0	7.0	15.0	13.0	1.0	0.0	0.0	0.0	8.0	...	0.0	0.0	0.0	7.0	13.0	13.0	9.0	0.0	0.0	3
4	0.0	0.0	0.0	1.0	11.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	2.0	16.0	4.0	0.0	0.0	4
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1792	0.0	0.0	4.0	10.0	13.0	6.0	0.0	0.0	0.0	1.0	...	0.0	0.0	0.0	2.0	14.0	15.0	9.0	0.0	0.0	9
1793	0.0	0.0	6.0	16.0	13.0	11.0	1.0	0.0	0.0	0.0	...	0.0	0.0	0.0	6.0	16.0	14.0	6.0	0.0	0.0	0
1794	0.0	0.0	1.0	11.0	15.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	2.0	9.0	13.0	6.0	0.0	0.0	8
1795	0.0	0.0	2.0	10.0	7.0	0.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	5.0	12.0	16.0	12.0	0.0	0.0	9
1796	0.0	0.0	10.0	14.0	8.0	1.0	0.0	0.0	0.0	2.0	...	0.0	0.0	1.0	8.0	12.0	14.0	12.0	1.0	0.0	8

1797 rows × 65 columns

The important point is that the "pd_df" dataframe passed as the first argument must already contain the target vector as one of its columns. Then specify, as the second argument, the name of the column that holds the target. The program separates features and target automatically:

CC=ClustersCharacteristics(pd_df,label_target="target")

Data tools

The ClustersCharacteristics object creates attributes that describe the clusters. For example, the barycenter of the dataset:

CC.data_barycenter

0      0.000000
1      0.303840
2      5.204786
3     11.835838
4     11.848080
          ...    
59    12.089037
60    11.809126
61     6.764051
62     2.067891
63     0.364496
Length: 64, dtype: float64

Centroids are also available: column j of the following matrix holds the coordinates of the centroid of cluster j.

CC.data_centroids

target	0	1	2	3	4	5	6	7	8	9
0	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000	0.000000
1	0.022472	0.010989	0.932203	0.644809	0.000000	0.967033	0.000000	0.167598	0.143678	0.144444
2	4.185393	2.456044	9.666667	8.387978	0.453039	9.983516	1.138122	5.100559	5.022989	5.683333
3	13.095506	9.208791	14.186441	14.169399	7.055249	13.038462	11.165746	13.061453	11.603448	11.833333
4	11.297753	10.406593	9.627119	14.224044	11.497238	13.895604	9.585635	14.245810	12.402299	11.255556
...	...	...	...	...	...	...	...	...	...	...
59	13.561798	9.137363	13.966102	14.650273	7.812155	14.736264	10.685083	11.659218	12.695402	12.044444
60	13.325843	13.027473	13.118644	13.972678	11.812155	9.362637	15.093923	2.206704	13.011494	13.144444
61	5.438202	8.576923	11.796610	8.672131	1.955801	2.532967	13.044199	0.011173	6.735632	8.894444
62	0.275281	3.049451	8.022599	1.409836	0.000000	0.197802	4.480663	0.000000	1.206897	2.094444
63	0.000000	1.494505	1.932203	0.065574	0.000000	0.000000	0.093923	0.000000	0.011494	0.055556

64 rows × 10 columns
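
As a reminder of what these two quantities represent, here is a minimal sketch in plain Python, assuming the usual definitions: the barycenter is the mean of all observations and each centroid is the mean of the observations of one cluster. The helper names and the toy data are hypothetical and only serve the illustration.

```python
from statistics import fmean

def column_means(rows):
    """Mean of each coordinate over a list of equal-length points."""
    return [fmean(col) for col in zip(*rows)]

def centroids_by_label(rows, labels):
    """Map each cluster label to the mean of the points carrying it."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return {label: column_means(members) for label, members in groups.items()}

# Toy dataset: four 2D points in two clusters (hypothetical values).
rows = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0], [12.0, 8.0]]
labels = [0, 0, 1, 1]

barycenter = column_means(rows)               # [6.0, 3.0]
centroids = centroids_by_label(rows, labels)  # {0: [1.0, 0.0], 1: [11.0, 6.0]}
```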

We can show the list of cluster labels:

CC.labels_clusters

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

We can also retrieve the data sharing the same target label. For example, here we take the first cluster label of the list above.

Cluster=CC.labels_clusters[0]
CC.data_clusters[Cluster]

	0	1	2	3	4	5	6	7	8	9	...	54	55	56	57	58	59	60	61	62	63
0	0.0	0.0	5.0	13.0	9.0	1.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	6.0	13.0	10.0	0.0	0.0	0.0
10	0.0	0.0	1.0	9.0	15.0	11.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	1.0	10.0	13.0	3.0	0.0	0.0
20	0.0	0.0	3.0	13.0	11.0	7.0	0.0	0.0	0.0	0.0	...	1.0	0.0	0.0	0.0	2.0	12.0	13.0	4.0	0.0	0.0
30	0.0	0.0	10.0	14.0	11.0	3.0	0.0	0.0	0.0	4.0	...	0.0	0.0	0.0	0.0	11.0	16.0	12.0	3.0	0.0	0.0
36	0.0	0.0	6.0	14.0	10.0	2.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	7.0	16.0	11.0	1.0	0.0	0.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1739	0.0	0.0	10.0	11.0	7.0	0.0	0.0	0.0	0.0	4.0	...	1.0	0.0	0.0	0.0	7.0	12.0	8.0	0.0	0.0	0.0
1745	0.0	0.0	7.0	14.0	8.0	4.0	0.0	0.0	0.0	0.0	...	0.0	0.0	0.0	0.0	6.0	13.0	7.0	0.0	0.0	0.0
1746	0.0	0.0	9.0	15.0	6.0	0.0	0.0	0.0	0.0	2.0	...	1.0	0.0	0.0	0.0	8.0	15.0	11.0	4.0	0.0	0.0
1768	0.0	0.0	5.0	16.0	10.0	0.0	0.0	0.0	0.0	0.0	...	5.0	0.0	0.0	0.0	4.0	15.0	16.0	8.0	1.0	0.0
1793	0.0	0.0	6.0	16.0	13.0	11.0	1.0	0.0	0.0	0.0	...	1.0	0.0	0.0	0.0	6.0	16.0	14.0	6.0	0.0	0.0

178 rows × 64 columns

Users can get a pairwise distance matrix generated by the SciPy library (fast). If (x_{i,j}) is the returned matrix, then x_{i,j} is the distance between the element of index i and the element of index j. The matrix is symmetric because the Euclidean norm is used to evaluate distances.

CC.data_every_element_distance_to_every_element

	0	1	2	3	4	5	6	7	8	9	...	1787	1788	1789	1790	1791	1792	1793	1794	1795	1796
0	0.000000	59.556696	54.129474	47.571000	50.338852	43.908997	48.559242	56.000000	44.395946	40.804412	...	39.874804	49.749372	52.640289	51.458721	49.989999	36.249138	26.627054	50.378567	37.067506	47.031904
1	59.556696	0.000000	41.629317	45.475268	47.906158	47.127487	40.286474	50.960769	48.620983	52.820451	...	52.009614	48.969378	42.965102	32.572995	47.707442	51.390661	59.177699	38.587563	48.569538	50.328918
2	54.129474	41.629317	0.000000	53.953684	52.096065	55.443665	45.650849	49.335586	42.602817	54.836119	...	59.076222	47.927028	46.335731	39.191836	46.936127	51.826634	52.009614	38.340579	50.774009	43.954522
3	47.571000	45.475268	53.953684	0.000000	51.215232	33.660065	47.254629	56.824291	42.449971	45.166359	...	37.934153	55.569776	50.099900	43.988635	58.566202	40.286474	55.551778	49.527770	44.147480	41.267421
4	50.338852	47.906158	52.096065	51.215232	0.000000	54.147945	36.959437	59.481089	52.507142	55.054518	...	48.620983	26.172505	55.794265	48.723711	31.416556	53.981478	51.449004	46.882833	52.668776	50.970580
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
1792	36.249138	51.390661	51.826634	40.286474	53.981478	29.325757	52.191953	55.605755	40.037482	36.262929	...	31.749016	54.543561	55.758407	48.083261	55.488738	0.000000	41.940434	46.151923	23.537205	40.963398
1793	26.627054	59.177699	52.009614	55.551778	51.449004	49.325450	45.354162	60.456596	48.041649	47.265209	...	43.416587	45.912961	53.272882	52.449976	46.324939	41.940434	0.000000	46.957428	42.438190	46.465041
1794	50.378567	38.587563	38.340579	49.527770	46.882833	46.904158	33.466401	54.516053	34.885527	49.929951	...	45.077711	46.421978	33.896903	29.189039	42.602817	46.151923	46.957428	0.000000	44.158804	28.879058
1795	37.067506	48.569538	50.774009	44.147480	52.668776	32.557641	48.207883	55.928526	37.000000	28.827071	...	38.183766	50.507425	54.359912	47.265209	48.754487	23.537205	42.438190	44.158804	0.000000	39.420807
1796	47.031904	50.328918	43.954522	41.267421	50.970580	38.496753	40.224371	56.267220	28.337255	40.926764	...	38.288379	50.941143	38.820098	38.600518	49.223978	40.963398	46.465041	28.879058	39.420807	0.000000

1797 rows × 1797 columns
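
The structure of such a matrix can be sketched without any dependency. The function below is a slow, illustrative stand-in written for this README, not the package's code; in SciPy, scipy.spatial.distance.pdist combined with squareform builds the same kind of matrix much faster.

```python
import math

def pairwise_distances(rows):
    """Full symmetric matrix of Euclidean distances between all points."""
    n = len(rows)
    dist = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = math.dist(rows[i], rows[j])
            dist[i][j] = d
            dist[j][i] = d  # symmetry: d(i, j) == d(j, i)
    return dist

# Toy data (hypothetical values).
rows = [[0.0, 0.0], [3.0, 4.0], [6.0, 8.0]]
dist = pairwise_distances(rows)
# dist[0][1] is 5.0 and dist[0][2] is 10.0; the diagonal is 0.0
```

Only the upper triangle is computed; the lower triangle is filled by symmetry, which halves the number of distance evaluations.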

Although centroids are not elements of the dataset, we can also compute the distance between each element and each centroid.

CC.data_every_element_distance_to_centroids

	0	1	2	3	4	5	6	7	8	9
0	14.013361	47.567376	43.896678	39.554151	40.407399	36.647929	41.599287	43.074401	37.369109	32.423583
1	54.059820	19.017525	38.701490	42.313696	38.269485	42.273369	44.388144	40.861554	33.800663	44.312148
2	47.757029	32.206345	37.375370	45.438311	43.187064	50.233787	43.272912	41.584089	33.846710	45.754658
3	44.250476	36.468356	33.283540	22.386098	48.069136	36.198086	41.894212	45.404349	36.063988	31.605201
4	45.592148	39.322928	52.408033	51.138040	28.340976	48.653228	39.984571	48.264247	44.208386	48.142841
...	...	...	...	...	...	...	...	...	...	...
1792	34.293071	41.151239	43.677849	30.575459	45.769008	36.516206	46.428263	42.576472	33.327387	16.959423
1793	20.429465	48.646926	47.491512	45.995613	40.876460	40.516369	41.939685	46.740119	40.285358	40.804954
1794	44.631741	29.885611	38.886808	43.396579	36.316532	42.594489	37.825320	42.725794	25.598846	42.437926
1795	34.565247	39.389382	43.806621	35.557630	41.311856	38.202818	41.673728	42.664164	32.926630	25.207579
1796	41.031409	37.724803	37.444086	36.772758	42.390657	40.799218	33.921312	45.775584	28.071988	35.917805

1797 rows × 10 columns

It is possible to generate a matrix of intercentroid distances. If (x_{i,j}) is the returned matrix, then x_{i,j} is the distance between the centroid of cluster i and the centroid of cluster j. These distances are between centroids, not between points of the dataset. NaN is placed on the diagonal to make min/max operations easier.

	0	1	2	3	4	5	6	7	8	9
0	nan	42.026024	39.274919	37.062579	35.981220	34.078029	34.274506	41.772576	32.909593	29.617374
1	42.026024	nan	28.949723	31.742287	28.674700	32.469295	34.570287	31.187817	20.950348	32.126942
2	39.274919	28.949723	nan	26.489600	42.689686	32.375712	36.657425	35.570382	25.605848	32.960968
3	37.062579	31.742287	26.489600	nan	43.499594	29.822474	41.152654	33.369483	25.511462	21.103269
4	35.981220	28.674700	42.689686	43.499594	nan	35.577158	30.756650	33.444921	31.858925	38.689544
5	34.078029	32.469295	32.375712	29.822474	35.577158	nan	35.573804	32.098017	25.867262	28.060732
6	34.274506	34.570287	36.657425	41.152654	30.756650	35.573804	nan	43.514148	31.227114	39.306699
7	41.772576	31.187817	35.570382	33.369483	33.444921	32.098017	43.514148	nan	27.364089	33.513179
8	32.909593	20.950348	25.605848	25.511462	31.858925	25.867262	31.227114	27.364089	nan	24.630553
9	29.617374	32.126942	32.960968	21.103269	38.689544	28.060732	39.306699	33.513179	24.630553	nan
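
The role of the NaN diagonal can be illustrated with a small sketch. With zeros on the diagonal, the minimum of the matrix would always be 0; with NaN, the diagonal can be skipped (pandas reductions such as min skip NaN by default). The helper names and the toy centroids below are hypothetical.

```python
import math

def intercentroid_distances(centroids):
    """Distance between every pair of centroids, with NaN on the diagonal."""
    labels = sorted(centroids)
    return {
        a: {
            b: math.nan if a == b else math.dist(centroids[a], centroids[b])
            for b in labels
        }
        for a in labels
    }

def closest_pair(matrix):
    """Smallest off-diagonal distance, skipping the NaN entries."""
    best = None
    for a, row in matrix.items():
        for b, d in row.items():
            if math.isnan(d):
                continue
            if best is None or d < best[0]:
                best = (d, a, b)
    return best

# Toy centroids (hypothetical values).
centroids = {0: [0.0, 0.0], 1: [3.0, 4.0], 2: [0.0, 1.0]}
matrix = intercentroid_distances(centroids)
distance, a, b = closest_pair(matrix)  # (1.0, 0, 2)
```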

Scores

Many indices, known as internal cluster validation indices, allow users to evaluate the quality of clusters. Some Python libraries compute such scores, but their coverage is incomplete. In this library, the following quantities have been implemented:

Total dispersion matrix
Within cluster dispersion matrices
Between group dispersion matrix
Total sum square
Pooled within cluster dispersion
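
These quantities are linked by a standard identity: the total sum of squares (TSS) equals the within-group sum of squares (WGSS) plus the between-group sum of squares (BGSS), which are the traces of the corresponding dispersion matrices. The sketch below checks this on toy data in plain Python; the function names are hypothetical and the code is not taken from the package.

```python
from statistics import fmean

def mean_point(rows):
    """Mean of each coordinate over a list of points."""
    return [fmean(col) for col in zip(*rows)]

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def sums_of_squares(rows, labels):
    """Return (TSS, WGSS, BGSS) for a labelled dataset."""
    barycenter = mean_point(rows)
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)

    tss = sum(sq_dist(row, barycenter) for row in rows)
    wgss = 0.0
    bgss = 0.0
    for members in clusters.values():
        centroid = mean_point(members)
        wgss += sum(sq_dist(row, centroid) for row in members)
        bgss += len(members) * sq_dist(centroid, barycenter)
    return tss, wgss, bgss

# Toy dataset (hypothetical values).
rows = [[0.0, 0.0], [2.0, 0.0], [10.0, 4.0], [12.0, 8.0]]
labels = [0, 0, 1, 1]
tss, wgss, bgss = sums_of_squares(rows, labels)
# tss == 148.0, wgss == 12.0, bgss == 136.0, so tss == wgss + bgss
```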

The implemented indexes are :

Ball-Hall Index
Dunn Index
Generalized Dunn Indexes (18 indexes)
C Index
Banfeld-Raftery Index
Davies-Bouldin Index
Calinski-Harabasz Index
Ray-Turi Index
Xie-Beni Index
Ratkowsky Lance Index
SD Index
Mclain Rao Index
Scott-Symons Index
PBM Index
Point biserial Index
Det Ratio Index
Log SumSquare Ratio Index
Silhouette Index (computed with scikit-learn)
Wemmert-Gançarski Index (thanks to M. Gançarski for this internship)

Main reference for all these scores :

Clustering Indices

Bernard Desgraupes, University Paris Ouest - Lab Modal’X , November 2017

https://cran.r-project.org/web/packages/clusterCrit/vignettes/clusterCrit.pdf

In this library, there are two ways to calculate these scores: using IndexCore, which automatically caches the indexes already calculated, or calling the "score_index_" methods directly. The second way recomputes an index every time it is requested, which can be very slow because some of these indexes have a very high computational complexity.
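
The difference between the two approaches can be illustrated with a generic caching pattern. This is only a sketch of the idea: the class CachedScores and the function expensive_score are hypothetical and do not reflect how IndexCore is actually written.

```python
class CachedScores:
    """Compute each named score at most once and reuse the stored value."""

    def __init__(self, score_functions):
        self._score_functions = score_functions  # name -> zero-argument callable
        self._cache = {}
        self.calls = 0  # number of real computations, for illustration

    def get(self, name):
        if name not in self._cache:
            self.calls += 1
            self._cache[name] = self._score_functions[name]()
        return self._cache[name]

def expensive_score():
    # Stand-in for a costly index computation.
    return sum(i * i for i in range(1000))

scores = CachedScores({"demo": expensive_score})
first = scores.get("demo")
second = scores.get("demo")  # served from the cache, no recomputation
```

After the two calls, scores.calls is 1: the second request costs a dictionary lookup instead of a full computation.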

First method using IndexCore (faster)

CC.compute_every_index()

{'general': {'max': {'Between-group total dispersion': 908297.1736053203,
'Mean quadratic error': 696.0267765360618,
'Silhouette Index': 0.16294320522575195,
'Dunn Index': 0.25897601382124175,
'Generalized Dunn Indexes': {'GDI (1, 1)': 0.25897601382124175,
'GDI (1, 2)': 0.9076143747196692,
'GDI (1, 3)': 0.3158503201955148,
'GDI (2, 1)': 0.25897601382124175,
'GDI (2, 2)': 0.9076143747196692,
'GDI (2, 3)': 0.3158503201955148,
'GDI (3, 1)': 0.5790691834873279,
'GDI (3, 2)': 2.0294215944379173,
'GDI (3, 3)': 0.7062398726473335,
'GDI (4, 1)': 0.2875582147985151,
'GDI (4, 2)': 1.0077843328765126,
'GDI (4, 3)': 0.35070952278095474,
'GDI (5, 1)': 0.28515682596025516,
'GDI (5, 2)': 0.9993683603053982,
'GDI (5, 3)': 0.34778075952490317,
'GDI (6, 1)': 0.6033066382644287,
'GDI (6, 2)': 2.1143648370097905,
'GDI (6, 3)': 0.735800169522378},
'Wemmert-Gancarski Index': 0.2502241827215019,
'Calinski-Harabasz Index': 144.1902786959258,
'Ratkowsky-Lance Index': nan,
'Point Biserial Index': -4.064966952313242,
'PBM Index': 34.22417733472788},
'max diff': {'Trace WiB Index': nan, 'Trace W Index': 1250760.117435303},
'min': {'Banfeld-Raftery Index': 11718.207536490032,
'Ball Hall Index': 695.801129352618,
'C Index': 0.1476415026698158,
'Ray-Turi Index': 1.5857819700225737,
'Xie-Beni Index': 1.9551313947642188,
'Davies Bouldin Index': 2.1517097380390937,
'SD Index': [array([0.627482, 0.070384])],
'Mclain-Rao Index': 0.7267985756237975,
'Scott-Symons Index': nan},
'min diff': {'Det Ratio Index': nan,
'Log BGSS/WGSS Index': -0.3199351306684197,
'S_Dbw Index': nan,
'Nlog Det Ratio Index': nan}},
'clusters': {'max': {'Centroid distance to barycenter': [26.422334274375757,
20.184062405495773,
22.958470492954795,
21.71559561353746,
25.717240507145213,
20.283308612864644,
26.419951469008378,
24.426658073844308,
13.44306158441342,
19.876908956223936],
'Between-group Dispersion': [124268.87523421964,
74146.1402843885,
93295.17202553005,
86296.77799167577,
119709.13913372548,
74877.09470781704,
126340.5042480812,
106802.4308135141,
31444.567428645732,
71116.47173772276],
'Average Silhouette': [0.3608993843537291,
0.05227459502398472,
0.14407593888502124,
0.15076708301431302,
0.16517001390130848,
0.1194825125348905,
0.28763816949713245,
0.19373598833558672,
0.08488231267929798,
0.07117051617968871],
'KernelDensity mean': [-87.26207798353086,
-102.79627948741418,
-118.2807433740146,
-102.80193279131969,
-102.79094365877583,
-102.79645332546204,
-87.27879146450985,
-102.77983243274437,
-118.2636521439672,
-118.29755528330563],
'Ball Hall Index': [396.35042923873254,
940.6359437266029,
751.2059752944557,
633.6276389262146,
736.2863160465186,
757.3853701243812,
512.8915478770488,
734.7467931712492,
741.1588717135685,
753.7224074074073]},
'min': {'Within-Cluster Dispersion': [70550.3764044944,
171195.74175824173,
132963.45762711865,
115953.85792349727,
133267.82320441987,
137844.13736263738,
92833.37016574583,
131519.67597765362,
128961.64367816092,
135670.03333333333],
'Largest element distance': [54.543560573178574,
72.85602240034794,
67.0,
62.3377895020348,
71.69379331573968,
66.53570470055908,
61.155539405682624,
67.93379129711516,
61.171888968708494,
63.773035054010094],
'Inter-element mean distance': [27.495251790928528,
41.577045912127325,
37.66525398978789,
34.81272464303223,
37.28558306007683,
38.08288651715454,
31.222158502521683,
37.241230341156786,
37.938810358062234,
37.830620986872184],
'Davies Bouldin Index': array([1.55628353, 2.70948787, 2.09498538, 2.43120015, 1.96455875,
      2.09074874, 1.58102612, 1.94811882, 2.70948787, 2.43120015]),
'C Index': [0.15780619270180213,
0.4626045226116365,
0.37889533673771314,
0.31459485530776515,
0.3693066184157008,
0.38636193134197444,
0.23717385124578905,
0.36902306811086555,
0.3857833597084178,
0.3815092165505222]}},
'radius': {'min': {'Radius min': {0: 11.963104233270684,
1: 16.495963249417844,
2: 17.228366828448973,
3: 15.096075210359995,
4: 15.943646753449636,
5: 16.46455777853301,
6: 12.786523861254974,
7: 14.61523732739271,
8: 18.374826032773953,
9: 16.317673899226},
'Radius mean': {0: 19.364954,
1: 29.868519,
2: 26.747682,
3: 24.578193,
4: 26.464614,
5: 27.18575,
6: 22.162453,
7: 26.412302,
8: 26.896195,
9: 26.728077},
'Radius median': {0: 19.090152,
1: 27.705495,
2: 25.299287,
3: 23.495162,
4: 26.434238,
5: 27.194139,
6: 21.579562,
7: 25.358031,
8: 26.982504,
9: 25.201186},
'Radius 75th Percentile': {0: 22.142983,
1: 35.627396,
2: 30.263862,
3: 27.808539,
4: 29.727508,
5: 29.221274,
6: 24.736136,
7: 30.21675,
8: 30.137334,
9: 29.966933},
'Radius max': {0: 35.381597,
1: 48.76808,
2: 48.6619,
3: 40.02036,
4: 51.535976,
5: 40.584931,
6: 42.250871,
7: 44.424333,
8: 38.175815,
9: 45.985382}}}}
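
As a sanity check, a few of the global indices can be recomputed from values printed in the output above, using the definitions of the main reference. The variable names are chosen for this example; the numbers are copied from the output (N = 1797 observations, K = 10 clusters).

```python
import math

# Values copied from the output above.
bgss = 908297.1736053203   # 'Between-group total dispersion'
wgss = 1250760.117435303   # 'Trace W Index'
n_samples, n_clusters = 1797, 10

# Calinski-Harabasz: between- over within-group dispersion,
# each divided by its degrees of freedom.
calinski_harabasz = (bgss / (n_clusters - 1)) / (wgss / (n_samples - n_clusters))

# Log BGSS/WGSS index.
log_ss_ratio = math.log(bgss / wgss)

# Global Ball-Hall index: mean of the per-cluster 'Ball Hall Index' values.
ball_hall_per_cluster = [
    396.35042923873254, 940.6359437266029, 751.2059752944557,
    633.6276389262146, 736.2863160465186, 757.3853701243812,
    512.8915478770488, 734.7467931712492, 741.1588717135685,
    753.7224074074073,
]
ball_hall = sum(ball_hall_per_cluster) / len(ball_hall_per_cluster)

print(calinski_harabasz)  # about 144.19, as 'Calinski-Harabasz Index'
print(log_ss_ratio)       # about -0.3199, as 'Log BGSS/WGSS Index'
print(ball_hall)          # about 695.80, as the general 'Ball Hall Index'
```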

The code associated with each index in the indices.json file can be retrieved with this call:

CC._get_all_index

{'general': {'max': {'Between-group total dispersion': 'G-Max-01', 'Mean quadratic error': 'G-Max-02', 'Silhouette Index': 'G-Max-03', 'Dunn Index': 'G-Max-04', 'Generalized Dunn Indexes': 'G-Max-GDI', 'Wemmert-Gancarski Index': 'G-Max-05', 'Calinski-Harabasz Index': 'G-Max-06', 'Ratkowsky-Lance Index': 'G-Max-07', 'Point Biserial Index': 'G-Max-08', 'PBM Index': 'G-Max-09'}, 'max diff': {'Trace WiB Index': 'G-MaxD-01', 'Trace W Index': 'G-MaxD-02'}, 'min': {'Banfeld-Raftery Index': 'G-Min-01', 'Ball Hall Index': 'G-Min-02', 'C Index': 'G-Min-03', 'Ray-Turi Index': 'G-Min-04', 'Xie-Beni Index': 'G-Min-05', 'Davies Bouldin Index': 'G-Min-06', 'SD Index': 'G-Min-07', 'Mclain-Rao Index': 'G-Min-08', 'Scott-Symons Index': 'G-Min-09'}, 'min diff': {'Det Ratio Index': 'G-MinD-01', 'Log BGSS/WGSS Index': 'G-MinD-02', 'S_Dbw Index': 'G-MinD-03', 'Nlog Det Ratio Index': 'G-MinD-04'}}, 'clusters': {'max': {'Centroid distance to barycenter': 'C-Max-01', 'Between-group Dispersion': 'C-Max-02', 'Average Silhouette': 'C-Max-03', 'KernelDensity mean': 'C-Max-04', 'Ball Hall Index': 'C-Max-05'}, 'min': {'Within-Cluster Dispersion': 'C-Min-01', 'Largest element distance': 'C-Min-02', 'Inter-element mean distance': 'C-Min-03', 'Davies Bouldin Index': 'C-Min-04', 'C Index': 'C-Min-05'}}, 'radius': {'min': {'Radius min': 'R-Min-01', 'Radius mean': 'R-Min-02', 'Radius median': 'R-Min-03', 'Radius 75th Percentile': 'R-Min-04', 'Radius max': 'R-Min-05'}}}

These codes are useful when you want to generate a single index using IndexCore:

CC.generate_output_by_info_type("general", "max", "G-Max-01")
908297.1736053203

Second method using "score_index_" methods

CC.score_between_group_dispersion()
908297.1736053203

This gives the same result as above, but it computes the same score a second time.

Speed test of different scores

pd_df :  
shape - (1797, 65) 
 total elements=116805 
 
Columns types:
pd_df.dtypes.value_counts() : 64 x float64 + 1 x Int32

score_index_ball_hall 

5.06 ms ± 79.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_banfeld_Raftery

5.02 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_c

104 ms ± 550 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_c_for_each_cluster

95.5 ms ± 720 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_calinski_harabasz 

16.5 ms ± 257 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_davies_bouldin 

12.3 ms ± 76.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_davies_bouldin_for_each_cluster 

12.3 ms ± 300 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_det_ratio 

181 ms ± 4.09 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score_index_dunn

19.6 ms ± 1.06 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_generalized_dunn_matrix

994 ms ± 41.4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score_index_Log_Det_ratio

180 ms ± 2.86 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_log_ss_ratio 

16.3 ms ± 249 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_mclain_rao 

63.5 ms ± 6.55 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_PBM 

23.5 ms ± 3.84 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_point_biserial

50.3 ms ± 434 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_ratkowsky_lance 

12.3 ms ± 254 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_ray_turi 

23.2 ms ± 889 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_scott_symons

153 ms ± 6.25 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_SD 

211 ms ± 4.37 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

score_index_trace_WiB 

138 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

score_index_wemmert_gancarski 

8.13 ms ± 93 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

score_index_xie_beni

85.8 ms ± 1.31 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Confusion Hypersphere

The confusion hypersphere subclass counts the number of elements contained inside an n-dimensional sphere (hypersphere) of given radius centered on each cluster centroid. The radius is the same for every hypersphere.

Args:

counting_type : "including" or "excluding". If "including", the elements belonging to cluster i that are contained inside the hypersphere centered on centroid i are counted (the diagonal case i=j). If "excluding", they are not counted.
proportion : (bool) If True, return the proportion of elements instead of the count. Default is False.

self.confusion_hypersphere_matrix

CC.confusion_hypersphere_matrix(radius=35, counting_type="including", proportion=True)

	C:0	C:1	C:2	C:3	C:4	C:5	C:6	C:7	C:8	C:9
H:0	0.994382	0.000000	0.000000	0.010929	0.000000	0.032967	0.060773	0.000000	0.005747	0.211111
H:1	0.000000	0.736264	0.090395	0.103825	0.187845	0.016484	0.110497	0.055866	0.574713	0.022222
H:2	0.000000	0.142857	0.881356	0.355191	0.000000	0.005495	0.000000	0.000000	0.310345	0.000000
H:3	0.000000	0.005495	0.225989	0.950820	0.000000	0.258242	0.000000	0.016760	0.327586	0.666667
H:4	0.050562	0.032967	0.000000	0.000000	0.928177	0.027473	0.154696	0.027933	0.028736	0.000000
H:5	0.095506	0.000000	0.000000	0.103825	0.005525	0.950549	0.022099	0.005587	0.293103	0.133333
H:6	0.089888	0.027473	0.000000	0.000000	0.033149	0.021978	0.983425	0.000000	0.068966	0.000000
H:7	0.000000	0.071429	0.028249	0.071038	0.060773	0.005495	0.000000	0.882682	0.201149	0.055556
H:8	0.044944	0.423077	0.293785	0.431694	0.011050	0.170330	0.110497	0.184358	0.977011	0.394444
H:9	0.421348	0.000000	0.005650	0.759563	0.000000	0.351648	0.000000	0.000000	0.356322	0.872222

To interpret this: if (x_{i,j}) is the returned matrix, then x_{i,j} is the number of elements belonging to cluster j that are contained inside the hypersphere of the given radius centered on the centroid of cluster i. If proportion is True, the count is replaced by the proportion of the elements of cluster j.
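
The counting rule can be sketched in plain Python. This is an illustration written for this README, not the package's code: the function name confusion_hypersphere, the inclusive boundary (distance <= radius) and the choice of returning 0 on the diagonal in "excluding" mode are assumptions.

```python
import math
from statistics import fmean

def confusion_hypersphere(rows, labels, radius, counting_type="including",
                          proportion=False):
    """matrix[i][j]: elements of cluster j within `radius` of centroid i."""
    clusters = {}
    for row, label in zip(rows, labels):
        clusters.setdefault(label, []).append(row)
    centroids = {
        label: [fmean(col) for col in zip(*members)]
        for label, members in clusters.items()
    }
    matrix = {}
    for i, centre in centroids.items():
        matrix[i] = {}
        for j, members in clusters.items():
            if i == j and counting_type == "excluding":
                matrix[i][j] = 0  # assumption: own cluster is simply not counted
                continue
            count = sum(1 for row in members if math.dist(row, centre) <= radius)
            matrix[i][j] = count / len(members) if proportion else count
    return matrix

# Toy data: two clusters on a line (hypothetical values).
rows = [[0.0, 0.0], [1.0, 0.0], [4.0, 0.0], [5.0, 0.0]]
labels = [0, 0, 1, 1]
counts = confusion_hypersphere(rows, labels, radius=3.6)
# counts == {0: {0: 2, 1: 1}, 1: {0: 1, 1: 2}}
shares = confusion_hypersphere(rows, labels, radius=3.6, proportion=True)
# shares == {0: {0: 1.0, 1: 0.5}, 1: {0: 0.5, 1: 1.0}}
```

Large off-diagonal entries indicate clusters whose elements fall inside another cluster's hypersphere, which is the confusion the matrix is meant to expose.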

self.confusion_hypersphere_for_linspace_radius_each_element

This method returns the results of the above method over a linearly spaced range of radii. "n_pts=" sets the number of radius values in that range.

CC.confusion_hypersphere_for_linspace_radius_each_element(radius=35, counting_type="excluding", n_pts=10)

	0	1	2	3	4	5	6	7	8	9
Radius
0.0000	0	0	0	0	0	0	0	0	0	0
7.1578	0	0	0	0	0	0	0	0	0	0
14.3155	0	0	0	0	0	0	0	0	0	0
21.4733	0	0	0	0	0	0	0	0	0	0
28.6311	3	10	3	36	1	24	1	0	34	30
35.7889	192	171	161	398	71	211	122	68	473	325
42.9466	1004	747	802	1023	641	940	837	765	1285	950
50.1044	1567	1346	1397	1470	1318	1536	1534	1369	1558	1479
57.2622	1602	1625	1589	1636	1624	1638	1629	1603	1566	1614
64.4200	1602	1638	1593	1647	1629	1638	1629	1611	1566	1620

confusion_hyperphere_around_specific_point_for_two_clusters

This method returns the number of elements belonging to the given Cluster1 or Cluster2 that are contained inside the hypersphere of the given radius centered on the given Point.

Point = CC.data_features.iloc[0]  # Choose an observation of the dataset
Cluster1 = CC.labels_clusters[0]  # Choose the first cluster
Cluster2 = CC.labels_clusters[1]  # Choose the second cluster
radius = 110  # Large radius to capture both clusters entirely; the result should be the sum of the sizes of data_clusters[Cluster1] and data_clusters[Cluster2]

CC.confusion_hyperphere_around_specific_point_for_two_clusters(Point,Cluster1,Cluster2, radius)

0    360 
dtype: int64

360 elements belonging to Cluster1 or Cluster2 are contained inside this hypersphere.

Info

The Info subclass shows two informative boards that give many kinds of information about the general dataset and the clusters. The type column can be "max", "min", "max diff" or "min diff". If "max" (resp. "min"), then the higher (resp. lower) the score, the better the clustering. "max diff" and "min diff" are useful when you need to find the best number of clusters: "max diff" corresponds to the maximum difference between a clustering with K clusters and a clustering with K' clusters (K != K'). See the Bernard Desgraupes reference for more explanations.

CC.general_info(hide_nan=False)

Current NaN Index :

Ratkowsky-Lance Index    -          G-Max-07
Trace WiB Index          -          G-MaxD-01
Scott-Symons Index       -          G-Min-09
Det Ratio Index          -          G-MinD-01
S_Dbw Index              -          G-MinD-03
Nlog Det Ratio Index     -          G-MinD-04

		General Informations
Between-group total dispersion	max	908297.173605
Mean quadratic error	max	696.026777
Silhouette Index	max	0.162943
Dunn Index	max	0.258976
Wemmert-Gancarski Index	max	0.250224
Calinski-Harabasz Index	max	144.190279
Point Biserial Index	max	-4.064967
PBM Index	max	34.224177
Trace W Index	max diff	1250760.117435
Banfeld-Raftery Index	min	11718.207536
Ball Hall Index	min	695.801129
C Index	min	0.147642
Ray-Turi Index	min	1.585782
Xie-Beni Index	min	1.955131
Davies Bouldin Index	min	2.15171
SD Index	min	[[0.627482, 0.070384]]
Mclain-Rao Index	min	0.726799
Log BGSS/WGSS Index	min diff	-0.319935
GDI (1, 1)	max	0.258976
GDI (1, 2)	max	0.907614
GDI (1, 3)	max	0.31585
GDI (2, 1)	max	0.258976
GDI (2, 2)	max	0.907614
GDI (2, 3)	max	0.31585
GDI (3, 1)	max	0.579069
GDI (3, 2)	max	2.029422
GDI (3, 3)	max	0.70624
GDI (4, 1)	max	0.287558
GDI (4, 2)	max	1.007784
GDI (4, 3)	max	0.35071
GDI (5, 1)	max	0.285157
GDI (5, 2)	max	0.999368
GDI (5, 3)	max	0.347781
GDI (6, 1)	max	0.603307
GDI (6, 2)	max	2.114365
GDI (6, 3)	max	0.7358

CC.clusters_info

		0	1	2	3	4	5	6	7	8	9
index	Type
Centroid distance to barycenter	max	26.42	20.18	22.95	21.71	25.71	20.28	26.41	24.42	13.44	19.87
Between-group Dispersion	max	124268	74146	93295	86296	119709	74877	126340	106802	31444	71116
Average Silhouette	max	0.36	0.05	0.14	0.15	0.16	0.11	0.28	0.19	0.08	0.07
KernelDensity mean	max	-87.26	-102.79	-118.28	-102.80	-102.79	-102.79	-87.27	-102.77	-118.26	-118.29
Ball Hall Index	max	396.35	940.63	751.20	633.62	736.28	757.38	512.89	734.74	741.15	753.72
Within-Cluster Dispersion	min	70550	171195	132963	115953	133267	137844	92833	131519	128961	135670
Largest element distance	min	54.54	72.85	67.00	62.33	71.69	66.53	61.15	67.93	61.17	63.77
Inter-element mean distance	min	27.49	41.57	37.66	34.81	37.28	38.08	31.22	37.24	37.93	37.83
Davies Bouldin Index	min	1.55	2.70	2.09	2.43	1.96	2.09	1.58	1.94	2.70	2.43
C Index	min	0.15	0.46	0.37	0.31	0.36	0.38	0.23	0.36	0.38	0.38
Radius min	min	11.96	16.49	17.22	15.09	15.94	16.46	12.78	14.61	18.37	16.31
Radius mean	min	19.36	29.86	26.74	24.57	26.46	27.18	22.16	26.41	26.89	26.72
Radius median	min	19.09	27.70	25.29	23.49	26.43	27.19	21.57	25.35	26.98	25.20
Radius 75th Percentile	min	22.14	35.62	30.26	27.80	29.72	29.22	24.73	30.21	30.13	29.96
Radius max	min	35.38	48.76	48.66	40.02	51.53	40.58	42.25	44.42	38.17	45.98

Density

The Density subclass is based on a 2D or 3D projection obtained with dimensionality reduction methods such as PCA or UMAP. Since UMAP is only available for 2D projections, only PCA is used for 3D density graphs. The main idea for approximating the density is to sum n-dimensional Gaussian distributions centered on each dataset point over a meshgrid covering the 2D or 3D projection. This section returns a lot of data, packed in a native Python dict. Each element of the dict (other than the main return) has to be activated by its own argument. See the following example:

self.density_projection_2D

Args:

reduction_method : "PCA" or "UMAP"
percentile : percentile of density that corresponds to the minimum value to show
return_data : If True, return 2D PCA Data
return_clusters_density : If True, return the 2D Grid with the Z values for each cluster

CC.density_projection_2D("PCA", 95, return_data=True, return_clusters_density=True)

{'Z-Grid':             -27.494448  -27.205068  -26.915688  -26.626307  -26.336927  \
 -31.169904    0.000000    0.000000    0.000000    0.000000    0.000000   
 -30.853975    0.000000    0.000000    0.000000    0.000000    0.000000   
 -30.538045    0.000000    0.000000    0.000000    0.000000    0.000000   
 -30.222115    0.000000    0.000000    0.000000    0.000000    0.000000   
 -29.906185    0.000000    0.000000    0.000000    0.000000    0.000000   
 ...                ...         ...         ...         ...         ...         
 30.436407     0.000000    0.000000    0.000000    0.000000    0.000000  
 30.752337     0.000000    0.000000    0.000000    0.000000    0.000000  
 31.068267     0.000000    0.000000    0.000000    0.000000    0.000000  
 31.384197     0.000000    0.000000    0.000000    0.000000    0.000000  
 31.700126     0.000000    0.000000    0.000000    0.000000    0.000000  
 
 [200 rows x 200 columns],
 'Clusters Density': {0: array([[0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
          2.26543202e-251, 6.72200949e-253, 1.80509292e-254],
         [0.00000000e+000, 0.00000000e+000, 0.00000000e+000, ...,
          2.39644309e-090, 2.99562311e-092, 3.38890645e-094]]),
  1: array([[1.22473352e-190, 9.95843640e-187, 7.32812819e-183, ...,
          1.89152820e-043, 5.31903131e-045, 1.35364451e-046],
         [6.13307159e-189, 4.98686467e-185, 3.66969092e-181, ...,
         [2.18683154e-176, 3.42747973e-173, 4.86168499e-170, ...,
          6.11416273e-291, 1.45371826e-293, 3.12806562e-296]]),
  2: array([[9.82858841e-081, 1.91715088e-079, 4.01121949e-078, ...,
          3.95263976e-164, 6.05314934e-167, 8.38934205e-170],
         [1.15888672e-078, 2.15624607e-077, 3.99557766e-076, ...,
          0.00000000e+000, 0.00000000e+000, 0.00000000e+000],
         [2.14277916e-156, 6.99479702e-154, 2.07015420e-151, ...,
          0.00000000e+000, 0.00000000e+000, 0.00000000e+000]]),
                              ...
          1.34702113e-296, 1.14119413e-299, 8.74977734e-303]]),
  8: array([[3.68438507e-120, 1.91948293e-117, 9.05015769e-115, ...,
          1.78634585e-151, 1.09968510e-153, 6.12666390e-156],
         [1.19395354e-118, 6.22023065e-116, 2.93277042e-113, ...,        
         [2.30028280e-221, 1.12585950e-218, 4.98917831e-216, ...,
          3.82113467e-265, 9.90939747e-269, 2.32570436e-272]]),
  9: array([[5.32990773e-136, 8.38778107e-133, 1.19461196e-129, ...,
          3.98260675e-172, 2.88650900e-175, 1.89334938e-178],
         [5.65560268e-135, 8.90033385e-132, 1.26761124e-128, ...,
         [3.33187669e-070, 4.02748126e-069, 4.64453319e-068, ...,
          1.02174508e-253, 1.38205190e-256, 1.69183696e-259]])},
 '2D PCA Data':            PCA0       PCA1
 0     -1.259467  21.274883
 1      7.957610 -20.768700
 ...            ...
 1795  -4.872099  12.423954
 1796  -0.344388   6.365550
 
 [1797 rows x 2 columns]}
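
The idea described at the top of this section (one Gaussian per projected point, summed on a meshgrid, then filtered by a percentile) can be sketched with NumPy alone. This is a minimal illustration and not the package's implementation: `pca_project`, `make_grid`, `gaussian_sum`, the bandwidth `sigma`, the grid size and the random data are all assumptions made for the example, and PCA is replaced by a plain SVD.

```python
import numpy as np

def pca_project(X, n_components=2):
    """Project X onto its first principal directions (plain SVD stand-in for PCA)."""
    Xc = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, sorted by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T

def make_grid(points, grid_size=50, margin=3.0):
    """Regular 2D meshgrid covering the projected points plus a margin."""
    xs = np.linspace(points[:, 0].min() - margin, points[:, 0].max() + margin, grid_size)
    ys = np.linspace(points[:, 1].min() - margin, points[:, 1].max() + margin, grid_size)
    return np.meshgrid(xs, ys)

def gaussian_sum(points, gx, gy, sigma=1.0):
    """Sum one isotropic 2D Gaussian per point, evaluated on every grid node."""
    # Squared distance from each grid node to each point: (grid, grid, n_points)
    d2 = (gx[..., None] - points[:, 0]) ** 2 + (gy[..., None] - points[:, 1]) ** 2
    return np.exp(-d2 / (2 * sigma**2)).sum(axis=-1) / (2 * np.pi * sigma**2)

rng = np.random.default_rng(0)
# Hypothetical dataset: two blobs in 5 dimensions, labelled 0 and 1.
X = np.vstack([rng.normal(0, 1, (100, 5)), rng.normal(6, 1, (100, 5))])
labels = np.repeat([0, 1], 100)

proj = pca_project(X)
gx, gy = make_grid(proj)

# One density grid per cluster, all evaluated on the same meshgrid.
cluster_density = {int(k): gaussian_sum(proj[labels == k], gx, gy)
                   for k in np.unique(labels)}
# The total density is the sum of the per-cluster densities.
z_grid = sum(cluster_density.values())

# Keep only values at or above the 95th percentile of the grid, zero elsewhere.
threshold = np.percentile(z_grid, 95)
z_shown = np.where(z_grid >= threshold, z_grid, 0.0)
```

Evaluating every cluster on the same grid is what makes the per-cluster arrays directly comparable and summable. Whether the package zeroes the values below the percentile in exactly this way is an assumption.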

self.density_projection_3D

Projects the dataset in 3D with PCA, then estimates the density on a 3D meshgrid by summing 3D Gaussian distributions. Args:

percentile : percentile of density that corresponds to the minimum value to show
return_grid : If True, return 3D Grid
return_clusters_density : If True, return the 3D Grid with the A values for each cluster

CC.density_projection_3D(99, return_grid=True, return_clusters_density=True)

{'A-Grid': array([[[3.48581797e-15, 1.62080230e-14, 6.90041904e-14, ...,
          5.83374041e-13, 1.92066889e-13, 5.70629214e-14],
         [6.60425258e-16, 6.17767595e-15, 5.01029852e-14, ...,
          3.52400611e-12, 5.50927146e-13, 7.45284376e-14]]]),
 'Clusters Density': {0: array([[[3.40385502e-48, 3.36307101e-47, 2.87017113e-46, ...,
           6.07579853e-65, 1.97018960e-66, 5.51854095e-68],
          [6.12214966e-32, 7.73521170e-31, 8.44230524e-30, ...,
           6.56667612e-49, 8.52648274e-51, 9.59653245e-53]],
         [[1.30566741e-19, 7.34944932e-19, 3.57377188e-18, ...,
           3.11782459e-28, 1.14260012e-29, 3.68476733e-31],
          [6.32366666e-22, 6.00865462e-21, 4.93929478e-20, ...,
           4.28020934e-27, 2.75837798e-28, 1.53709049e-29]]]),
  1: array([[[8.98310922e-30, 6.71213878e-29, 5.03213279e-28, ...,
           2.87124806e-29, 8.73589680e-30, 2.29930435e-30]
          [5.02339294e-21, 2.49290742e-20, 1.30264666e-19, ...,
           2.55638337e-14, 4.29622018e-15, 6.43949192e-16]],
          [2.83167266e-35, 8.65300585e-35, 2.28427319e-34, ...,
           8.11025209e-33, 1.61201265e-33, 2.82711740e-34]]]),
  2: array([[[1.36203887e-28, 9.67318866e-28, 5.96928017e-27, ...,
           1.54322295e-14, 5.09928476e-15, 1.48590282e-15],
                              ...
  8: array([[[6.84981203e-33, 1.26631274e-31, 2.37110234e-30, ...,
           9.64415278e-27, 5.73382414e-28, 3.22330467e-29],
          [1.74336209e-34, 9.68068937e-34, 4.64782487e-33, ...,
           1.05223775e-37, 1.79602795e-38, 2.64802829e-39]]]),
  9: array([[[3.60638003e-23, 1.23523984e-22, 5.16583215e-22, ...,
           5.75818352e-33, 1.73152931e-34, 4.49760341e-36],
           2.86151947e-52, 3.19445757e-53, 3.40743714e-54]]])},
 '3D Grid': {'X': array([[[-37.40388626, -37.40388626, -37.40388626, ..., -37.40388626,
           -37.40388626, -37.40388626],
          [ 38.04015058,  38.04015058,  38.04015058, ...,  38.04015058,
            38.04015058,  38.04015058]]]),
  'Y': array([[[-32.99333756, -32.99333756, -32.99333756, ..., -32.99333756,
           -32.99333756, -32.99333756],
            36.11064515,  36.11064515]]]),
  'Z': array([[[-35.1620997 , -33.64347275, -32.12484579, ...,  36.21336725,
            37.73199421,  39.25062116],
          [-35.1620997 , -33.64347275, -32.12484579, ...,  36.21336725,
            37.73199421,  39.25062116]]])}}
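
To make use of a grid like the one returned above, it helps to see how three coordinate arrays and one density array of the same shape fit together. The sketch below builds such a grid with NumPy and reads back the coordinates of the densest node. It is an illustration under assumptions: the `indexing="ij"` layout, the bandwidth `sigma`, the grid size and the random points are choices made for the example, not values taken from the package.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical points standing in for a dataset already projected in 3D.
points = rng.normal(0.0, 2.0, size=(60, 3))

# One 1D axis per dimension, spanning the range of the points.
axes = [np.linspace(points[:, d].min(), points[:, d].max(), 20) for d in range(3)]
# With indexing="ij", X varies along axis 0, Y along axis 1 and Z along axis 2.
X, Y, Z = np.meshgrid(*axes, indexing="ij")

# Stack into one array of node coordinates: shape (20, 20, 20, 3).
nodes = np.stack([X, Y, Z], axis=-1)

# Squared distance from every node to every point: shape (20, 20, 20, 60).
d2 = ((nodes[..., None, :] - points) ** 2).sum(axis=-1)

sigma = 1.0
# Sum of isotropic 3D Gaussians, one per point, at every node.
A = np.exp(-d2 / (2 * sigma**2)).sum(axis=-1) / (2 * np.pi * sigma**2) ** 1.5

# Map the flat argmax back to a 3D index, then to coordinates.
i, j, k = np.unravel_index(A.argmax(), A.shape)
densest_node = (X[i, j, k], Y[i, j, k], Z[i, j, k])
```

Because the coordinate arrays share the shape of the density array, any index into the density can be looked up directly in `X`, `Y` and `Z`.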

Utils

This section applies tools from other modules to the current self object. For example, PCA from scikit-learn and UMAP from umap-learn are both wrapped. The utils methods are:

self.utils_KernelDensity - https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KernelDensity.html#sklearn.neighbors.KernelDensity
self.utils_PCA - https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html?highlight=pca#sklearn.decomposition.PCA
self.utils_ts_filtering_STL - https://www.statsmodels.org/devel/generated/statsmodels.tsa.seasonal.STL.html
self.utils_UMAP - https://umap-learn.readthedocs.io/en/latest/
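
The arguments accepted by the utils methods are not listed here, so the sketch below calls the underlying scikit-learn classes directly on a hypothetical random dataset. It shows the kind of result these tools produce: a 2D PCA projection and a mean log-density from KernelDensity. The bandwidth, the kernel and the data are assumptions made for the example.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.neighbors import KernelDensity

rng = np.random.default_rng(0)
# Hypothetical dataset: 200 samples with 8 features.
X = rng.normal(size=(200, 8))

# Reduce to 2 dimensions with PCA.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X)

# Fit a Gaussian kernel density estimator on the original data.
kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(X)
# score_samples returns the log-density of each sample, hence negative values are common.
log_density = kde.score_samples(X)
mean_log_density = log_density.mean()
```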

Graphs

This subclass uses Plotly to plot the different data computed with the module.

CC.graph_boxplots_distances_to_centroid(0)

CC.graph_PCA_3D()

CC.graph_reduction_2D("UMAP")

CC.graph_reduction_2D("PCA")

CC.graph_reduction_density_2D("PCA", 99, "contour")

CC.graph_reduction_density_2D("UMAP", 99, "contour")

CC.graph_reduction_density_2D("PCA", 99, "interactive")

CC.graph_reduction_density_3D(99)

CC.graph_reduction_density_3D(99, clusters=[0, 1])

