Skip to content

fsavje/scclust

Repository files navigation

scclust

scclust is used for size-constrained clustering. It solves clustering problems where one wants to partition a set of data points into groups subject to constraints on the composition of the groups. In particular, scclust tries to minimize the dissimilarity of points assigned to the same cluster subject to user-specified constraints. It is possible to specify overall size constraints and constraints specific to types of data points. For example, in a data set with "red" and "blue" data points, one can restrict each cluster to contain at least five points in total of which at least two must be "red".

scclust is made with large data sets in mind, and it can cluster millions of data points in less than a minute. It's also fairly easy to adapt scclust to run with exotic databases or other distance metrics as all data management can be accessed and changed at runtime.

scclust is written in C99 and requires only the standard C library to compile.

How to compile scclust

The following code downloads and compiles the current development version of scclust:

wget https://github.com/fsavje/scclust/archive/master.zip
unzip master.zip
cd scclust-master
./configure
make

scclust compiles into a static library.

How to use scclust

Details on how to use scclust is best found in the documentation, the library header and the examples distributed with the code. For the impatient, the following snippet shows a simple use case: we have a set of data points and want to construct clusters that contain at least three points each.

#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <scclust.h>

int main(void) {

	// Data: 10 two-dimensional points
	double raw_data[20] = { 0.088316633,  0.807443027,
	                       -1.080004390,  0.969638235,
	                        1.316503268,  1.492648875,
	                        0.140829264, -0.100874408,
	                       -0.832028346,  0.655380698,
	                        1.392129281,  0.885861553,
	                       -2.764608404, -0.449692089,
	                        0.296104717,  0.943822811,
	                       -0.641513488,  0.417153186,
	                        1.509341003, -0.026534633 };

	// Error code variable
	scc_ErrorCode ec;

	// Construct scclust data set object
	scc_DataSet* data_set;
	ec = scc_init_data_set(10,           // Number of data points
	                       2,            // Number of dimensions
	                       20,           // Length of data matrix
	                       raw_data,     // Data matrix
	                       &data_set);   // Data set to initialize
	// Check error code
	if(ec != SCC_ER_OK) return 1;

	// Make empty clustering object
	scc_Clabel cluster_labels[10];
	scc_Clustering* clustering;
	ec = scc_init_empty_clustering(10,               // Number of data points
	                               cluster_labels,   // Clustering labels
	                               &clustering);     // Clustering to initialize
	if(ec != SCC_ER_OK) return 1;

	// Set clustering options (start with defaults)
	scc_ClusterOptions options = scc_get_default_options();
	// At least 3 data points in each cluster
	options.size_constraint = 3;

	// Make clustering
	ec = scc_sc_clustering(data_set, &options, clustering);
	if(ec != SCC_ER_OK) return 1;

	// Print clustering
	printf("Cluster assignment:\n");
	for(size_t i = 0; i < 10; ++i) {
		printf("%d ", cluster_labels[i]);
	}
	printf("\n");

	// Free clustering and data set objects
	scc_free_clustering(&clustering);
	scc_free_data_set(&data_set);
}

This is the output the program generates:

Cluster assignment:
0 2 1 0 2 1 2 0 0 1

That is, the first point is assigned to cluster "0", the second to cluster "2" and so on.

The example is distributed with the library. After compiling scclust itself, you can compile and run it by calling the following in the examples/simple folder.

make
./simple_example.out

Compilation options

scclust accepts several compilation options as flags to the configure script:

Usage: ./configure [OPTIONS]...

  -h, --help                display this help and exit
  --version                 display version information and exit
  -q, --quiet               do not print messages

  --enable-FEATURE          include FEATURE
  --disable-FEATURE         do not include FEATURE

  --enable-assert           enable ASSERT checking [default=off]
  --enable-digraph-debug    enable debug functions for digraphs [default=off]
  --enable-cmocka-headers   use cmocka allocation functions [default=off]
  --enable-documentation    make documentation [default=off]
  --enable-all-docs         make documentation for internal methods [default=off]

  --with-clabel=[ARG]       cluster label type [default=uint32_t]
  --with-clabel-na=[ARG]    cluster label NA value [default=max]
  --with-typelabel=[ARG]    type label type [default=uint_fast16_t]
  --with-pointindex=[ARG]   data point ID type [default=uint32_t]
  --with-arcindex=[ARG]     digraph arc type [default=uint32_t]

--[enable/disable]-assert

Default: --disable-assert

Compiles with debug asserts. This will slow down execution, but is helpful to catch bugs.

--[enable/disable]-digraph-debug

Default: --disable-digraph-debug

Compiles with debug support for the internal digraph object. This is only useful when debugging.

--[enable/disable]-cmocka-headers

Default: --disable-cmocka-headers

Compiles with support for cmocka's internal allocation and assert functions which watches for overflows and memory leaks. This is only useful when testing during development and is not recommended for production use.

Requires the cmocka library.

--[enable/disable]-documentation

Default: --disable-documentation

Compiles documentation.

Requires the doxygen library.

--[enable/disable]-all-docs

Default: --disable-all-docs

By default, scclust only makes documentation for the public header methods. With this option, one can make documentation for internal methods as well. This can be useful during development.

--with-clabel=[ARG]

Allowed values: uint32_t uint64_t int

Default: uint32_t

Change the data type that stores cluster labels. scclust uses this type to report cluster assignment.

--with-clabel-na=[ARG]

Allowed values: max min -1 -2 -3 ...

Default: max

Change the value used to denote unassigned data points.

--with-clabel-na= Value
max Use maximum value storable in --with-clabel
min Use minimum value, --with-clabel must be signed
-1 -2 -3 ... Use [ARG], --with-clabel must be signed

--with-typelabel=[ARG]

Allowed values: uint_fast16_t int

Default: uint_fast16_t

Change the data type that stores data point type labels. This type is used to denote which type data points are.

--with-pointindex=[ARG]

Allowed values: uint32_t uint64_t int

Default: uint32_t

Change the data type that stores point indices. This choice restricts the maximum number of points in any clustering problem solved by scclust. A wider type allows for larger problems but requires more memory. The maximum number of data points is given by:

--with-pointindex= Max data points
uint32_t 2^32 − 1
uint64_t 2^64 − 1
int 2^31 − 1

--with-arcindex=[ARG]

Allowed values: uint32_t uint64_t

Default: uint32_t

Change the data type that stores arc indices. This choice restricts the size of the graphs used by scclust to solve clustering problems. A wider type allows for larger problems but requires more memory. A rough estimate of the maximum number of arcs is given by {number of points} x {minimum size of clusters}. The maximum number of arcs is given by:

--with-arcindex= Max arcs
uint32_t 2^32 − 1
uint64_t 2^64 − 1

Service Provider Interface (SPI)

The main functionality of scclust is agnostic to how the data points are stored. The library comes with a data structure that stores the points in a floating point array, and a set of functions to access this data. It is possible to change these functions at runtime so that other data structure can be used for point storage. This can be useful when extending scclust to accept other databases or if one wants to use particular functions to calculate the distances. In particular, scclust ships with a simple nearest neighbor search algorithm; performance can often be improved drastically by using a dedicated nearest neighbor search library.

See include/scclust_spi.h and src/dist_search.h for the distance functions that can be exchanged. Note that if the new functions accepts the scc_DataSet struct as input (see src/data_set_struct.h), one can swap only parts of the distance functions.

See examples/ann/ for an example where the ANN library is used for nearest neighbor search. (It is recommended to compile scclust with the --with-pointindex=int option when using the ANN wrapper. This avoids costly type translations between the libraries.)

How to contribute

Thank you for considering contributing to scclust!

There are many ways to help out: report bugs, suggest new features and submitting code that implements enhancements and bug fixes. If possible, use Github's internal tools for to do so: issues for bug reports and suggestions, and pull requests for code. If you're new to Github, read this guide to learn more.

If you're filing a bug, please include all information needed to reproduce it. Besides the code you're trying to run, your platform, toolchain and compilation options is often useful information.

License

scclust is distributed under the GNU Lesser General Public License v2.1.

About

C library for size-constrained clustering

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages