
Python implementation of the CAIM (class-attribute interdependence maximization) algorithm. Requires Pandas and Numpy.


Morgan243/PyCAIM


CAIM is a supervised discretization method [1], and Python-CAIM is a Python implementation of it. This is a work in progress, so results should be inspected closely. The goal is to provide both a CLI for discretizing data for later use and a class for programmatic use. Pull requests are welcome.
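For readers new to the method, here is a minimal sketch of the CAIM criterion from [1]: for a candidate set of boundaries, build the class-by-interval count matrix (the "quanta matrix"), then average max_r^2 / M_+r over the intervals, where max_r is the largest class count in interval r and M_+r is the interval's total count. The function name `caim_score` and its signature are illustrative assumptions, not the API of this repository.

```python
import numpy as np

def caim_score(values, labels, boundaries):
    """CAIM criterion for one feature and one candidate set of boundaries.

    Hypothetical helper for illustration; not taken from caim.py.
    """
    values = np.asarray(values, dtype=float)
    labels = np.asarray(labels)
    boundaries = np.asarray(boundaries, dtype=float)
    n_intervals = len(boundaries) - 1

    # Interval index per value: first interval is closed on both ends,
    # the rest are (left, right]. Out-of-range values are clipped into
    # the first or last interval.
    interval_idx = np.clip(
        np.searchsorted(boundaries, values, side="left"), 1, n_intervals
    ) - 1

    # Quanta matrix: rows are classes, columns are intervals.
    classes, class_idx = np.unique(labels, return_inverse=True)
    quanta = np.zeros((len(classes), n_intervals), dtype=int)
    np.add.at(quanta, (class_idx, interval_idx), 1)

    col_totals = quanta.sum(axis=0)   # M_+r
    col_max = quanta.max(axis=0)      # max_r
    nonempty = col_totals > 0         # skip empty intervals (avoid 0/0)
    total = np.sum(col_max[nonempty] ** 2 / col_totals[nonempty])
    return float(total / n_intervals)

# A boundary that separates the two classes perfectly scores higher
# than a single interval that mixes them.
split = caim_score([1, 2, 3, 4], [0, 0, 1, 1], [1, 2, 4])
merged = caim_score([1, 2, 3, 4], [0, 0, 1, 1], [1, 4])
```

The algorithm in [1] greedily adds the boundary that most increases this score; the sketch only evaluates the score and does not perform that search.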

There is a MATLAB implementation by Guangdi Li and a Java implementation (Research->Data Mining Tool) by the author. The latter implements the currently unpublished CAIM+ version of the algorithm.

Python-CAIM currently works on UCI's Musk1 dataset as well as on other toy datasets. Results are validated against the Java implementation (see above).

On performance, the Java implementation is notably faster. This may be due to Java being fundamentally faster than Python, to design tricks or shortcuts, or to a combination of both. The source of the difference is currently difficult to determine, since source code does not appear to be included in the CAIM JAR file. The MATLAB version is comparable to Python-CAIM and often faster for very small datasets. However, Python-CAIM can parallelize discretization and can thus scale better for datasets with many features.
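Because each feature is discretized independently of the others, per-feature work can be handed to a pool of workers. The sketch below shows that pattern with the standard library's `concurrent.futures`. It is not how caim.py is implemented: `discretize_features` is a hypothetical helper, and `equal_width_edges` is a trivial stand-in for the per-feature routine, not CAIM.

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def equal_width_edges(column, labels, n_bins=3):
    """Stand-in per-feature discretizer (equal-width bins, NOT CAIM).

    `labels` is accepted only to mirror a supervised method's signature.
    """
    column = np.asarray(column, dtype=float)
    return np.linspace(column.min(), column.max(), n_bins + 1)

def discretize_features(columns, labels, discretize_one=equal_width_edges,
                        executor=None):
    """Run `discretize_one` on every feature, optionally through an executor.

    `columns` maps feature name -> sequence of values (hypothetical layout).
    """
    names = list(columns)
    cols = [columns[name] for name in names]
    labs = [labels] * len(names)
    if executor is None:
        results = [discretize_one(c, l) for c, l in zip(cols, labs)]
    else:
        # Executor.map preserves input order, so results line up with names.
        results = list(executor.map(discretize_one, cols, labs))
    return dict(zip(names, results))

columns = {"a": [0.0, 1.0, 2.0, 3.0], "b": [10.0, 20.0, 30.0, 40.0]}
labels = [0, 0, 1, 1]
with ThreadPoolExecutor(max_workers=2) as pool:
    edges = discretize_features(columns, labels, executor=pool)
```

For CPU-bound pure-Python work, a `ProcessPoolExecutor` can be passed instead of the thread pool, provided the per-feature function is defined at module level so it can be pickled.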

CLI Options

usage: caim.py [-h] [-t TARGET_FIELD] [-o OUTPUT_PATH] [-H] [-q] input_file

CAIM Algorithm Command Line Tool and Library

positional arguments:
  input_file            CSV input data file

optional arguments:
  -h, --help            show this help message and exit
  -t TARGET_FIELD, --target-field TARGET_FIELD
                        Target field as an integer (0-indexed) or string
                        corresponding to column name. Negative indices (e.g.
                        -1) are allowed.
  -o OUTPUT_PATH, --output-path OUTPUT_PATH
                        File path to write discretized form of data in CSV
                        format
  -H, --header          Use first row as column/field names
  -q, --quiet           Minimal information is printed to STDOUT
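The `-t` and `-H` options interact: the target may be given as a position (negative allowed) or as a column name, and names only exist when the first row is read as a header. The following sketch shows one way such a value could be resolved with pandas. The function `load_and_split` and the rule that integer-looking targets are treated as positions are assumptions for illustration, not code from caim.py.

```python
import io

import pandas as pd

def load_and_split(csv_source, target="-1", header=False):
    """Read a CSV and split it into features X and target column y.

    Hypothetical helper mirroring the -t / -H options described above.
    """
    # header=0 uses the first row as column names; header=None numbers them.
    df = pd.read_csv(csv_source, header=0 if header else None)

    try:
        position = int(target)      # "-1", "0", "3", ...
    except ValueError:
        position = None             # not an integer: treat as a column name

    if position is not None:
        target_col = df.columns[position]   # negative indices work here
    elif target in df.columns:
        target_col = target
    else:
        raise KeyError(f"no column named {target!r}")

    y = df[target_col]
    X = df.drop(columns=[target_col])
    return X, y

sample = "a,b,cls\n1.0,2.0,x\n3.0,4.0,y\n"
X, y = load_and_split(io.StringIO(sample), target="-1", header=True)
```

Here `X` holds columns `a` and `b`, and `y` holds the `cls` column; passing `target="cls"` would select the same column by name.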

Example Usages

Discretize IRIS data

python3 ./caim.py datasets/iris.data -t -1 -H

Discretize IRIS data and save discrete results to iris_caim.csv

python3 ./caim.py datasets/iris.data -t -1 -H -o iris_caim.csv

Discretize musk1

python3 ./caim.py datasets/musk_clean1.csv -t -1

Interval Output

Intervals are printed in the form:

[ 0.13  0.34  0.39  0.66]

This should be interpreted as:

[0.13, 0.34](0.34, 0.39](0.39, 0.66]

The output dataset will use the right-end of each interval as the discretized value.
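As an illustration of this convention, the sketch below maps raw values onto the right end of the interval that contains them, using the boundaries from the example above. The helper `to_right_ends` is hypothetical and not part of caim.py. Clipping out-of-range values into the first or last interval is an assumption of this sketch.

```python
import numpy as np

def to_right_ends(values, boundaries):
    """Replace each value with the right end of its interval.

    The first interval is closed on both ends; the others are (left, right].
    """
    boundaries = np.asarray(boundaries, dtype=float)
    # side="left" returns i such that boundaries[i-1] < v <= boundaries[i],
    # which is exactly the (left, right] rule.
    idx = np.searchsorted(boundaries, values, side="left")
    # A value equal to the lowest boundary gets idx 0; it belongs to the
    # first interval, so clip it up to 1. Clip the top end as well.
    idx = np.clip(idx, 1, len(boundaries) - 1)
    return boundaries[idx]

boundaries = [0.13, 0.34, 0.39, 0.66]
discrete = to_right_ends([0.13, 0.20, 0.34, 0.35, 0.66], boundaries)
# discrete.tolist() -> [0.34, 0.34, 0.34, 0.39, 0.66]
```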

TODO

  • Fix Unit Tests
  • Continue to re-implement in Pandas/NumPy for speed (avoid loops)
  • Add more test data and corresponding unit tests
  • Clean up the API and document it

[1] Kurgan, L. and Cios, K.J., 2004. CAIM Discretization Algorithm. IEEE Transactions on Knowledge and Data Engineering, 16(2):145-153
