Manifold Learning Visualization (#399)

Manifold learning includes algorithms that embed high dimensional data into a 2- or 3-dimensional space for visualization. Unlike PCA or SVD, manifold learning does a better job of embedding points that have non-linear relationships, making it useful for detecting clusters in data and therefore for assessing separability for learning algorithms. Because there are a variety of manifold algorithms, this visualizer groups them all into a single visual framework; the model to use is specified by supplying either an estimator or a string that refers to the algorithm in question.

* Added continuous and discrete colors

Created a lightweight method to detect continuous or discrete targets, then embedded the color selection methodology into the class. When no target is specified, a single color is used. When a discrete target is detected (<10 unique values) or specified, resolve_colors is used and a legend is added to the plot. When a continuous target is detected or specified, a colormap is used and a color bar is added to the plot.
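A rough sketch of the target detection described above, assuming the threshold is the "<10 unique values" rule stated in this message. The function name `determine_target_type` and the constant are hypothetical and are not taken from the commit's code:

```python
import numpy as np

# Assumed threshold, taken from the "<10 unique values" wording in the commit message.
MAX_DISCRETE_CLASSES = 10


def determine_target_type(y=None, max_discrete=MAX_DISCRETE_CLASSES):
    """Hypothetical helper: label a target as 'single', 'discrete', or 'continuous'."""
    if y is None:
        # No target supplied: every point is drawn in one color.
        return "single"

    # Count distinct target values and compare against the threshold.
    n_unique = np.unique(np.asarray(y)).shape[0]
    return "discrete" if n_unique < max_discrete else "continuous"


print(determine_target_type())
print(determine_target_type(["occupied", "unoccupied", "occupied"]))
print(determine_target_type(np.linspace(0, 1, 50)))
```

A heuristic like this would mislabel a regression target that happens to take only a few distinct values, which is one reason to allow the target type to be specified explicitly as well as detected.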
DistrictDataLabs · May 18, 2018 · 02f8c27
1 parent 46c962d
Showing 29 changed files with 1,115 additions and 0 deletions.
diff --git a/docs/api/features/images/concrete_isomap_manifold.png b/docs/api/features/images/concrete_isomap_manifold.png
diff --git a/docs/api/features/images/concrete_lle_manifold.png b/docs/api/features/images/concrete_lle_manifold.png
diff --git a/docs/api/features/images/concrete_modified_manifold.png b/docs/api/features/images/concrete_modified_manifold.png
diff --git a/docs/api/features/images/concrete_spectral_manifold.png b/docs/api/features/images/concrete_spectral_manifold.png
diff --git a/docs/api/features/images/concrete_tsne_manifold.png b/docs/api/features/images/concrete_tsne_manifold.png
diff --git a/docs/api/features/images/occupancy_isomap_manifold.png b/docs/api/features/images/occupancy_isomap_manifold.png
diff --git a/docs/api/features/images/occupancy_lle_manifold.png b/docs/api/features/images/occupancy_lle_manifold.png
diff --git a/docs/api/features/images/occupancy_modified_manifold.png b/docs/api/features/images/occupancy_modified_manifold.png
diff --git a/docs/api/features/images/occupancy_select_k_best_isomap_manifold.png b/docs/api/features/images/occupancy_select_k_best_isomap_manifold.png
diff --git a/docs/api/features/images/occupancy_spectral_manifold.png b/docs/api/features/images/occupancy_spectral_manifold.png
diff --git a/docs/api/features/images/occupancy_tsne_manifold.png b/docs/api/features/images/occupancy_tsne_manifold.png
diff --git a/docs/api/features/images/s_curve_hessian_manifold.png b/docs/api/features/images/s_curve_hessian_manifold.png
diff --git a/docs/api/features/images/s_curve_isomap_manifold.png b/docs/api/features/images/s_curve_isomap_manifold.png
diff --git a/docs/api/features/images/s_curve_lle_manifold.png b/docs/api/features/images/s_curve_lle_manifold.png
diff --git a/docs/api/features/images/s_curve_ltsa_manifold.png b/docs/api/features/images/s_curve_ltsa_manifold.png
diff --git a/docs/api/features/images/s_curve_mds_manifold.png b/docs/api/features/images/s_curve_mds_manifold.png
diff --git a/docs/api/features/images/s_curve_modified_manifold.png b/docs/api/features/images/s_curve_modified_manifold.png
diff --git a/docs/api/features/images/s_curve_spectral_manifold.png b/docs/api/features/images/s_curve_spectral_manifold.png
diff --git a/docs/api/features/images/s_curve_tsne_manifold.png b/docs/api/features/images/s_curve_tsne_manifold.png
diff --git a/docs/api/features/index.rst b/docs/api/features/index.rst
@@ -19,6 +19,7 @@ At the moment we have five feature analysis visualizers implemented:
 -  :doc:`pcoords`: plot instances as lines along vertical axes to
    detect classes or clusters
 -  :doc:`pca`: project higher dimensions into a visual space using PCA
+-  :doc:`manifold`: visualize high dimensional data using manifold learning
 -  :doc:`importances`: rank features by relative importance in a model
 -  :doc:`rfecv`: select a subset of features by importance
 -  :doc:`scatter`: plot instances by selecting subsets of features
@@ -39,6 +40,7 @@ is called which displays the image.
     from yellowbrick.features.pcoords import ParallelCoordinates
     from yellowbrick.features.jointplot import JointPlotVisualizer
     from yellowbrick.features.pca import PCADecomposition
+    from yellowbrick.features.manifold import Manifold
     from yellowbrick.features.importances import FeatureImportances
     from yellowbrick.features.rfecv import RFECV
     from yellowbrick.features.scatter import ScatterVisualizer
@@ -51,6 +53,7 @@ is called which displays the image.
    rankd
    pcoords
    pca
+   manifold
    importances
    rfecv
    scatter
diff --git a/docs/api/features/manifold.py b/docs/api/features/manifold.py
@@ -0,0 +1,175 @@
+#!/usr/bin/env python
+# manifold.py
+# Produce images for manifold documentation.
+#
+# Author:  Benjamin Bengfort <[email protected]>
+# Created: Sat May 12 11:26:18 2018 -0400
+#
+# ID: manifold.py [] [email protected] $
+
+"""
+Produce images for manifold documentation.
+"""
+
+##########################################################################
+## Imports
+##########################################################################
+
+import os
+
+import pandas as pd
+import matplotlib.pyplot as plt
+
+from sklearn import datasets
+from sklearn.pipeline import Pipeline
+from sklearn.feature_selection import SelectKBest
+from sklearn.feature_selection import f_classif#, mutual_info_classif
+from yellowbrick.features.manifold import Manifold, MANIFOLD_ALGORITHMS
+
+SKIP = (
+    'ltsa', # produces no result
+    'hessian', # errors because of matrix
+    'mds', # uses way too much memory
+)
+
+FIXTURES = os.path.normpath(os.path.join(
+    os.path.dirname(__file__),
+    "..", "..", "..", "examples", "data"
+))
+
+
+def load_occupancy_data():
+    # Load the classification data set
+    data = pd.read_csv(os.path.join(FIXTURES, 'occupancy', 'occupancy.csv'))
+
+    # Specify the features of interest and the classes of the target
+    features = ["temperature", "relative humidity", "light", "C02", "humidity"]
+
+    X = data[features]
+    y = pd.Series(['occupied' if y == 1 else 'unoccupied' for y in data.occupancy])
+
+    return X, y
+
+
+def load_concrete_data():
+    # Load a regression data set
+    data = pd.read_csv(os.path.join(FIXTURES, 'concrete', 'concrete.csv'))
+
+    # Specify the features of interest
+    feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
+    target_name = 'strength'
+
+    # Get the X and y data from the DataFrame
+    X = data[feature_names]
+    y = data[target_name]
+
+    return X, y
+
+
+def dataset_example(dataset="occupancy", manifold="all", path="images/"):
+    if manifold == "all":
+        if path is None or not os.path.isdir(path):
+            raise ValueError("please specify a directory to save examples to")
+
+        for algorithm in MANIFOLD_ALGORITHMS:
+            if algorithm in SKIP: continue
+
+            print("generating {} {} manifold".format(dataset, algorithm))
+            fpath = os.path.join(path, "{}_{}_manifold.png".format(dataset, algorithm))
+            try:
+                dataset_example(dataset, algorithm, fpath)
+            except Exception as e:
+                print("could not visualize {} manifold on {} data: {}".format(algorithm, dataset, e))
+                continue
+
+
+        # Break here!
+        return
+
+    # Create single example
+    _, ax = plt.subplots(figsize=(9,6))
+    oz = Manifold(ax=ax, manifold=manifold)
+
+    if dataset == "occupancy":
+        X, y = load_occupancy_data()
+    elif dataset == "concrete":
+        X, y = load_concrete_data()
+    else:
+        raise Exception("unknown dataset '{}'".format(dataset))
+
+    oz.fit(X, y)
+    oz.poof(outpath=path)
+
+
+def select_features_example(algorithm='isomap', path="images/occupancy_select_k_best_isomap_manifold.png"):
+    _, ax = plt.subplots(figsize=(9,6))
+
+    model = Pipeline([
+        ("selectk", SelectKBest(k=3, score_func=f_classif)),
+        ("viz", Manifold(ax=ax, manifold=algorithm)),
+    ])
+
+    X, y = load_occupancy_data()
+    model.fit(X, y)
+    model.named_steps['viz'].poof(outpath=path)
+
+
+class SCurveExample(object):
+    """
+    Creates an S-curve example and multiple visualizations
+    """
+
+    def __init__(self, n_points=1000, random_state=42):
+        self.X, self.y = datasets.make_s_curve(
+            n_points, random_state=random_state
+        )
+
+    def _make_path(self, path, name):
+        """
+        Makes directories as needed
+        """
+        if not os.path.exists(path):
+            os.makedirs(path)
+
+        if os.path.isdir(path):
+            return os.path.join(path, name)
+
+        return path
+
+    def plot_original_3d(self, path="images"):
+        """
+        Plot the original data in 3-dimensional space
+        """
+        raise NotImplementedError("nyi")
+
+    def plot_manifold_embedding(self, algorithm="lle", path="images"):
+        """
+        Draw the manifold embedding for the specified algorithm
+        """
+        _, ax = plt.subplots(figsize=(9,6))
+        path = self._make_path(path, "s_curve_{}_manifold.png".format(algorithm))
+
+        oz = Manifold(
+            ax=ax, manifold=algorithm,
+            target='continuous', colors='nipy_spectral'
+        )
+
+        oz.fit(self.X, self.y)
+        oz.poof(outpath=path)
+
+    def plot_all_manifolds(self, path="images"):
+        """
+        Plot all s-curve examples
+        """
+        for algorithm in MANIFOLD_ALGORITHMS:
+            self.plot_manifold_embedding(algorithm, path=path)
+
+
+if __name__ == '__main__':
+    # curve = SCurveExample()
+    # curve.plot_all_manifolds()
+
+    dataset_example('occupancy', 'tsne', path="images/occupancy_tsne_manifold.png")
+    # dataset_example('concrete', 'all')
+
+    # select_features_example()
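For reference when checking the S-curve images generated by the script above, the following sketch embeds the same kind of data with scikit-learn's `Isomap` directly rather than through the `Manifold` visualizer. The sample size and the `n_neighbors` value are illustrative assumptions, not settings taken from this commit:

```python
from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap

# Generate a 3-dimensional S-curve; y is the continuous position along the curve.
X, y = make_s_curve(300, random_state=42)

# Embed the 3-dimensional points into 2 dimensions (n_neighbors is an assumed value).
embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)

print(X.shape, y.shape, embedding.shape)
```

Because `y` is continuous here, it corresponds to the `target='continuous'` case used in `plot_manifold_embedding`.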
diff --git a/docs/api/features/manifold.rst b/docs/api/features/manifold.rst
@@ -0,0 +1,161 @@
+.. -*- mode: rst -*-
+
+Manifold Visualization
+======================
+
+The ``Manifold`` visualizer provides high dimensional visualization using
+`manifold learning`_
+to embed instances described by many dimensions into 2 dimensions, allowing the
+creation of a scatter plot that shows latent structures in data. Unlike
+decomposition methods such as PCA and SVD, manifolds generally use
+nearest-neighbors approaches to embedding, allowing them to capture non-linear
+structures that would be otherwise lost. The projections that are produced
+can then be analyzed for noise or separability to determine if it is possible
+to create a decision space in the data.
+
+.. image:: images/concrete_tsne_manifold.png
+
+The ``Manifold`` visualizer allows access to all currently available
+scikit-learn manifold implementations by specifying the manifold as a string to the visualizer. The currently implemented default manifolds are as follows:
+
+==============  ============================================================
+Manifold        Description
+==============  ============================================================
+``"lle"``       `Locally Linear Embedding`_ (LLE) uses many local linear
+                decompositions to preserve globally non-linear structures.
+``"ltsa"``      `LTSA LLE`_: local tangent space alignment is similar to LLE
+                in that it uses locality to preserve neighborhood distances.
+``"hessian"``   `Hessian LLE`_: an LLE regularization method that applies a
+                hessian-based quadratic form at each neighborhood.
+``"modified"``  `Modified LLE`_: applies a regularization parameter to LLE.
+``"isomap"``    `Isomap`_: seeks a lower dimensional embedding that maintains
+                geodesic distances between all instances.
+``"mds"``       `MDS`_: multi-dimensional scaling uses similarity to place
+                points that are near each other close together in the embedding.
+``"spectral"``  `Spectral Embedding`_: a discrete approximation of the low
+                dimensional manifold using a graph representation.
+``"tsne"``      `t-SNE`_: converts the similarity of points into probabilities
+                then uses those probabilities to create an embedding.
+==============  ============================================================
+
+Each manifold algorithm produces a different embedding and takes advantage of
+different properties of the underlying data. Generally speaking, it requires
+multiple attempts on new data to determine the manifold that works best for
+the structures latent in your data. Note, however, that different manifold
+algorithms have different time complexity and resource requirements.
+
+Manifolds can be used on many types of problems, and the color used in the
+scatter plot can describe the target of each instance. In an unsupervised or
+clustering problem, a single color is used to show structure and overlap. In
+a classification problem discrete colors are used for each class. In a
+regression problem, a color map can be used to describe points as a heat map
+of their regression values.
+
+Discrete Target
+---------------
+
+In a classification or clustering problem, the instances can be described by
+discrete labels - the classes or categories in the supervised problem, or the
+clusters they belong to in the unsupervised version. The manifold visualizes
+this by assigning a color to each label and showing the labels in a legend.
+
+.. code:: python
+
+    # Load the classification data set
+    data = load_data('occupancy')
+
+    # Specify the features of interest
+    features = [
+        "temperature", "relative humidity", "light", "C02", "humidity"
+    ]
+
+    # Extract the data from the data frame.
+    X = data[features]
+    y = data.occupancy
+
+.. code:: python
+
+    from yellowbrick.features.manifold import Manifold
+
+    visualizer = Manifold(manifold='tsne', target='discrete')
+    visualizer.fit_transform(X, y)
+    visualizer.poof()
+
+
+.. image:: images/occupancy_tsne_manifold.png
+
+The visualization also displays the amount of time it takes to generate the
+embedding; as you can see, this can take a long time even for relatively
+small datasets. One tip is to scale your data using the ``StandardScaler``;
+another is to sample your instances (e.g. using ``train_test_split`` to
+preserve class stratification) or to filter features to decrease sparsity in
+the dataset.
+
+One common mechanism is to use ``SelectKBest`` to select the features that
+have a statistical correlation with the target. For example, we can use the
+``f_classif`` score to find the 3 best features in our occupancy dataset.
+
+.. code:: python
+
+    from sklearn.pipeline import Pipeline
+    from sklearn.feature_selection import SelectKBest
+    from sklearn.feature_selection import f_classif
+
+    model = Pipeline([
+        ("selectk", SelectKBest(k=3, score_func=f_classif)),
+        ("viz", Manifold(manifold='isomap', target='discrete')),
+    ])
+
+    X, y = load_occupancy_data()
+    model.fit(X, y)
+    model.named_steps['viz'].poof()
+
+.. image:: images/occupancy_select_k_best_isomap_manifold.png
+
+Continuous Target
+-----------------
+
+For a regression target or to specify color as a heat-map of continuous
+values, specify ``target='continuous'``. Note that by default the parameter
+``target='auto'`` is set, which determines if the target is discrete or
+continuous by counting the number of unique values in ``y``.
+
+.. code:: python
+
+    # Specify the features of interest
+    feature_names = [
+        'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
+    ]
+    target_name = 'strength'
+
+    # Get the X and y data from the DataFrame
+    X = data[feature_names]
+    y = data[target_name]
+
+.. code:: python
+
+    visualizer = Manifold(manifold='isomap', target='continuous')
+    visualizer.fit_transform(X, y)
+    visualizer.poof()
+
+.. image:: images/concrete_isomap_manifold.png
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.features.manifold
+    :members: Manifold
+    :undoc-members:
+    :show-inheritance:
+
+
+.. _`manifold learning`: http://scikit-learn.org/stable/modules/manifold.html
+.. _`manifold comparisons`: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html
+.. _`Locally Linear Embedding`: http://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding
+.. _`LTSA LLE`: http://scikit-learn.org/stable/modules/manifold.html#local-tangent-space-alignment
+.. _`Hessian LLE`: http://scikit-learn.org/stable/modules/manifold.html#hessian-eigenmapping
+.. _`Modified LLE`: http://scikit-learn.org/stable/modules/manifold.html#modified-locally-linear-embedding
+.. _`Isomap`: http://scikit-learn.org/stable/modules/manifold.html#isomap
+.. _`MDS`: http://scikit-learn.org/stable/modules/manifold.html#multi-dimensional-scaling-mds
+.. _`Spectral Embedding`: http://scikit-learn.org/stable/modules/manifold.html#spectral-embedding
+.. _`t-SNE`: http://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne
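The manifold.rst text above recommends scaling the data and sampling instances with `train_test_split` to reduce embedding time. A minimal sketch of that preprocessing follows, using synthetic data from `make_classification` as a stand-in for the occupancy dataset; the sample sizes and class weights are assumptions chosen only for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic, imbalanced stand-in for a classification dataset (assumed sizes).
X, y = make_classification(
    n_samples=2000, n_features=5, n_informative=3, n_redundant=0,
    weights=[0.8, 0.2], random_state=42,
)

# Keep a 500-instance sample while preserving the class proportions.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=500, stratify=y, random_state=42
)

# Standardize each feature to zero mean and unit variance before embedding.
X_scaled = StandardScaler().fit_transform(X_sample)

print(X_scaled.shape, y_sample.mean(), y.mean())
```

The resulting `X_scaled` and `y_sample` could then be passed to `Manifold(manifold='tsne', target='discrete').fit_transform(...)` as in the documentation above.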
diff --git a/docs/index.rst b/docs/index.rst
@@ -25,6 +25,7 @@ Feature Visualization
 - :doc:`api/features/pcoords`: horizontal visualization of instances
 - :doc:`Radial Visualization <api/features/radviz>`: separation of instances around a circular plot
 - :doc:`api/features/pca`: projection of instances based on principal components
+- :doc:`api/features/manifold`: high dimensional visualization with manifold learning
 - :doc:`api/features/importances`: rank features by importance or linear coefficients for a specific model
 - :doc:`api/features/rfecv`: find the best subset of features based on importance
 - :doc:`Scatter and Joint Plots<api/features/scatter>`: direct data visualization with feature selection

diff --git a/tests/baseline_images/test_features/test_manifold/test_manifold_classification.png b/tests/baseline_images/test_features/test_manifold/test_manifold_classification.png
diff --git a/tests/baseline_images/test_features/test_manifold/test_manifold_pandas.png b/tests/baseline_images/test_features/test_manifold/test_manifold_pandas.png
diff --git a/tests/baseline_images/test_features/test_manifold/test_manifold_regression.png b/tests/baseline_images/test_features/test_manifold/test_manifold_regression.png
diff --git a/tests/baseline_images/test_features/test_manifold/test_manifold_single.png b/tests/baseline_images/test_features/test_manifold/test_manifold_single.png