Manifold Learning Visualization (#399)
Manifold learning includes algorithms that embed high dimensional data
into a 2 or 3-dimensional space for visualization. Unlike PCA or SVD,
manifold learning does a better job of embedding points that have
non-linear relationships, making it useful for detecting clusters in
data and therefore separability for learning algorithms.

Because there are a variety of manifold algorithms, this visualizer
groups them all into a single visual framework, allowing specification
of which model to use by supplying an estimator or a string that refers
to the algorithm in question.

* Added continuous and discrete colors

Created a lightweight method to detect continuous or discrete targets
then embedded the color selection methodology into the class. When no
target is specified, a single color is used. When a discrete target is
detected (<10 unique values) or specified, then resolve colors is used
and a legend is added to the plot. When a continuous target is detected
or specified, then a colormap is used and a color bar is added to the
plot.
bbengfort authored May 18, 2018
1 parent 46c962d commit 02f8c27
Showing 29 changed files with 1,115 additions and 0 deletions.
3 changes: 3 additions & 0 deletions docs/api/features/index.rst
@@ -19,6 +19,7 @@ At the moment we have five feature analysis visualizers implemented:
- :doc:`pcoords`: plot instances as lines along vertical axes to
detect classes or clusters
- :doc:`pca`: project higher dimensions into a visual space using PCA
- :doc:`manifold`: visualize high dimensional data using manifold learning
- :doc:`importances`: rank features by relative importance in a model
- :doc:`rfecv`: select a subset of features by importance
- :doc:`scatter`: plot instances by selecting subsets of features
@@ -39,6 +40,7 @@ is called which displays the image.
from yellowbrick.features.pcoords import ParallelCoordinates
from yellowbrick.features.jointplot import JointPlotVisualizer
from yellowbrick.features.pca import PCADecomposition
from yellowbrick.features.manifold import Manifold
from yellowbrick.features.importances import FeatureImportances
from yellowbrick.features.rfecv import RFECV
from yellowbrick.features.scatter import ScatterVisualizer
@@ -51,6 +53,7 @@ is called which displays the image.
rankd
pcoords
pca
manifold
importances
rfecv
scatter
175 changes: 175 additions & 0 deletions docs/api/features/manifold.py
@@ -0,0 +1,175 @@
#!/usr/bin/env python
# manifold.py
# Produce images for manifold documentation.
#
# Author: Benjamin Bengfort <[email protected]>
# Created: Sat May 12 11:26:18 2018 -0400
#
# ID: manifold.py [] [email protected] $

"""
Produce images for manifold documentation.
"""

##########################################################################
## Imports
##########################################################################

import os

import pandas as pd
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import f_classif#, mutual_info_classif
from yellowbrick.features.manifold import Manifold, MANIFOLD_ALGORITHMS

SKIP = (
    'ltsa',     # produces no result
    'hessian',  # errors because of matrix
    'mds',      # uses way too much memory
)

FIXTURES = os.path.normpath(os.path.join(
    os.path.dirname(__file__),
    "..", "..", "..", "examples", "data"
))


def load_occupancy_data():
    # Load the classification data set
    data = pd.read_csv(os.path.join(FIXTURES, 'occupancy', 'occupancy.csv'))

    # Specify the features of interest and the classes of the target
    features = ["temperature", "relative humidity", "light", "C02", "humidity"]

    X = data[features]
    y = pd.Series([
        'occupied' if value == 1 else 'unoccupied' for value in data.occupancy
    ])

    return X, y


def load_concrete_data():
    # Load a regression data set
    data = pd.read_csv(os.path.join(FIXTURES, 'concrete', 'concrete.csv'))

    # Specify the features of interest
    feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age']
    target_name = 'strength'

    # Get the X and y data from the DataFrame
    X = data[feature_names]
    y = data[target_name]

    return X, y


def dataset_example(dataset="occupancy", manifold="all", path="images/"):
    if manifold == "all":
        # One image is saved per algorithm, so path must be an existing directory
        if path is None or not os.path.isdir(path):
            raise ValueError("please specify a directory to save examples to")

        for algorithm in MANIFOLD_ALGORITHMS:
            if algorithm in SKIP:
                continue

            print("generating {} {} manifold".format(dataset, algorithm))
            fpath = os.path.join(path, "{}_{}_manifold.png".format(dataset, algorithm))
            try:
                dataset_example(dataset, algorithm, fpath)
            except Exception as e:
                print("could not visualize {} manifold on {} data: {}".format(algorithm, dataset, e))
                continue

        # Break here!
        return

    # Create single example
    _, ax = plt.subplots(figsize=(9, 6))
    oz = Manifold(ax=ax, manifold=manifold)

    if dataset == "occupancy":
        X, y = load_occupancy_data()
    elif dataset == "concrete":
        X, y = load_concrete_data()
    else:
        raise ValueError("unknown dataset '{}'".format(dataset))

    oz.fit(X, y)
    oz.poof(outpath=path)


def select_features_example(algorithm='isomap', path="images/occupancy_select_k_best_isomap_manifold.png"):
    _, ax = plt.subplots(figsize=(9, 6))

    model = Pipeline([
        ("selectk", SelectKBest(k=3, score_func=f_classif)),
        ("viz", Manifold(ax=ax, manifold=algorithm)),
    ])

    X, y = load_occupancy_data()
    model.fit(X, y)
    model.named_steps['viz'].poof(outpath=path)


class SCurveExample(object):
    """
    Creates an S-curve example and multiple visualizations
    """

    def __init__(self, n_points=1000, random_state=42):
        # make_s_curve is importable directly from sklearn.datasets
        self.X, self.y = datasets.make_s_curve(
            n_points, random_state=random_state
        )

    def _make_path(self, path, name):
        """
        Makes directories as needed
        """
        if not os.path.exists(path):
            # os.makedirs creates intermediate directories (os.mkdirs does not exist)
            os.makedirs(path)

        if os.path.isdir(path):
            return os.path.join(path, name)

        return path

    def plot_original_3d(self, path="images"):
        """
        Plot the original data in 3-dimensional space
        """
        raise NotImplementedError("nyi")

    def plot_manifold_embedding(self, algorithm="lle", path="images"):
        """
        Draw the manifold embedding for the specified algorithm
        """
        _, ax = plt.subplots(figsize=(9, 6))
        path = self._make_path(path, "s_curve_{}_manifold.png".format(algorithm))

        oz = Manifold(
            ax=ax, manifold=algorithm,
            target='continuous', colors='nipy_spectral'
        )

        oz.fit(self.X, self.y)
        oz.poof(outpath=path)

    def plot_all_manifolds(self, path="images"):
        """
        Plot all s-curve examples
        """
        for algorithm in MANIFOLD_ALGORITHMS:
            # Pass the output directory through instead of silently using the default
            self.plot_manifold_embedding(algorithm, path=path)


if __name__ == '__main__':
    # curve = SCurveExample()
    # curve.plot_all_manifolds()

    dataset_example('occupancy', 'tsne', path="images/occupancy_tsne_manifold.png")
    # dataset_example('concrete', 'all')

    # select_features_example()
161 changes: 161 additions & 0 deletions docs/api/features/manifold.rst
@@ -0,0 +1,161 @@
.. -*- mode: rst -*-

Manifold Visualization
======================

The ``Manifold`` visualizer provides high dimensional visualization using
`manifold learning`_ to embed instances described by many dimensions into two
dimensions, thus allowing the creation of a scatter plot that shows latent
structures in data. Unlike decomposition methods such as PCA and SVD, manifolds
generally use nearest-neighbors approaches to embedding, allowing them to
capture non-linear structures that would otherwise be lost. The projections
that are produced can then be analyzed for noise or separability to determine
if it is possible to create a decision space in the data.

.. image:: images/concrete_tsne_manifold.png

The ``Manifold`` visualizer allows access to all currently available
scikit-learn manifold implementations by specifying the manifold as a string
to the visualizer. The currently implemented default manifolds are as follows:

============== ================================================================
Manifold       Description
============== ================================================================
``"lle"``      `Locally Linear Embedding`_ (LLE) uses many local linear
               decompositions to preserve globally non-linear structures.
``"ltsa"``     `LTSA LLE`_: local tangent space alignment is similar to LLE
               in that it uses locality to preserve neighborhood distances.
``"hessian"``  `Hessian LLE`_: an LLE regularization method that applies a
               Hessian-based quadratic form at each neighborhood.
``"modified"`` `Modified LLE`_ applies a regularization parameter to LLE.
``"isomap"``   `Isomap`_ seeks a lower dimensional embedding that maintains
               geodesic distances between each instance.
``"mds"``      `MDS`_: multi-dimensional scaling uses similarity to plot
               points that are near to each other close in the embedding.
``"spectral"`` `Spectral Embedding`_: a discrete approximation of the low
               dimensional manifold using a graph representation.
``"tsne"``     `t-SNE`_: converts the similarity of points into probabilities
               then uses those probabilities to create an embedding.
============== ================================================================
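
The lookup from a string name to a scikit-learn estimator can be sketched roughly as below. This is an illustration under stated assumptions, not Yellowbrick's implementation: the ``MANIFOLDS`` mapping and the ``resolve_manifold`` helper are hypothetical names, and the constructor arguments (for example ``n_neighbors=6`` for the Hessian variant) are illustrative choices.

```python
from sklearn.base import BaseEstimator
from sklearn.manifold import (
    Isomap, LocallyLinearEmbedding, MDS, SpectralEmbedding, TSNE
)

# Hypothetical lookup table: each name maps to a factory for a 2D embedding.
MANIFOLDS = {
    "lle": lambda: LocallyLinearEmbedding(n_components=2, method="standard"),
    "ltsa": lambda: LocallyLinearEmbedding(n_components=2, method="ltsa"),
    # Hessian LLE needs more than n_components * (n_components + 3) / 2 neighbors
    "hessian": lambda: LocallyLinearEmbedding(
        n_components=2, n_neighbors=6, method="hessian"
    ),
    "modified": lambda: LocallyLinearEmbedding(n_components=2, method="modified"),
    "isomap": lambda: Isomap(n_components=2),
    "mds": lambda: MDS(n_components=2),
    "spectral": lambda: SpectralEmbedding(n_components=2),
    "tsne": lambda: TSNE(n_components=2),
}


def resolve_manifold(manifold):
    """Return an estimator given either a name or an estimator instance."""
    if isinstance(manifold, str):
        key = manifold.lower()
        if key not in MANIFOLDS:
            raise ValueError("unknown manifold '{}'".format(manifold))
        return MANIFOLDS[key]()

    if isinstance(manifold, BaseEstimator):
        # Already an estimator: use it as supplied
        return manifold

    raise TypeError("manifold must be a string or a scikit-learn estimator")
```

Using factories rather than shared instances means each call with a string returns a fresh, unfitted estimator, while a user-supplied estimator is passed through untouched.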

Each manifold algorithm produces a different embedding and takes advantage of
different properties of the underlying data. Generally speaking, it requires
multiple attempts on new data to determine the manifold that works best for
the structures latent in your data. Note however, that different manifold
algorithms have different time, complexity, and resource requirements.
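
One way to compare algorithms before committing to one is to time each embedding on a small sample. The sketch below uses scikit-learn estimators directly on a synthetic S-curve; the sample size, neighbor count, and the choice of three algorithms are assumptions made to keep the example quick, not recommendations.

```python
import time

from sklearn.datasets import make_s_curve
from sklearn.manifold import Isomap, LocallyLinearEmbedding, SpectralEmbedding

# Small synthetic 3D data set (an S-shaped surface); sizes are illustrative
X, _ = make_s_curve(n_samples=300, random_state=42)

candidates = {
    "lle": LocallyLinearEmbedding(n_components=2, n_neighbors=10, random_state=42),
    "isomap": Isomap(n_components=2, n_neighbors=10),
    "spectral": SpectralEmbedding(n_components=2, n_neighbors=10, random_state=42),
}

embeddings, timings = {}, {}
for name, estimator in candidates.items():
    start = time.perf_counter()
    embeddings[name] = estimator.fit_transform(X)
    timings[name] = time.perf_counter() - start

# Report the algorithms from fastest to slowest on this sample
for name in sorted(timings, key=timings.get):
    print("{:<10s} {:.3f} seconds".format(name, timings[name]))
```

Timings on a sample only hint at relative cost; the ranking can change as the number of instances grows.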

Manifolds can be used on many types of problems, and the color used in the
scatter plot can describe the target of each instance. In an unsupervised or
clustering problem, a single color is used to show structure and overlap. In
a classification problem, discrete colors are used for each class. In a
regression problem, a color map can be used to describe points as a heat map
of their regression values.

Discrete Target
---------------

In a classification or clustering problem, the instances can be described by
discrete labels - the classes or categories in the supervised problem, or the
clusters they belong to in the unsupervised version. The manifold visualizes
this by assigning a color to each label and showing the labels in a legend.

.. code:: python

    # Load the classification data set
    data = load_data('occupancy')

    # Specify the features of interest
    features = [
        "temperature", "relative humidity", "light", "C02", "humidity"
    ]

    # Extract the data from the data frame.
    X = data[features]
    y = data.occupancy

.. code:: python

    from yellowbrick.features.manifold import Manifold

    visualizer = Manifold(manifold='tsne', target='discrete')
    visualizer.fit_transform(X,y)
    visualizer.poof()

.. image:: images/occupancy_tsne_manifold.png

The visualization also displays the amount of time it takes to generate the
embedding; as you can see, this can take a long time even for relatively
small datasets. One tip is to scale your data using the ``StandardScaler``;
another is to sample your instances (e.g. using ``train_test_split`` to
preserve class stratification) or to filter features to decrease sparsity in
the dataset.
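
The scaling and sampling tips can be combined before fitting the visualizer. The following is a minimal sketch on synthetic stand-in data (the array shapes, the 25% sample size, and the class balance are assumptions for illustration); it uses ``train_test_split`` with ``stratify`` to keep class proportions and ``StandardScaler`` to standardize the sampled features.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: 1,000 instances, 5 features, imbalanced binary target
rng = np.random.RandomState(42)
X = rng.normal(loc=50.0, scale=20.0, size=(1000, 5))
y = np.array([1] * 200 + [0] * 800)

# Keep a 25% sample; stratify=y preserves the class proportions
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.25, stratify=y, random_state=42
)

# Standardize each feature to zero mean and unit variance
X_scaled = StandardScaler().fit_transform(X_sample)

print("sample shape:", X_scaled.shape)
print("positive class share:", y_sample.mean())
```

The resulting ``X_scaled`` and ``y_sample`` can then be passed to the visualizer in place of the full ``X`` and ``y``.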

One common mechanism is to use ``SelectKBest`` to select the features that have
a statistical correlation with the target. For example, we can use
the ``f_classif`` score to find the 3 best features in our occupancy dataset.

.. code:: python

    from sklearn.pipeline import Pipeline
    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import f_classif

    model = Pipeline([
        ("selectk", SelectKBest(k=3, score_func=f_classif)),
        ("viz", Manifold(manifold='isomap', target='discrete')),
    ])

    X, y = load_occupancy_data()
    model.fit(X, y)
    model.named_steps['viz'].poof()

.. image:: images/occupancy_select_k_best_isomap_manifold.png
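
To see which columns ``SelectKBest`` keeps before they reach the manifold, the selector can be fitted on its own and inspected. The sketch below uses a synthetic classification data set and placeholder feature names (both assumptions, since the occupancy data is not loaded here).

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif

# Placeholder feature names for a synthetic 5-feature data set
feature_names = ["f0", "f1", "f2", "f3", "f4"]
X, y = make_classification(
    n_samples=200, n_features=5, n_informative=3, n_redundant=0,
    random_state=42
)

selector = SelectKBest(k=3, score_func=f_classif)
X_selected = selector.fit_transform(X, y)

# get_support() returns a boolean mask over the original columns
selected = [
    name for name, keep in zip(feature_names, selector.get_support()) if keep
]

print("selected features:", selected)
print("reduced shape:", X_selected.shape)
```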

Continuous Target
-----------------

For a regression target or to specify color as a heat-map of continuous
values, specify ``target='continuous'``. Note that by default the param
``target='auto'`` is set, which determines if the target is discrete or
continuous by counting the number of unique values in ``y``.
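
The unique-value heuristic can be sketched in a few lines. The commit message for this change gives the threshold as fewer than 10 unique values; the function name ``infer_target_type`` and the ``max_discrete`` parameter below are hypothetical, and this is not the library's code.

```python
def infer_target_type(y, max_discrete=10):
    """
    Classify a target as 'single', 'discrete', or 'continuous'.

    'single' means no target was supplied, so one color would be used.
    """
    if y is None:
        return "single"

    # Fewer than max_discrete unique values is treated as a discrete target
    n_unique = len(set(y))
    return "discrete" if n_unique < max_discrete else "continuous"


print(infer_target_type(None))
print(infer_target_type(["occupied", "unoccupied", "occupied"]))
print(infer_target_type([i * 0.5 for i in range(50)]))
```

A heuristic like this misclassifies regression targets with only a handful of distinct values, which is one reason to pass ``target`` explicitly when the type is known.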

.. code:: python

    # Specify the features of interest
    feature_names = [
        'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'
    ]
    target_name = 'strength'

    # Get the X and y data from the DataFrame
    X = data[feature_names]
    y = data[target_name]

.. code:: python

    visualizer = Manifold(manifold='isomap', target='continuous')
    visualizer.fit_transform(X,y)
    visualizer.poof()

.. image:: images/concrete_isomap_manifold.png

API Reference
-------------

.. automodule:: yellowbrick.features.manifold
    :members: Manifold
    :undoc-members:
    :show-inheritance:


.. _`manifold learning`: http://scikit-learn.org/stable/modules/manifold.html
.. _`manifold comparisons`: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html
.. _`Locally Linear Embedding`: http://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding
.. _`LTSA LLE`: http://scikit-learn.org/stable/modules/manifold.html#local-tangent-space-alignment
.. _`Hessian LLE`: http://scikit-learn.org/stable/modules/manifold.html#hessian-eigenmapping
.. _`Modified LLE`: http://scikit-learn.org/stable/modules/manifold.html#modified-locally-linear-embedding
.. _`Isomap`: http://scikit-learn.org/stable/modules/manifold.html#isomap
.. _`MDS`: http://scikit-learn.org/stable/modules/manifold.html#multi-dimensional-scaling-mds
.. _`Spectral Embedding`: http://scikit-learn.org/stable/modules/manifold.html#spectral-embedding
.. _`t-SNE`: http://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne
1 change: 1 addition & 0 deletions docs/index.rst
@@ -25,6 +25,7 @@ Feature Visualization
- :doc:`api/features/pcoords`: horizontal visualization of instances
- :doc:`Radial Visualization <api/features/radviz>`: separation of instances around a circular plot
- :doc:`api/features/pca`: projection of instances based on principal components
- :doc:`api/features/manifold`: high dimensional visualization with manifold learning
- :doc:`api/features/importances`: rank features by importance or linear coefficients for a specific model
- :doc:`api/features/rfecv`: find the best subset of features based on importance
- :doc:`Scatter and Joint Plots<api/features/scatter>`: direct data visualization with feature selection