Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
Manifold Learning Visualization (#399)
Manifold learning includes algorithms for projecting high dimensional data into a 2 or 3-dimensional space for visualization. Unlike PCA or SVD, manifold learning does a better job of embedding points that have non-linear relationships, making it useful for detecting clusters in data and therefore separability for learning algorithms. Because there are a variety of manifold algorithms, this visualizer groups them all into a single visual framework, allowing specification of which model to use by supplying an estimator or a string that refers to the algorithm in question.

* Added continuous and discrete colors

Created a lightweight method to detect continuous or discrete targets, then embedded the color selection methodology into the class. When no target is specified, a single color is used. When a discrete target is detected (<10 unique values) or specified, resolve_colors is used and a legend is added to the plot. When a continuous target is detected or specified, a colormap is used and a color bar is added to the plot.
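The target detection described in the commit message can be sketched roughly as follows. This is a minimal illustration, not the visualizer's actual implementation: the function name `detect_target_type`, the constant `MAX_DISCRETE_CLASSES`, and the returned labels are hypothetical; only the "<10 unique values" rule comes from the message above.

```python
# Hypothetical threshold, taken from the "<10 unique values" rule in the commit message.
MAX_DISCRETE_CLASSES = 10


def detect_target_type(y, max_classes=MAX_DISCRETE_CLASSES):
    """Classify a target vector as 'single', 'discrete', or 'continuous'.

    'single' means no target was supplied, so one color would be used.
    """
    if y is None:
        return "single"

    # Count distinct target values; few distinct values suggests class labels.
    n_unique = len(set(y))
    if n_unique < max_classes:
        return "discrete"
    return "continuous"
```

Under these assumptions, a "discrete" result would lead to per-class colors and a legend, and a "continuous" result to a colormap and a color bar.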
Showing 29 changed files with 1,115 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,175 @@ | ||
#!/usr/bin/env python | ||
# manifold.py | ||
# Produce images for manifold documentation. | ||
# | ||
# Author: Benjamin Bengfort <[email protected]> | ||
# Created: Sat May 12 11:26:18 2018 -0400 | ||
# | ||
# ID: manifold.py [] [email protected] $ | ||
|
||
""" | ||
Produce images for manifold documentation. | ||
""" | ||
|
||
########################################################################## | ||
## Imports | ||
########################################################################## | ||
|
||
import os | ||
|
||
import pandas as pd | ||
import matplotlib.pyplot as plt | ||
|
||
from sklearn import datasets | ||
from sklearn.pipeline import Pipeline | ||
from sklearn.feature_selection import SelectKBest | ||
from sklearn.feature_selection import f_classif#, mutual_info_classif | ||
from yellowbrick.features.manifold import Manifold, MANIFOLD_ALGORITHMS | ||
|
||
SKIP = ( | ||
'ltsa', # produces no result | ||
'hessian', # errors because of matrix | ||
'mds', # uses way too much memory | ||
) | ||
|
||
FIXTURES = os.path.normpath(os.path.join( | ||
os.path.dirname(__file__), | ||
"..", "..", "..", "examples", "data" | ||
)) | ||
|
||
|
||
def load_occupancy_data(): | ||
# Load the classification data set | ||
data = pd.read_csv(os.path.join(FIXTURES, 'occupancy', 'occupancy.csv')) | ||
|
||
# Specify the features of interest and the classes of the target | ||
features = ["temperature", "relative humidity", "light", "C02", "humidity"] | ||
|
||
X = data[features] | ||
y = pd.Series(['occupied' if y == 1 else 'unoccupied' for y in data.occupancy]) | ||
|
||
return X, y | ||
|
||
|
||
def load_concrete_data(): | ||
# Load a regression data set | ||
data = pd.read_csv(os.path.join(FIXTURES, 'concrete', 'concrete.csv')) | ||
|
||
# Specify the features of interest | ||
feature_names = ['cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age'] | ||
target_name = 'strength' | ||
|
||
# Get the X and y data from the DataFrame | ||
X = data[feature_names] | ||
y = data[target_name] | ||
|
||
return X, y | ||
|
||
|
||
def dataset_example(dataset="occupancy", manifold="all", path="images/"): | ||
if manifold == "all": | ||
if path is not None and not os.path.isdir(path): | ||
raise ValueError("please specify a directory to save examples to") | ||
|
||
for algorithm in MANIFOLD_ALGORITHMS: | ||
if algorithm in SKIP: continue | ||
|
||
print("generating {} {} manifold".format(dataset, algorithm)) | ||
fpath = os.path.join(path, "{}_{}_manifold.png".format(dataset, algorithm)) | ||
try: | ||
dataset_example(dataset, algorithm, fpath) | ||
except Exception as e: | ||
print("could not visualize {} manifold on {} data: {}".format(algorithm, dataset, e)) | ||
continue | ||
|
||
|
||
# Break here! | ||
return | ||
|
||
# Create single example | ||
_, ax = plt.subplots(figsize=(9,6)) | ||
oz = Manifold(ax=ax, manifold=manifold) | ||
|
||
if dataset == "occupancy": | ||
X, y = load_occupancy_data() | ||
elif dataset == "concrete": | ||
X, y = load_concrete_data() | ||
else: | ||
raise Exception("unknown dataset '{}'".format(dataset)) | ||
|
||
oz.fit(X, y) | ||
oz.poof(outpath=path) | ||
|
||
|
||
def select_features_example(algorithm='isomap', path="images/occupancy_select_k_best_isomap_manifold.png"): | ||
_, ax = plt.subplots(figsize=(9,6)) | ||
|
||
model = Pipeline([ | ||
("selectk", SelectKBest(k=3, score_func=f_classif)), | ||
("viz", Manifold(ax=ax, manifold=algorithm)), | ||
]) | ||
|
||
X, y = load_occupancy_data() | ||
model.fit(X, y) | ||
model.named_steps['viz'].poof(outpath=path) | ||
|
||
|
||
class SCurveExample(object): | ||
""" | ||
Creates an S-curve example and multiple visualizations | ||
""" | ||
|
||
def __init__(self, n_points=1000, random_state=42): | ||
self.X, self.y = datasets.make_s_curve( | ||
n_points, random_state=random_state | ||
) | ||
|
||
def _make_path(self, path, name): | ||
""" | ||
Makes directories as needed | ||
""" | ||
if not os.path.exists(path): | ||
os.makedirs(path) | ||
|
||
if os.path.isdir(path) : | ||
return os.path.join(path, name) | ||
|
||
return path | ||
|
||
def plot_original_3d(self, path="images"): | ||
""" | ||
Plot the original data in 3-dimensional space | ||
""" | ||
raise NotImplementedError("nyi") | ||
|
||
def plot_manifold_embedding(self, algorithm="lle", path="images"): | ||
""" | ||
Draw the manifold embedding for the specified algorithm | ||
""" | ||
_, ax = plt.subplots(figsize=(9,6)) | ||
path = self._make_path(path, "s_curve_{}_manifold.png".format(algorithm)) | ||
|
||
oz = Manifold( | ||
ax=ax, manifold=algorithm, | ||
target='continuous', colors='nipy_spectral' | ||
) | ||
|
||
oz.fit(self.X, self.y) | ||
oz.poof(outpath=path) | ||
|
||
def plot_all_manifolds(self, path="images"): | ||
""" | ||
Plot all s-curve examples | ||
""" | ||
for algorithm in MANIFOLD_ALGORITHMS: | ||
self.plot_manifold_embedding(algorithm, path=path) | ||
|
||
|
||
if __name__ == '__main__': | ||
# curve = SCurveExample() | ||
# curve.plot_all_manifolds() | ||
|
||
dataset_example('occupancy', 'tsne', path="images/occupancy_tsne_manifold.png") | ||
# dataset_example('concrete', 'all') | ||
|
||
# select_features_example() |
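The output-path handling that `_make_path` aims for in the script above could be sketched as below. This is an illustrative variant under stated assumptions, not the script's code: the helper name `resolve_output_path` is hypothetical, and treating any path with a file extension as a file path is a design choice made only for this sketch.

```python
import os


def resolve_output_path(path, name):
    """Return a file path for an image, creating directories as needed.

    If ``path`` looks like a file (it has an extension), it is returned as-is
    after its parent directory is created. Otherwise ``path`` is treated as a
    directory, created if missing, and ``name`` is joined onto it.
    """
    if os.path.splitext(path)[1]:
        parent = os.path.dirname(path)
        if parent:
            os.makedirs(parent, exist_ok=True)
        return path

    os.makedirs(path, exist_ok=True)
    return os.path.join(path, name)
```

Checking for an extension first avoids creating a directory named like an image file when a full file path is passed in.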
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,161 @@ | ||
.. -*- mode: rst -*- | ||
Manifold Visualization | ||
====================== | ||
|
||
The ``Manifold`` visualizer provides high dimensional visualization using | ||
`manifold learning`_ | ||
to embed instances described by many dimensions into 2 dimensions, thus allowing the | ||
creation of a scatter plot that shows latent structures in data. Unlike | ||
decomposition methods such as PCA and SVD, manifolds generally use | ||
nearest-neighbors approaches to embedding, allowing them to capture non-linear | ||
structures that would be otherwise lost. The projections that are produced | ||
can then be analyzed for noise or separability to determine if it is possible | ||
to create a decision space in the data. | ||
|
||
.. image:: images/concrete_tsne_manifold.png | ||
|
||
The ``Manifold`` visualizer allows access to all currently available | ||
scikit-learn manifold implementations by specifying the manifold as a string to the visualizer. The currently implemented default manifolds are as follows: | ||
|
||
============== ============================================================ | ||
Manifold Description | ||
-------------- ------------------------------------------------------------ | ||
``"lle"`` `Locally Linear Embedding`_ (LLE) uses many local linear | ||
decompositions to preserve globally non-linear structures. | ||
``"ltsa"`` `LTSA LLE`_: local tangent space alignment is similar to LLE | ||
in that it uses locality to preserve neighborhood distances. | ||
``"hessian"`` `Hessian LLE`_: an LLE regularization method that applies a | ||
Hessian-based quadratic form at each neighborhood. | ||
``"modified"`` `Modified LLE`_ applies a regularization parameter to LLE. | ||
``"isomap"`` `Isomap`_ seeks a lower dimensional embedding that maintains | ||
geometric distances between each instance. | ||
``"mds"`` `MDS`_: multi-dimensional scaling uses similarity to plot | ||
points that are near to each other close in the embedding. | ||
``"spectral"`` `Spectral Embedding`_: a discrete approximation of the low | ||
dimensional manifold using a graph representation. | ||
``"tsne"`` `t-SNE`_: converts the similarity of points into probabilities | ||
then uses those probabilities to create an embedding. | ||
============== ============================================================ | ||
|
||
Each manifold algorithm produces a different embedding and takes advantage of | ||
different properties of the underlying data. Generally speaking, it requires | ||
multiple attempts on new data to determine the manifold that works best for | ||
the structures latent in your data. Note, however, that different manifold | ||
algorithms have different time complexity and resource requirements. | ||
|
||
Manifolds can be used on many types of problems, and the color used in the | ||
scatter plot can describe the target value of each instance. In an unsupervised or | ||
clustering problem, a single color is used to show structure and overlap. In | ||
a classification problem discrete colors are used for each class. In a | ||
regression problem, a color map can be used to describe points as a heat map | ||
of their regression values. | ||
|
||
Discrete Target | ||
--------------- | ||
|
||
In a classification or clustering problem, the instances can be described by | ||
discrete labels - the classes or categories in the supervised problem, or the | ||
clusters they belong to in the unsupervised version. The manifold visualizes | ||
this by assigning a color to each label and showing the labels in a legend. | ||
|
||
.. code:: python | ||
# Load the classification data set | ||
data = load_data('occupancy') | ||
# Specify the features of interest | ||
features = [ | ||
"temperature", "relative humidity", "light", "C02", "humidity" | ||
] | ||
# Extract the data from the data frame. | ||
X = data[features] | ||
y = data.occupancy | ||
.. code:: python | ||
from yellowbrick.features.manifold import Manifold | ||
visualizer = Manifold(manifold='tsne', target='discrete') | ||
visualizer.fit_transform(X,y) | ||
visualizer.poof() | ||
.. image:: images/occupancy_tsne_manifold.png | ||
|
||
The visualization also displays the amount of time it takes to generate the | ||
embedding; as you can see, this can take a long time even for relatively | ||
small datasets. One tip is to scale your data using the ``StandardScaler``; | ||
another is to sample your instances (e.g. using ``train_test_split`` to | ||
preserve class stratification) or to filter features to decrease sparsity in | ||
the dataset. | ||
|
||
One common mechanism is to use ``SelectKBest`` to select the features that have | ||
a statistical correlation with the target. For example, we can use | ||
the ``f_classif`` score to find the 3 best features in our occupancy dataset. | ||
|
||
.. code:: python | ||
from sklearn.pipeline import Pipeline | ||
from sklearn.feature_selection import SelectKBest | ||
from sklearn.feature_selection import f_classif | ||
model = Pipeline([ | ||
("selectk", SelectKBest(k=3, score_func=f_classif)), | ||
("viz", Manifold(manifold='isomap', target='discrete')), | ||
]) | ||
X, y = load_occupancy_data() | ||
model.fit(X, y) | ||
model.named_steps['viz'].poof() | ||
.. image:: images/occupancy_select_k_best_isomap_manifold.png | ||
|
||
Continuous Target | ||
----------------- | ||
|
||
For a regression target or to specify color as a heat-map of continuous | ||
values, specify ``target='continuous'``. Note that by default the param | ||
``target='auto'`` is set, which determines if the target is discrete or | ||
continuous by counting the number of unique values in ``y``. | ||
|
||
.. code:: python | ||
# Specify the features of interest | ||
feature_names = [ | ||
'cement', 'slag', 'ash', 'water', 'splast', 'coarse', 'fine', 'age' | ||
] | ||
target_name = 'strength' | ||
# Get the X and y data from the DataFrame | ||
X = data[feature_names] | ||
y = data[target_name] | ||
.. code:: python | ||
visualizer = Manifold(manifold='isomap', target='continuous') | ||
visualizer.fit_transform(X,y) | ||
visualizer.poof() | ||
.. image:: images/concrete_isomap_manifold.png | ||
|
||
API Reference | ||
------------- | ||
|
||
.. automodule:: yellowbrick.features.manifold | ||
:members: Manifold | ||
:undoc-members: | ||
:show-inheritance: | ||
|
||
|
||
.. _`manifold learning`: http://scikit-learn.org/stable/modules/manifold.html | ||
.. _`manifold comparisons`: http://scikit-learn.org/stable/auto_examples/manifold/plot_compare_methods.html | ||
.. _`Locally Linear Embedding`: http://scikit-learn.org/stable/modules/manifold.html#locally-linear-embedding | ||
.. _`LTSA LLE`: http://scikit-learn.org/stable/modules/manifold.html#local-tangent-space-alignment | ||
.. _`Hessian LLE`: http://scikit-learn.org/stable/modules/manifold.html#hessian-eigenmapping | ||
.. _`Modified LLE`: http://scikit-learn.org/stable/modules/manifold.html#modified-locally-linear-embedding | ||
.. _`Isomap`: http://scikit-learn.org/stable/modules/manifold.html#isomap | ||
.. _`MDS`: http://scikit-learn.org/stable/modules/manifold.html#multi-dimensional-scaling-mds | ||
.. _`Spectral Embedding`: http://scikit-learn.org/stable/modules/manifold.html#spectral-embedding | ||
.. _`t-SNE`: http://scikit-learn.org/stable/modules/manifold.html#t-distributed-stochastic-neighbor-embedding-t-sne |
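The tip in the documentation above about scaling and sampling before embedding can be illustrated with plain scikit-learn. This is a sketch under assumptions: it uses a synthetic dataset from `make_classification` rather than the occupancy data, calls scikit-learn's `Isomap` directly instead of the `Manifold` visualizer, and the 20% sample size and `n_neighbors=10` are arbitrary choices for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.manifold import Isomap
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a classification dataset (assumption, not the occupancy data).
X, y = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=42
)

# Keep a 20% sample; stratify=y preserves the class proportions in the sample.
X_sample, _, y_sample, _ = train_test_split(
    X, y, train_size=0.2, stratify=y, random_state=42
)

# Standardize each feature to zero mean and unit variance before embedding.
X_scaled = StandardScaler().fit_transform(X_sample)

# Embed the scaled sample into 2 dimensions.
embedding = Isomap(n_components=2, n_neighbors=10).fit_transform(X_scaled)
```

The resulting `embedding` has one 2-dimensional row per sampled instance and could be plotted as a scatter plot colored by `y_sample`.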
Binary file added (+59.7 KB): tests/baseline_images/test_features/test_manifold/test_manifold_classification.png
Binary file added (+54.4 KB): tests/baseline_images/test_features/test_manifold/test_manifold_pandas.png
Binary file added (+56.1 KB): tests/baseline_images/test_features/test_manifold/test_manifold_regression.png
Binary file added (+4.41 KB): tests/baseline_images/test_features/test_manifold/test_manifold_single.png