Merge branch 'release-0.5'

DistrictDataLabs · Aug 9, 2017 · 81594c1
2 parents 09c8aea + 38edcca
commit 81594c1
Showing 460 changed files with 12,554 additions and 3,178 deletions.
diff --git a/.gitignore b/.gitignore
@@ -123,3 +123,9 @@ fabric.properties
 # *.ipr
 
 .idea
+
+# VisualTestCase Outputs
+/tests/actual_images/*
+
+# Data downloaded from Yellowbrick 
+data/
diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md
@@ -95,14 +95,13 @@ The Yellowbrick repository is set up in a typical production/release/development
 
 You can work directly in your fork and create a pull request from your fork's develop branch into ours. We also recommend setting up an `upstream` remote so that you can easily pull the latest development changes from the main Yellowbrick repository (see [configuring a remote for a fork](https://help.github.com/articles/configuring-a-remote-for-a-fork/)). You can do that as follows:
 
-``
-$ git remote add upstream https://github.com/DistrictDataLabs/yellowbrick.git
-$ git remote -v
-origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
-origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)
-upstream  https://github.com/DistrictDataLabs/yellowbrick.git (fetch)
-upstream  https://github.com/DistrictDataLabs/yellowbrick.git (push)
-``
+`$ git remote add upstream https://github.com/DistrictDataLabs/yellowbrick.git`
+`$ git remote -v`
+> origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (fetch)
+> origin    https://github.com/YOUR_USERNAME/YOUR_FORK.git (push)
+> upstream  https://github.com/DistrictDataLabs/yellowbrick.git (fetch)
+> upstream  https://github.com/DistrictDataLabs/yellowbrick.git (push)
+
 
 When you're ready, request a code review for your pull request. Then, when reviewed and approved, you can merge your fork into our main branch. Make sure to use the "Squash and Merge" option in order to create a Git history that is understandable.
 
@@ -216,12 +215,18 @@ class MyVisualizerTests(VisualTestCase, DatasetMixin):
             self.fail("my visualizer didn't work")
 ```
 
-Tests can be run as follows::
+The entire test suite can be run as follows:
 
 ```
 $ make test
 ```
 
+You can also run your own test file as follows:
+
+```
+$ nosetests tests/test_your_visualizer.py
+```
+
 The Makefile uses the nosetest runner and testing suite as well as the coverage library, so make sure you have those dependencies installed! The `DatasetMixin` also requires requests.py to fetch data from our Amazon S3 account.
 
 ### Documentation

diff --git a/DESCRIPTION.rst b/DESCRIPTION.rst
@@ -4,7 +4,7 @@
 
 .. |Visualizers| image:: http://www.scikit-yb.org/en/latest/_images/visualizers.png
     :width: 800 px
-.. _Visualizers: http://scikit-yb.org/
+.. _Visualizers: http://www.scikit-yb.org/
 
 Yellowbrick
 ===========

diff --git a/docs/about.rst b/docs/about.rst
@@ -1,8 +1,15 @@
-=====
 About
 =====
 
-Yellowbrick is an open source, pure Python project that extends Scikit-Learn with visual analysis and diagnostic tools. The Yellowbrick API also wraps Matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
+.. image:: images/yellowbrickroad.jpg
+
+Image by QuatroCinco_, used with permission, Flickr Creative Commons.
+
+Yellowbrick is an open source, pure Python project that extends the Scikit-Learn API_ with visual analysis and diagnostic tools. The Yellowbrick API also wraps Matplotlib to create publication-ready figures and interactive data explorations while still allowing developers fine-grain control of figures. For users, Yellowbrick can help evaluate the performance, stability, and predictive value of machine learning models, and assist in diagnosing problems throughout the machine learning workflow.
+
+Recently, much of this workflow has been automated through grid search methods, standardized APIs, and GUI-based applications. In practice, however, human intuition and guidance can home in on quality models more effectively than exhaustive search. By visualizing the model selection process, data scientists can steer towards final, explainable models and avoid pitfalls and traps.
+
+The Yellowbrick library is a diagnostic visualization platform for machine learning that allows data scientists to steer the model selection process. Yellowbrick extends the Scikit-Learn API with a new core object: the Visualizer. Visualizers allow visual models to be fit and transformed as part of the Scikit-Learn Pipeline process, providing visual diagnostics throughout the transformation of high dimensional data.
 
 The Model Selection Triple
 --------------------------
@@ -49,3 +56,21 @@ We think that's a pretty fair deal, and we're big believers in open source. If y
 .. _`@rebeccabilbro`: https://github.com/rebeccabilbro
 .. _`@bbengfort`: https://github.com/bbengfort
 .. _`District Data Labs`: http://www.districtdatalabs.com/
+
+Presentations
+-------------
+
+Yellowbrick has enjoyed the spotlight at a few conferences and in several presentations. We hope that these videos, talks, and slides will help you understand Yellowbrick a bit better.
+
+Videos:
+    - `Visual Diagnostics for More Informed Machine Learning: Within and Beyond Scikit-Learn (PyCon 2016) <https://youtu.be/c5DaaGZWQqY>`_
+    - `Visual Diagnostics for More Informed Machine Learning (PyData Carolinas 2016) <https://youtu.be/cgtNPx7fJUM>`_
+    - `Yellowbrick: Steering Machine Learning with Visual Transformers (PyData London 2017) <https://youtu.be/2ZKng7pCB5k>`_
+
+Slides:
+    - `Visualizing the Model Selection Process <https://www.slideshare.net/BenjaminBengfort/visualizing-the-model-selection-process>`_
+    - `Visualizing Model Selection with Scikit-Yellowbrick <https://www.slideshare.net/BenjaminBengfort/visualizing-model-selection-with-scikityellowbrick-an-introduction-to-developing-visualizers>`_
+    - `Visual Pipelines for Text Analysis (Data Intelligence 2017) <https://speakerdeck.com/dataintelligence/visual-pipelines-for-text-analysis>`_
+
+.. _QuatroCinco: https://flic.kr/p/2Yj9mj
+.. _API: http://scikit-learn.org/stable/modules/classes.html
diff --git a/docs/api/anscombe.py b/docs/api/anscombe.py
@@ -0,0 +1,7 @@
+# Creates the anscombe visualization. 
+
+import yellowbrick as yb
+import matplotlib.pyplot as plt
+
+g = yb.anscombe()
+plt.savefig("images/anscombe.png")
diff --git a/docs/api/anscombe.rst b/docs/api/anscombe.rst
@@ -0,0 +1,24 @@
+Anscombe's Quartet
+==================
+
+Yellowbrick has learned Anscombe's lesson - which is why we believe that
+visual diagnostics are vital to machine learning.
+
+.. code:: python
+
+    import yellowbrick as yb
+    import matplotlib.pyplot as plt
+
+    g = yb.anscombe()
+    plt.show()
+
+
+.. image:: images/anscombe.png
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.anscombe
+    :members:
+    :undoc-members:
+    :show-inheritance:
diff --git a/docs/api/classifier/class_balance.py b/docs/api/classifier/class_balance.py
@@ -0,0 +1,30 @@
+import pandas as pd
+import matplotlib.pyplot as plt
+
+from sklearn.ensemble import RandomForestClassifier
+from sklearn.model_selection import train_test_split
+
+from yellowbrick.classifier import ClassBalance
+
+
+if __name__ == '__main__':
+    # Load the classification data set
+    data = pd.read_csv("../../../examples/data/occupancy/occupancy.csv")
+
+    features = ["temperature", "relative humidity", "light", "C02", "humidity"]
+    classes = ['unoccupied', 'occupied']
+
+    # Extract the numpy arrays from the data frame
+    X = data[features].as_matrix()
+    y = data.occupancy.as_matrix()
+
+    # Create the train and test data
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+    # Instantiate the classification model and visualizer
+    forest = RandomForestClassifier()
+    visualizer = ClassBalance(forest, classes=classes)
+
+    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
+    visualizer.score(X_test, y_test)  # Evaluate the model on the test data
+    g = visualizer.poof(outpath="images/class_balance.png")  # Draw/show/poof the data
diff --git a/docs/api/classifier/class_balance.rst b/docs/api/classifier/class_balance.rst
@@ -0,0 +1,43 @@
+Class Balance
+=============
+
+Oftentimes classifiers perform badly because of a class imbalance. A class balance chart can help prepare the user for such a case by showing the support (the number of instances) for each class in the fitted
+classification model.
+
+.. code:: python
+
+    # Load the classification data set
+    data = load_data('occupancy')
+
+    # Specify the features of interest and the classes of the target
+    features = ["temperature", "relative humidity", "light", "C02", "humidity"]
+    classes = ['unoccupied', 'occupied']
+
+    # Extract the numpy arrays from the data frame
+    X = data[features].as_matrix()
+    y = data.occupancy.as_matrix()
+
+    # Create the train and test data
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+.. code:: python
+
+    # Instantiate the classification model and visualizer
+    forest = RandomForestClassifier()
+    visualizer = ClassBalance(forest, classes=classes)
+
+    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
+    visualizer.score(X_test, y_test)  # Evaluate the model on the test data
+    g = visualizer.poof()             # Draw/show/poof the data
+
+
+.. image:: images/class_balance.png
+
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.classifier.class_balance
+    :members: ClassBalance
+    :undoc-members:
+    :show-inheritance:
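
Annotation: the "support" that the class balance chart displays is the number of instances per class. The following is a minimal standard-library sketch of that idea only; it is not how ``ClassBalance`` is implemented. The helper names ``class_support`` and ``imbalance_ratio`` and the toy labels are hypothetical and do not come from the occupancy data set.

```python
from collections import Counter


def class_support(y):
    """Return the number of instances observed for each class label."""
    return dict(Counter(y))


def imbalance_ratio(support):
    """Return the ratio of the largest class count to the smallest."""
    counts = list(support.values())
    return max(counts) / min(counts)


# Hypothetical target labels used purely for illustration
y = ["unoccupied"] * 8 + ["occupied"] * 2

support = class_support(y)
ratio = imbalance_ratio(support)
```

A ratio far above 1 suggests that one class dominates the target, which is the situation the chart is meant to surface before a model's scores are trusted.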
diff --git a/docs/api/classifier/classification_report.py b/docs/api/classifier/classification_report.py
@@ -0,0 +1,30 @@
+import pandas as pd
+import matplotlib.pyplot as plt
+
+from sklearn.naive_bayes import GaussianNB
+from sklearn.model_selection import train_test_split
+
+from yellowbrick.classifier import ClassificationReport
+
+
+if __name__ == '__main__':
+    # Load the classification data set
+    data = pd.read_csv("../../../examples/data/occupancy/occupancy.csv")
+
+    features = ["temperature", "relative humidity", "light", "C02", "humidity"]
+    classes = ['unoccupied', 'occupied']
+
+    # Extract the numpy arrays from the data frame
+    X = data[features].as_matrix()
+    y = data.occupancy.as_matrix()
+
+    # Create the train and test data
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+    # Instantiate the classification model and visualizer
+    bayes = GaussianNB()
+    visualizer = ClassificationReport(bayes, classes=classes)
+
+    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
+    visualizer.score(X_test, y_test)  # Evaluate the model on the test data
+    g = visualizer.poof(outpath="images/classification_report.png")  # Draw/show/poof the data
diff --git a/docs/api/classifier/classification_report.rst b/docs/api/classifier/classification_report.rst
@@ -0,0 +1,45 @@
+Classification Report
+=====================
+
+The classification report visualizer displays the precision, recall, and
+F1 scores for the model. In order to support easier interpretation and problem detection, the report integrates numerical scores with a color-coded
+heatmap.
+
+.. code:: python
+
+    # Load the classification data set
+    data = load_data('occupancy')
+
+    # Specify the features of interest and the classes of the target
+    features = ["temperature", "relative humidity", "light", "C02", "humidity"]
+    classes = ['unoccupied', 'occupied']
+
+    # Extract the numpy arrays from the data frame
+    X = data[features].as_matrix()
+    y = data.occupancy.as_matrix()
+
+    # Create the train and test data
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
+
+.. code:: python
+
+    # Instantiate the classification model and visualizer
+    bayes = GaussianNB()
+    visualizer = ClassificationReport(bayes, classes=classes)
+
+    visualizer.fit(X_train, y_train)  # Fit the training data to the visualizer
+    visualizer.score(X_test, y_test)  # Evaluate the model on the test data
+    g = visualizer.poof()             # Draw/show/poof the data
+
+
+
+.. image:: images/classification_report.png
+
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.classifier.classification_report
+    :members: ClassificationReport
+    :undoc-members:
+    :show-inheritance:
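
Annotation: the three scores shown in the classification report follow the standard definitions of precision, recall, and F1 for a single class. The sketch below computes them by hand under those definitions; the function name ``precision_recall_f1`` and the toy labels are hypothetical, and this is not the code path used by ``ClassificationReport`` itself.

```python
def precision_recall_f1(y_true, y_pred, label):
    """Compute precision, recall, and F1 for one class label."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == label and p == label)  # true positives
    fp = sum(1 for t, p in pairs if t != label and p == label)  # false positives
    fn = sum(1 for t, p in pairs if t == label and p != label)  # false negatives

    # Guard against division by zero when a class is never predicted or never present
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Hypothetical labels: 1 = occupied, 0 = unoccupied
y_true = [1, 1, 1, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 1, 0, 0, 0, 0]

precision, recall, f1 = precision_recall_f1(y_true, y_pred, label=1)
```

Here the positive class has two true positives, one false positive, and one false negative, so all three scores come out to two thirds.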
diff --git a/docs/api/classifier/confusion_matrix.py b/docs/api/classifier/confusion_matrix.py
@@ -0,0 +1,26 @@
+import pandas as pd
+import matplotlib.pyplot as plt
+
+from sklearn.datasets import load_digits
+from sklearn.linear_model import LogisticRegression
+from sklearn.model_selection import train_test_split
+
+from yellowbrick.classifier import ConfusionMatrix
+
+
+if __name__ == '__main__':
+    # Load the classification data set
+    digits = load_digits()
+    X = digits.data
+    y = digits.target
+
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
+
+    model = LogisticRegression()
+
+    # The ConfusionMatrix visualizer takes a model
+    cm = ConfusionMatrix(model, classes=[0,1,2,3,4,5,6,7,8,9])
+
+    cm.fit(X_train, y_train)  # Fit the training data to the visualizer
+    cm.score(X_test, y_test)  # Evaluate the model on the test data
+    g = cm.poof(outpath="images/confusion_matrix.png")  # Draw/show/poof the data
diff --git a/docs/api/classifier/confusion_matrix.rst b/docs/api/classifier/confusion_matrix.rst
@@ -0,0 +1,65 @@
+Confusion Matrix
+================
+
+The ``ConfusionMatrix`` visualizer is a ScoreVisualizer that takes a
+fitted Scikit-Learn classifier and a set of test X and y values and
+returns a report showing how the predicted class of each test value
+compares to its actual class. Data scientists use confusion matrices
+to understand which classes are most easily confused. They provide
+information similar to that available in a ClassificationReport, but
+rather than top-level scores they provide deeper insight into the
+classification of individual data points.
+
+Below are a few examples of using the ConfusionMatrix visualizer; more
+information can be found by looking at the
+Scikit-Learn documentation on `confusion matrices <http://scikit-learn.org/stable/modules/generated/sklearn.metrics.confusion_matrix.html>`_.
+
+.. code:: python
+
+    #First do our imports
+    import yellowbrick
+
+    from sklearn.datasets import load_digits
+    from sklearn.model_selection import train_test_split
+    from sklearn.linear_model import LogisticRegression
+
+    from yellowbrick.classifier import ConfusionMatrix
+
+.. code:: python
+
+    # We'll use the handwritten digits data set from scikit-learn.
+    # Each instance of this dataset is an 8x8 pixel image of a handwritten number.
+    # digits.data flattens these 64 pixels into a single array of features
+    digits = load_digits()
+    X = digits.data
+    y = digits.target
+
+    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=11)
+
+    model = LogisticRegression()
+
+    # The ConfusionMatrix visualizer takes a model
+    cm = ConfusionMatrix(model, classes=[0,1,2,3,4,5,6,7,8,9])
+
+    # Fit fits the passed model. This is unnecessary if you pass the visualizer a pre-fitted model
+    cm.fit(X_train, y_train)
+
+    # To create the ConfusionMatrix, we need some test data. Score runs predict() on the data
+    # and then creates the confusion_matrix from Scikit-Learn.
+    cm.score(X_test, y_test)
+
+    # How did we do?
+    cm.poof()
+
+
+
+.. image:: images/confusion_matrix.png
+
+
+API Reference
+-------------
+
+.. automodule:: yellowbrick.classifier.confusion_matrix
+    :members: ConfusionMatrix
+    :undoc-members:
+    :show-inheritance:
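
Annotation: to make the structure of a confusion matrix concrete, here is a small pure-Python sketch that tallies true versus predicted labels. It assumes the common convention of rows for actual classes and columns for predicted classes; the function and the toy labels are hypothetical and stand in for what Scikit-Learn computes, not for the visualizer's own code.

```python
def confusion_matrix(y_true, y_pred, classes):
    """Count how often each actual class (row) was predicted as each class (column)."""
    index = {c: i for i, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for actual, predicted in zip(y_true, y_pred):
        matrix[index[actual]][index[predicted]] += 1
    return matrix


# Hypothetical labels for a three-class problem
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]

cm = confusion_matrix(y_true, y_pred, classes=[0, 1, 2])
```

Off-diagonal cells show which pairs of classes are confused: in this toy case one instance of class 0 was predicted as class 1, and one instance of class 2 was predicted as class 0.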
diff --git a/docs/api/classifier/images/class_balance.png b/docs/api/classifier/images/class_balance.png
diff --git a/docs/api/classifier/images/classification_report.png b/docs/api/classifier/images/classification_report.png
diff --git a/docs/api/classifier/images/confusion_matrix.png b/docs/api/classifier/images/confusion_matrix.png
diff --git a/docs/api/classifier/images/rocauc.png b/docs/api/classifier/images/rocauc.png