Commit

Merge branch 'release-1.3'

rebeccabilbro committed Feb 9, 2021
2 parents d58ab34 + ad81093 commit 38e6b31
Showing 102 changed files with 1,442 additions and 452 deletions.
12 changes: 6 additions & 6 deletions .travis.yml
@@ -2,18 +2,18 @@ dist: xenial
language: python
matrix:
include:
- name: "Python 3.6 on Xenial Linux"
python: '3.6'

- name: "Python 3.7 on Xenial Linux"
python: '3.7'

- name: "Miniconda 3.6 on Xenial Linux"
env: ANACONDA="3.6"
- name: "Python 3.8 on Xenial Linux"
python: '3.8'

- name: "Miniconda 3.7 on Xenial Linux"
env: ANACONDA="3.7"

- name: "Miniconda 3.8 on Xenial Linux"
env: ANACONDA="3.8"

before_install:
- sudo apt-get update;
- if [[ "$TRAVIS_OS_NAME" == "linux" ]]; then
@@ -23,8 +23,8 @@ before_install:

install:
- if [[ -z ${ANACONDA} ]]; then
pip install -r requirements.txt;
pip install -r tests/requirements.txt;
pip install -r requirements.txt;
pip install coveralls;
else
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-$MINICONDA_OS-x86_64.sh -O miniconda.sh;
2 changes: 1 addition & 1 deletion README.md
@@ -66,8 +66,8 @@ from sklearn.svm import LinearSVC
from yellowbrick.classifier import ROCAUC

model = LinearSVC()
model.fit(X,y)
visualizer = ROCAUC(model)
visualizer.fit(X,y)
visualizer.score(X,y)
visualizer.show()
```
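
The hunk above swaps the direct `model.fit(X,y)` call for `visualizer.fit(X,y)`, since the ModelVisualizer fits the wrapped estimator itself. A runnable version of the corrected quick start, as a sketch (the synthetic dataset here is an illustrative assumption, not part of the README):

```python
# Sketch of the corrected README usage; the dataset below is illustrative only.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC
from yellowbrick.classifier import ROCAUC

# A synthetic binary classification problem (assumption for demonstration)
X, y = make_classification(n_samples=1000, n_features=20, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

model = LinearSVC()
visualizer = ROCAUC(model)
visualizer.fit(X_train, y_train)   # fit() also fits the wrapped estimator
visualizer.score(X_test, y_test)   # computes ROC/AUC and draws the curves
visualizer.show()
```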
44 changes: 44 additions & 0 deletions docs/api/model_selection/importances.rst
@@ -111,6 +111,50 @@ Taking the mean of the importances may be undesirable for several reasons.
viz.fit(X, y)
viz.show()

Top and Bottom Feature Importances
----------------------------------

It may be more illuminating to the feature engineering process to identify the most or least informative features. To view only the N most informative features, specify the ``topn`` argument to the visualizer. Similar to slicing a ranked list by importance, if ``topn`` is a positive integer, the most highly ranked features are displayed; if ``topn`` is a negative integer, the lowest ranked features are displayed instead.

.. plot::
:context: close-figs
:alt: Coefficient importances for LASSO regression

from sklearn.linear_model import Lasso
from yellowbrick.datasets import load_concrete
from yellowbrick.model_selection import FeatureImportances

# Load the regression dataset
dataset = load_concrete(return_dataset=True)
X, y = dataset.to_data()

# Title case the feature for better display and create the visualizer
labels = list(map(lambda s: s.title(), dataset.meta['features']))
viz = FeatureImportances(Lasso(), labels=labels, relative=False, topn=3)

# Fit and show the feature importances
viz.fit(X, y)
viz.show()

Using ``topn=3``, we can identify the three most informative features in the concrete dataset as ``splast``, ``cement``, and ``water``. This approach to visualization may assist with *factor analysis* - the study of how variables contribute to an overall model. Note that although ``water`` has a negative coefficient, it is the magnitude (absolute value) of the feature that matters, since it is the strength of the negative correlation between ``water`` and the strength of the concrete that we are inspecting. Alternatively, ``topn=-3`` reveals the three least informative features in the model, as sketched below. This approach is useful for model tuning, similar to :doc:`rfecv`, but instead of automatically removing features, it allows you to identify the lowest-ranked features as they change across different model instantiations. In either case, if you have many features, using ``topn`` can significantly increase the visual and analytical capacity of your analysis.
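
Continuing the snippet above, the least informative view is a one-line change - a minimal sketch that reuses the ``X``, ``y``, and ``labels`` already defined::

# Reuses X, y, and labels from the previous snippet; a negative topn
# selects features from the bottom of the ranking
viz = FeatureImportances(Lasso(), labels=labels, relative=False, topn=-3)
viz.fit(X, y)
viz.show()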

The ``topn`` parameter can also be used when ``stack=True``. In the context of stacked feature importance graphs, the information of a feature is the width of the entire bar, or the sum of the absolute values of all coefficients contained therein.

.. plot::
:context: close-figs
:alt: Stacked per-class importances with Logistic Regression

from yellowbrick.model_selection import FeatureImportances
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

data = load_iris()
X, y = data.data, data.target

model = LogisticRegression(multi_class="auto", solver="liblinear")
viz = FeatureImportances(model, stack=True, relative=False, topn=-3)
viz.fit(X, y)
viz.show()
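
As a rough check of the bar-width interpretation above, the stacked importance of each feature can be recomputed from the fitted model - a sketch, assuming the per-class coefficients are the rows of ``model.coef_`` (an assumption based on scikit-learn's API, not stated in the docs)::

import numpy as np

# Total (stacked) bar width per feature: sum of absolute per-class coefficients
widths = np.abs(model.coef_).sum(axis=0)
print(dict(zip(data.feature_names, widths)))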

Discussion
----------
31 changes: 29 additions & 2 deletions docs/api/text/dispersion.rst
@@ -3,7 +3,10 @@
Dispersion Plot
===============

A word's importance can be weighed by its dispersion in a corpus. Lexical dispersion is a measure of a word's homogeneity across the parts of a corpus.

``DispersionPlot`` allows for visualization of the lexical dispersion of words in a corpus. This plot illustrates with vertical lines the occurrences of one or more search terms throughout the corpus, noting the offset in words of each occurrence from the beginning of the corpus.

================= ==============================
Visualizer :class:`~yellowbrick.text.dispersion.DispersionPlot`
@@ -33,6 +36,30 @@ Workflow Feature Engineering
visualizer.fit(text)
visualizer.show()
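
The earlier lines of this example are collapsed in the diff; for reference, here is a minimal end-to-end sketch of the basic plot, using the same hobbies corpus (the search terms below are illustrative assumptions, not quoted from the docs)::

from yellowbrick.text import DispersionPlot
from yellowbrick.datasets import load_hobbies

corpus = load_hobbies()
text = [doc.split() for doc in corpus.data]

# Illustrative search terms
target_words = ['game', 'player', 'score', 'money', 'win']

visualizer = DispersionPlot(target_words)
visualizer.fit(text)
visualizer.show()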

If the target vector of the corpus documents is provided, the points will be colored according to their document category, allowing additional analysis of how search term homogeneity varies within and across document categories.

.. plot::
:context: close-figs
:alt: Dispersion Plot with Classes

from yellowbrick.text import DispersionPlot
from yellowbrick.datasets import load_hobbies

corpus = load_hobbies()
text = [doc.split() for doc in corpus.data]
y = corpus.target

target_words = ['points', 'money', 'score', 'win', 'reduce']

visualizer = DispersionPlot(
target_words,
colormap="Accent",
title="Lexical Dispersion Plot, Broken Down by Class"
)
visualizer.fit(text, y)
visualizer.show()


Quick Method
------------

@@ -55,7 +82,7 @@ The same functionality above can be achieved with the associated quick method ``dispersion``
target_words = ['features', 'mobile', 'cooperative', 'competitive', 'combat', 'online']

# Create the visualizer and draw the plot
dispersion(target_words, text)
dispersion(target_words, text, colors=['olive'])
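
Since the diff shows only the changed call, here is a complete sketch of the quick method usage, assuming the same corpus preparation as the class-based examples above::

from yellowbrick.text import dispersion
from yellowbrick.datasets import load_hobbies

corpus = load_hobbies()
text = [doc.split() for doc in corpus.data]

target_words = ['features', 'mobile', 'cooperative', 'competitive', 'combat', 'online']

# The quick method builds, fits, and shows the DispersionPlot in one call;
# colors=['olive'] draws all occurrence markers in a single color
dispersion(target_words, text, colors=['olive'])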


API Reference
29 changes: 29 additions & 0 deletions docs/changelog.rst
@@ -3,6 +3,33 @@
Changelog
=========

Version 1.3
------------

* Tag: v1.3_
* Deployed Tuesday, February 9, 2021
* Current Contributors: Benjamin Bengfort, Rebecca Bilbro, Paul Johnson, Philippe Billet, Prema Roman, Patrick Deziel

This version primarily repairs the dependency issues we faced with scipy 1.6, scikit-learn 0.24 and Python 3.6 (or earlier). As part of the rapidly changing Python library landscape, we've been forced to react quickly to dependency changes, even where those libraries have been responsibly issuing future and deprecation warnings.

Major Changes:
- Implement new ``set_params`` and ``get_params`` on ModelVisualizers to ensure the wrapped estimator is correctly accessed via the new Estimator methods (see the sketch after this list).
- Freeze the test dependencies to prevent variability in CI (we must periodically review dependencies to ensure we're testing what our users are experiencing).
- Change the ``model`` param to ``estimator`` so that Visualizer arguments match their property names, allowing ``inspect`` to work with ``get_params`` and ``set_params`` and other scikit-learn utility functions.
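
A minimal sketch of the new behavior - the nested double-underscore access follows scikit-learn's parameter convention and is an assumption here, not quoted from the changelog::

from sklearn.linear_model import LogisticRegression
from yellowbrick.classifier import ROCAUC

oz = ROCAUC(LogisticRegression())

# The wrapped estimator is exposed under the ``estimator`` param, so
# sklearn-style nested parameter access works on the visualizer itself
params = oz.get_params()          # includes e.g. "estimator__C"
oz.set_params(estimator__C=10.0)  # updates the wrapped LogisticRegression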

Minor Changes:
- Import scikit-learn's private ``_safe_indexing`` API without error.
- Remove any calls to ``set_params`` in Visualizer ``__init__`` methods.
- Modify test fixtures and baseline images to accommodate the new scikit-learn implementation.
- Set the numpy dependency to be less than 1.20 because it causes pickle issues with joblib and umap.
- Add the ``shuffle=True`` argument to any CV class that uses a random seed (illustrated after this list).
- Set our CI matrix to Python and Miniconda 3.7 and 3.8.
- Correction in the README regarding the ModelVisualizer API.
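
The ``shuffle=True`` change, illustrated - a sketch; the choice of splitter is an assumption::

from sklearn.model_selection import KFold

# Newer scikit-learn raises an error when random_state is set while
# shuffle is False, so seeded CV splitters now pass shuffle explicitly
cv = KFold(n_splits=5, shuffle=True, random_state=42)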


.. _v1.3: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.3


Hotfix 1.2.1
------------

@@ -12,6 +39,8 @@ Hotfix 1.2.1

On December 22, 2020, scikit-learn released version 0.24, which deprecated the external use of scikit-learn's internal utilities such as ``safe_indexing``. Unfortunately, Yellowbrick depends on a few of these utilities, so we must refactor our internal code base to port this functionality or work around it. To ensure that Yellowbrick continues to work when installed via ``pip``, we have temporarily changed our scikit-learn dependency to be less than 0.24. We will update our dependencies in the v1.3 release when we have made the associated fixes.

.. _v1.2.1: https://github.com/DistrictDataLabs/yellowbrick/releases/tag/v1.2.1


Version 1.2
-----------
2 changes: 2 additions & 0 deletions docs/governance/index.rst
@@ -218,3 +218,5 @@ Board of Advisors Minutes
minutes/2019-05-15.rst
minutes/2019-09-09.rst
minutes/2020-01-07.rst
minutes/2020-05-13.rst
minutes/2020-10-06.rst