Binfork #38
Merged 13 commits on Mar 28, 2023
3 changes: 2 additions & 1 deletion .github/workflows/check-changelog.yml
@@ -10,12 +10,13 @@ jobs:
  check:
    name: A reviewer will let you know if it is required or can be bypassed
    runs-on: ubuntu-latest
    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 }} && github.repository == 'scikit-learn/scikit-learn'
    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 && github.repository == 'scikit-learn/scikit-learn' }}
    steps:
      - name: Get PR number and milestone
        run: |
          echo "PR_NUMBER=${{ github.event.pull_request.number }}" >> $GITHUB_ENV
          echo "TAGGED_MILESTONE=${{ github.event.pull_request.milestone.title }}" >> $GITHUB_ENV
          echo "${{ github.repository }}"
      - uses: actions/checkout@v3
        with:
          fetch-depth: '0'
30 changes: 30 additions & 0 deletions README.rst
@@ -232,12 +232,42 @@ Python API:
users from generalizing the ``Criterion`` and ``Splitter`` and creating a neat Python API wrapper.
Moreover, the ``Tree`` class is not customizable.
- Our fix: We internally implement a private function to actually build the entire tree, ``BaseDecisionTree._build_tree``, which can be overridden in subclasses that customize the criterion, splitter, or tree, or any combination of them (see the sketch following this list).
- ``sklearn.ensemble.BaseForest`` and its subclass algorithms are slow when ``n_samples`` is very high. Binning
features into a histogram, which is the basis of "LightGBM" and "HistGradientBoostingClassifier", is a computational
trick that can both significantly improve runtime efficiency and help prevent overfitting in trees, since
the sorting in "BestSplitter" is done on bins rather than on the continuous feature values. This would enable
random forests and their variants to scale to millions of samples.
- Our fix: We added a ``max_bins=None`` keyword argument to the ``BaseForest`` class and all its subclasses. The default behavior is no binning. The current implementation is not necessarily efficient; there are several improvements to be made (see the roadmap below). A usage sketch follows this list.
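
A minimal sketch of the ``_build_tree`` override point described above; the method name
comes from this fork, but the exact signature shown below is an assumption for
illustration only::

    from sklearn.tree import DecisionTreeClassifier


    class CustomSplitTree(DecisionTreeClassifier):
        """Hypothetical subclass that swaps in a custom criterion/splitter/tree."""

        def _build_tree(self, X, y, sample_weight, min_samples_leaf, min_weight_leaf,
                        max_leaf_nodes, min_samples_split, max_depth, random_state):
            # Construct a custom Criterion/Splitter/Tree here, then delegate to
            # (or replace) the parent implementation.
            return super()._build_tree(
                X, y, sample_weight, min_samples_leaf, min_weight_leaf,
                max_leaf_nodes, min_samples_split, max_depth, random_state,
            )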

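A minimal usage sketch of the binning option, assuming the fork's
:class:`~sklearn.ensemble.RandomForestClassifier` forwards the new ``max_bins``
keyword from ``BaseForest``::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    # max_bins is the keyword added by this fork; the default max_bins=None
    # disables binning and reproduces stock scikit-learn behavior.
    clf = RandomForestClassifier(n_estimators=100, max_bins=255, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))
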
Overall, the existing tree models, such as :class:`~sklearn.tree.DecisionTreeClassifier`
and :class:`~sklearn.ensemble.RandomForestClassifier`, work exactly the same as they
would in ``scikit-learn`` main, but these extensions enable third-party packages to extend
the Cython/Python API easily.

Roadmap
-------
There are several improvements that can be made in this fork. Primarily, the binning feature
promises to make Random Forests and their variants ultra-fast. However, the binning needs
to be implemented in a similar fashion to ``HistGradientBoostingClassifier``, which passes
the binning thresholds through the tree construction step, so that split nodes
store the actual numerical value of the bin boundary rather than the bin index. This requires
modifying the tree Cython code to take in a ``binning_thresholds`` parameter that is part
of the ``_BinMapper`` fitted class. It also means no binning is needed at prediction/apply
time, because the tree already stores the numerical threshold that should be applied
to any incoming ``X`` that is not binned.
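
The mapping itself is simple. A self-contained sketch (plain NumPy, with quantile-based
thresholds standing in for what a ``_BinMapper``-style object computes): a split stored as
bin index ``k`` on feature ``j`` corresponds to the numerical threshold
``binning_thresholds[j][k]``, so unbinned data can be routed without re-binning::

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 3)).astype(np.float32)
    n_bins = 255

    # One array of (n_bins - 1) thresholds per feature, e.g. from quantiles.
    binning_thresholds = [
        np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))[1:-1]
        for j in range(X.shape[1])
    ]

    # Bin the data: X_binned[i, j] is the bin index of X[i, j].
    X_binned = np.stack(
        [np.searchsorted(binning_thresholds[j], X[:, j]) for j in range(X.shape[1])],
        axis=1,
    ).astype(np.uint8)

    # A split "feature j, bin index k" learned on X_binned maps back to a
    # numerical threshold that can be applied directly to unbinned X.
    j, k = 1, 100
    assert np.array_equal(
        X_binned[:, j] <= k, X[:, j] <= binning_thresholds[j][k]
    )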

Besides that modification, the tree and splitter need to handle not just ``np.float32``
data (the usual dtype of X in Random Forests) but also ``uint8`` data (the dtype of X when it
is binned into e.g. 255 bins). This would not only save RAM, since ``uint8`` storage of millions
of samples saves many gigabytes relative to ``float32``, but also improve runtime.
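
Rough arithmetic on the RAM claim, assuming 10 million samples and 100 features::

    import numpy as np

    n_samples, n_features = 10_000_000, 100
    gb_float32 = n_samples * n_features * np.dtype(np.float32).itemsize / 1e9  # 4.0 GB
    gb_uint8 = n_samples * n_features * np.dtype(np.uint8).itemsize / 1e9      # 1.0 GB
    print(gb_float32, gb_uint8)  # binning saves roughly 3 GB on this dataset alone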

In summary, the Cython code of the tree submodule needs to take an extra parameter for
the binning thresholds when binning occurs, and it must be able to handle ``X`` of dtype ``uint8``.
Once both are in place, Random Forests will fully leverage the binning feature.

Something to keep in mind is that upstream scikit-learn is actively working on incorporating
missing-value handling and categorical handling into Random Forests.

Next steps
----------

38 changes: 0 additions & 38 deletions asv_benchmarks/benchmarks/ensemble.py
@@ -2,7 +2,6 @@
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    ObliqueRandomForestClassifier,
)

from .common import Benchmark, Estimator, Predictor
@@ -14,43 +13,6 @@
from .utils import make_gen_classif_scorers


class ObliqueRandomForestClassifierBenchmark(Predictor, Estimator, Benchmark):
    """
    Benchmarks for RandomForestClassifier.
    """

    param_names = ["representation", "n_jobs"]
    params = (["dense"], Benchmark.n_jobs_vals)

    def setup_cache(self):
        super().setup_cache()

    def make_data(self, params):
        representation, n_jobs = params

        data = _20newsgroups_lowdim_dataset()

        return data

    def make_estimator(self, params):
        representation, n_jobs = params

        n_estimators = 500 if Benchmark.data_size == "large" else 100

        estimator = ObliqueRandomForestClassifier(
            n_estimators=n_estimators,
            min_samples_split=10,
            max_features="log2",
            n_jobs=n_jobs,
            random_state=0,
        )

        return estimator

    def make_scorers(self):
        make_gen_classif_scorers(self)


class RandomForestClassifierBenchmark(Predictor, Estimator, Benchmark):
    """
    Benchmarks for RandomForestClassifier.
21 changes: 0 additions & 21 deletions doc/modules/ensemble.rst
@@ -195,27 +195,6 @@ in bias::
:align: center
:scale: 75%

Oblique Random Forests
----------------------

In oblique random forests (see :class:`ObliqueRandomForestClassifier` and
:class:`ObliqueRandomForestRegressor` classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set. The oblique random forest is the same as a random forest,
except in how the splits are computed in each tree.

Similar to how random forests achieve a reduced variance by combining diverse trees,
sometimes at the cost of a slight increase in bias, oblique random forests aim to do the same.
They are motivated to construct even more diverse trees, thereby improving model generalization.
In practice the variance reduction is often significant hence yielding an overall better model.

In contrast to the original publication [B2001]_, the scikit-learn
implementation allows the user to control the number of features to combine in computing
candidate splits. This is done via the ``feature_combinations`` parameter. For
more information and intuition, see
:ref:`documentation on oblique decision trees <oblique_trees>`.


.. _random_forest_parameters:

Parameters
87 changes: 0 additions & 87 deletions doc/modules/tree.rst
@@ -614,49 +614,6 @@ be pruned. This process stops when the pruned tree's minimal

* :ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py`

.. _oblique_trees:

Oblique Trees
=============

Similar to DTs, **Oblique Trees (OTs)** are a non-parametric supervised learning
method used for :ref:`classification <tree_classification>` and :ref:`regression
<tree_regression>`. It was originally described as ``Forest-RC`` in Breiman's
landmark paper on Random Forests [RF]_. Breiman found that combining data features
empirically outperforms DTs on a variety of data sets.

The algorithm implemented in scikit-learn differs from ``Forest-RC`` in that
it allows the user to specify the number of variables to combine to consider
as a split, :math:`\lambda`. If :math:`\lambda` is set to ``n_features``, then
it is equivalent to ``Forest-RC``. :math:`\lambda` presents a tradeoff between
considering dense combinations of features vs sparse combinations of features.

Differences compared to decision trees
--------------------------------------

Compared to DTs, OTs differ in how they compute a candidate split. DTs split
along the passed in data columns in an axis-aligned fashion, whereas OTs split
along oblique curves. Using the Iris dataset, we can similarly construct an OT
as follows:

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.ObliqueDecisionTreeClassifier()
>>> clf = clf.fit(X, y)

.. figure:: ../auto_examples/tree/images/sphx_glr_plot_iris_dtc_002.png
   :target: ../auto_examples/tree/plot_iris_dtc.html
   :scale: 75
   :align: center

Another major difference to DTs is that OTs can by definition sample more candidate
splits. The parameter ``max_features`` controls how many splits to sample at each
node. For DTs "max_features" is constrained to be at most "n_features" by default,
whereas OTs can sample possibly up to :math:`2^{n_{features}}` candidate splits
because they are combining features.

Classification, regression and multi-output problems
----------------------------------------------------

@@ -709,50 +666,6 @@ optimization (e.g. `GridSearchCV`). If one has prior knowledge about how the dat
distributed along its features, such as data being axis-aligned, then one might use a DT.
Other considerations are runtime and space complexity.

Mathematical formulation
------------------------

Given training vectors :math:`x_i \in R^n`, i=1,..., l and a label vector
:math:`y \in R^l`, an oblique decision tree recursively partitions the
feature space such that the samples with the same labels or similar target
values are grouped together. Normal decision trees partition the feature space
in an axis-aligned manner splitting along orthogonal axes based on the dimensions
(columns) of :math:`x_i`. In oblique trees, nodes sample a random projection vector,
:math:`a_i \in R^n`, where the inner-product of :math:`\langle a_i, x_i \rangle`
is a candidate split value. The entries of :math:`a_i` have values
+/- 1 with probability :math:`\lambda / n` with the rest being 0s.

Let the data at node :math:`m` be represented by :math:`Q_m` with :math:`n_m`
samples. For each candidate split :math:`\theta = (a_i, t_m)` consisting of a
(possibly sparse) vector :math:`a_i` and threshold :math:`t_m`, partition the
data into :math:`Q_m^{left}(\theta)` and :math:`Q_m^{right}(\theta)` subsets

.. math::

    Q_m^{left}(\theta) = \{(x, y) | a_i^T x \leq t_m\}

    Q_m^{right}(\theta) = Q_m \setminus Q_m^{left}(\theta)

Note that this formulation is a generalization of decision trees, where
:math:`a_i = e_i`, a standard basis vector with a "1" at index "i" and "0"
elsewhere.

The quality of a candidate split of node :math:`m` is then computed using an
impurity function or loss function :math:`H()`, in the same exact manner as
decision trees.
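
A small NumPy sketch of the projection step defined above (illustrative only; the
actual sampling is implemented in the Cython splitter)::

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, lam = 10, 3  # lambda = expected number of features combined

    # Entries of a_i are +1 or -1 with probability lambda / n_features, else 0.
    nonzero = rng.random(n_features) < lam / n_features
    signs = rng.choice([-1.0, 1.0], size=n_features)
    a = np.where(nonzero, signs, 0.0)

    X = rng.normal(size=(100, n_features))
    projections = X @ a  # <a_i, x_i> for every sample; a threshold t_m on this
                         # 1-D projection defines the candidate split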

Limitations compared to decision trees
--------------------------------------

* There currently does not exist support for pruning OTs, such as with the minimal
cost-complexity pruning algorithm.

* Moreover, OTs do not have built-in support for missing data, so the recommendation
by scikit-learn is for users to first impute, or drop their missing data if they
would like to use OTs.

* Currently, OTs also do not support sparse inputs for data matrices and labels.

.. topic:: References:

.. [BRE] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification
5 changes: 0 additions & 5 deletions doc/whats_new/v1.2.rst
@@ -950,11 +950,6 @@ Changelog
:mod:`sklearn.tree`
...................

- |MajorFeature| Add oblique decision trees and forests for classification
with :class:`tree.ObliqueDecisionTreeClassifier` and
:class:`ensemble.ObliqueRandomForestClassifier`. :pr:`22754` by
`Adam Li <adam2392>`.

- |Enhancement| :func:`tree.plot_tree`, :func:`tree.export_graphviz` now uses
a lower case `x[i]` to represent feature `i`. :pr:`23480` by `Thomas Fan`_.
