Binfork #38
Merged 13 commits on Mar 28, 2023
3 changes: 2 additions & 1 deletion .github/workflows/check-changelog.yml
@@ -10,12 +10,13 @@ jobs:
  check:
    name: A reviewer will let you know if it is required or can be bypassed
    runs-on: ubuntu-latest
    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 }} && github.repository == 'scikit-learn/scikit-learn'
    if: ${{ contains(github.event.pull_request.labels.*.name, 'No Changelog Needed') == 0 && github.repository == 'scikit-learn/scikit-learn' }}
    steps:
      - name: Get PR number and milestone
        run: |
          echo "PR_NUMBER=${{ github.event.pull_request.number }}" >> $GITHUB_ENV
          echo "TAGGED_MILESTONE=${{ github.event.pull_request.milestone.title }}" >> $GITHUB_ENV
          echo "${{ github.repository }}"
      - uses: actions/checkout@v3
        with:
          fetch-depth: '0'
30 changes: 30 additions & 0 deletions README.rst
@@ -232,12 +232,42 @@ Python API:
users from generalizing the ``Criterion`` and ``Splitter`` and creating a neat Python API wrapper.
Moreover, the ``Tree`` class is not customizable.
- Our fix: We internally implement a private function to actually build the entire tree, ``BaseDecisionTree._build_tree``, which can be overridden in subclasses that customize the criterion, splitter, or tree, or any combination of them (see the sketch following this list).
- ``sklearn.ensemble.BaseForest`` and its subclass algorithms are slow when ``n_samples`` is very high. Binning
features into a histogram, which is the basis of "LightGBM" and "HistGradientBoostingClassifier", is a computational
trick that can both significantly improve runtime efficiency and help prevent overfitting in trees, since
the sorting in "BestSplitter" is done on bins rather than on the continuous feature values. This would enable
random forests and their variants to scale to millions of samples.
- Our fix: We added a ``max_bins=None`` keyword argument to the ``BaseForest`` class and all its subclasses. The default behavior is no binning. The current implementation is not necessarily efficient; there are several improvements to be made (see the roadmap below). A usage sketch follows this list.
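
A minimal sketch of the ``_build_tree`` override point described above; the method name
comes from this fork, but the exact signature shown below is an assumption for
illustration only::

    from sklearn.tree import DecisionTreeClassifier


    class CustomSplitTree(DecisionTreeClassifier):
        """Hypothetical subclass that swaps in a custom criterion/splitter/tree."""

        def _build_tree(self, X, y, sample_weight, min_samples_leaf, min_weight_leaf,
                        max_leaf_nodes, min_samples_split, max_depth, random_state):
            # Construct a custom Criterion/Splitter/Tree here, then delegate to
            # (or replace) the parent implementation.
            return super()._build_tree(
                X, y, sample_weight, min_samples_leaf, min_weight_leaf,
                max_leaf_nodes, min_samples_split, max_depth, random_state,
            )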

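A minimal usage sketch of the binning option, assuming the fork's
:class:`~sklearn.ensemble.RandomForestClassifier` forwards the new ``max_bins``
keyword from ``BaseForest``::

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier

    X, y = make_classification(n_samples=100_000, n_features=20, random_state=0)

    # max_bins is the keyword added by this fork; the default max_bins=None
    # disables binning and reproduces stock scikit-learn behavior.
    clf = RandomForestClassifier(n_estimators=100, max_bins=255, random_state=0)
    clf.fit(X, y)
    print(clf.score(X, y))
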
Overall, the existing tree models, such as :class:`~sklearn.tree.DecisionTreeClassifier`
and :class:`~sklearn.ensemble.RandomForestClassifier`, work exactly the same as they
would in ``scikit-learn`` main, but these extensions enable third-party packages to extend
the Cython/Python API easily.

Roadmap
-------
There are several improvements that can be made in this fork. Primarily, the binning feature
promises to make Random Forests and their variants ultra-fast. However, the binning needs
to be implemented in a similar fashion to ``HistGradientBoostingClassifier``, which passes
the binning thresholds through the tree construction step, so that split nodes
store the actual numerical value of the bin boundary rather than the bin index. This requires
modifying the tree Cython code to take in a ``binning_thresholds`` parameter that is part
of the ``_BinMapper`` fitted class. It also means no binning is needed at prediction/apply
time, because the tree already stores the numerical threshold that should be applied
to any incoming ``X`` that is not binned.
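
The mapping itself is simple. A self-contained sketch (plain NumPy, with quantile-based
thresholds standing in for what a ``_BinMapper``-style object computes): a split stored as
bin index ``k`` on feature ``j`` corresponds to the numerical threshold
``binning_thresholds[j][k]``, so unbinned data can be routed without re-binning::

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(1_000, 3)).astype(np.float32)
    n_bins = 255

    # One array of (n_bins - 1) thresholds per feature, e.g. from quantiles.
    binning_thresholds = [
        np.quantile(X[:, j], np.linspace(0, 1, n_bins + 1))[1:-1]
        for j in range(X.shape[1])
    ]

    # Bin the data: X_binned[i, j] is the bin index of X[i, j].
    X_binned = np.stack(
        [np.searchsorted(binning_thresholds[j], X[:, j]) for j in range(X.shape[1])],
        axis=1,
    ).astype(np.uint8)

    # A split "feature j, bin index k" learned on X_binned maps back to a
    # numerical threshold that can be applied directly to unbinned X.
    j, k = 1, 100
    assert np.array_equal(
        X_binned[:, j] <= k, X[:, j] <= binning_thresholds[j][k]
    )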

Besides that modification, the tree and splitter need to handle not just ``np.float32``
data (the usual dtype of X in Random Forests) but also ``uint8`` data (the dtype of X when it
is binned into e.g. 255 bins). This would not only save RAM, since ``uint8`` storage of millions
of samples saves many gigabytes relative to ``float32``, but also improve runtime.
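
Rough arithmetic on the RAM claim, assuming 10 million samples and 100 features::

    import numpy as np

    n_samples, n_features = 10_000_000, 100
    gb_float32 = n_samples * n_features * np.dtype(np.float32).itemsize / 1e9  # 4.0 GB
    gb_uint8 = n_samples * n_features * np.dtype(np.uint8).itemsize / 1e9      # 1.0 GB
    print(gb_float32, gb_uint8)  # binning saves roughly 3 GB on this dataset alone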

In summary, the Cython code of the tree submodule needs to take an extra parameter for
the binning thresholds when binning occurs, and it must be able to handle ``X`` of dtype ``uint8``.
Once both are in place, Random Forests will fully leverage the binning feature.

Something to keep in mind is that upstream scikit-learn is actively working on incorporating
missing-value handling and categorical handling into Random Forests.

Next steps
----------

38 changes: 0 additions & 38 deletions asv_benchmarks/benchmarks/ensemble.py
@@ -2,7 +2,6 @@
    RandomForestClassifier,
    GradientBoostingClassifier,
    HistGradientBoostingClassifier,
    ObliqueRandomForestClassifier,
)

from .common import Benchmark, Estimator, Predictor
@@ -14,43 +13,6 @@
from .utils import make_gen_classif_scorers


class ObliqueRandomForestClassifierBenchmark(Predictor, Estimator, Benchmark):
    """
    Benchmarks for RandomForestClassifier.
    """

    param_names = ["representation", "n_jobs"]
    params = (["dense"], Benchmark.n_jobs_vals)

    def setup_cache(self):
        super().setup_cache()

    def make_data(self, params):
        representation, n_jobs = params

        data = _20newsgroups_lowdim_dataset()

        return data

    def make_estimator(self, params):
        representation, n_jobs = params

        n_estimators = 500 if Benchmark.data_size == "large" else 100

        estimator = ObliqueRandomForestClassifier(
            n_estimators=n_estimators,
            min_samples_split=10,
            max_features="log2",
            n_jobs=n_jobs,
            random_state=0,
        )

        return estimator

    def make_scorers(self):
        make_gen_classif_scorers(self)


class RandomForestClassifierBenchmark(Predictor, Estimator, Benchmark):
    """
    Benchmarks for RandomForestClassifier.
21 changes: 0 additions & 21 deletions doc/modules/ensemble.rst
@@ -195,27 +195,6 @@ in bias::
:align: center
:scale: 75%

Oblique Random Forests
----------------------

In oblique random forests (see :class:`ObliqueRandomForestClassifier` and
:class:`ObliqueRandomForestRegressor` classes), each tree in the ensemble is built
from a sample drawn with replacement (i.e., a bootstrap sample) from the
training set. The oblique random forest is the same as a random forest,
except in how the splits are computed in each tree.

Similar to how random forests achieve a reduced variance by combining diverse trees,
sometimes at the cost of a slight increase in bias, oblique random forests aim to do the same.
They are motivated to construct even more diverse trees, thereby improving model generalization.
In practice the variance reduction is often significant hence yielding an overall better model.

In contrast to the original publication [B2001]_, the scikit-learn
implementation allows the user to control the number of features to combine in computing
candidate splits. This is done via the ``feature_combinations`` parameter. For
more information and intuition, see
:ref:`documentation on oblique decision trees <oblique_trees>`.


.. _random_forest_parameters:

Parameters
87 changes: 0 additions & 87 deletions doc/modules/tree.rst
@@ -614,49 +614,6 @@ be pruned. This process stops when the pruned tree's minimal

* :ref:`sphx_glr_auto_examples_tree_plot_cost_complexity_pruning.py`

.. _oblique_trees:

Oblique Trees
=============

Similar to DTs, **Oblique Trees (OTs)** are a non-parametric supervised learning
method used for :ref:`classification <tree_classification>` and :ref:`regression
<tree_regression>`. It was originally described as ``Forest-RC`` in Breiman's
landmark paper on Random Forests [RF]_. Breiman found that combining data features
empirically outperforms DTs on a variety of data sets.

The algorithm implemented in scikit-learn differs from ``Forest-RC`` in that
it allows the user to specify the number of variables to combine to consider
as a split, :math:`\lambda`. If :math:`\lambda` is set to ``n_features``, then
it is equivalent to ``Forest-RC``. :math:`\lambda` presents a tradeoff between
considering dense combinations of features vs sparse combinations of features.

Differences compared to decision trees
--------------------------------------

Compared to DTs, OTs differ in how they compute a candidate split. DTs split
along the passed in data columns in an axis-aligned fashion, whereas OTs split
along oblique curves. Using the Iris dataset, we can similarly construct an OT
as follows:

>>> from sklearn.datasets import load_iris
>>> from sklearn import tree
>>> iris = load_iris()
>>> X, y = iris.data, iris.target
>>> clf = tree.ObliqueDecisionTreeClassifier()
>>> clf = clf.fit(X, y)

.. figure:: ../auto_examples/tree/images/sphx_glr_plot_iris_dtc_002.png
   :target: ../auto_examples/tree/plot_iris_dtc.html
   :scale: 75
   :align: center

Another major difference to DTs is that OTs can by definition sample more candidate
splits. The parameter ``max_features`` controls how many splits to sample at each
node. For DTs "max_features" is constrained to be at most "n_features" by default,
whereas OTs can sample possibly up to :math:`2^{n_{features}}` candidate splits
because they are combining features.

Classification, regression and multi-output problems
----------------------------------------------------

@@ -709,50 +666,6 @@ optimization (e.g. `GridSearchCV`). If one has prior knowledge about how the dat
distributed along its features, such as data being axis-aligned, then one might use a DT.
Other considerations are runtime and space complexity.

Mathematical formulation
------------------------

Given training vectors :math:`x_i \in R^n`, i=1,..., l and a label vector
:math:`y \in R^l`, an oblique decision tree recursively partitions the
feature space such that the samples with the same labels or similar target
values are grouped together. Normal decision trees partition the feature space
in an axis-aligned manner splitting along orthogonal axes based on the dimensions
(columns) of :math:`x_i`. In oblique trees, nodes sample a random projection vector,
:math:`a_i \in R^n`, where the inner-product of :math:`\langle a_i, x_i \rangle`
is a candidate split value. The entries of :math:`a_i` have values
+/- 1 with probability :math:`\lambda / n` with the rest being 0s.

Let the data at node :math:`m` be represented by :math:`Q_m` with :math:`n_m`
samples. For each candidate split :math:`\theta = (a_i, t_m)` consisting of a
(possibly sparse) vector :math:`a_i` and threshold :math:`t_m`, partition the
data into :math:`Q_m^{left}(\theta)` and :math:`Q_m^{right}(\theta)` subsets

.. math::

    Q_m^{left}(\theta) = \{(x, y) | a_i^T x \leq t_m\}

    Q_m^{right}(\theta) = Q_m \setminus Q_m^{left}(\theta)

Note that this formulation is a generalization of decision trees, where
:math:`a_i = e_i`, a standard basis vector with a "1" at index "i" and "0"
elsewhere.

The quality of a candidate split of node :math:`m` is then computed using an
impurity function or loss function :math:`H()`, in the same exact manner as
decision trees.
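
A small NumPy sketch of the projection step defined above (illustrative only; the
actual sampling is implemented in the Cython splitter)::

    import numpy as np

    rng = np.random.default_rng(0)
    n_features, lam = 10, 3  # lambda = expected number of features combined

    # Entries of a_i are +1 or -1 with probability lambda / n_features, else 0.
    nonzero = rng.random(n_features) < lam / n_features
    signs = rng.choice([-1.0, 1.0], size=n_features)
    a = np.where(nonzero, signs, 0.0)

    X = rng.normal(size=(100, n_features))
    projections = X @ a  # <a_i, x_i> for every sample; a threshold t_m on this
                         # 1-D projection defines the candidate split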

Limitations compared to decision trees
--------------------------------------

* There currently does not exist support for pruning OTs, such as with the minimal
cost-complexity pruning algorithm.

* Moreover, OTs do not have built-in support for missing data, so the recommendation
by scikit-learn is for users to first impute, or drop their missing data if they
would like to use OTs.

* Currently, OTs also do not support sparse inputs for data matrices and labels.

.. topic:: References:

.. [BRE] L. Breiman, J. Friedman, R. Olshen, and C. Stone. Classification
5 changes: 0 additions & 5 deletions doc/whats_new/v1.2.rst
@@ -950,11 +950,6 @@ Changelog
:mod:`sklearn.tree`
...................

- |MajorFeature| Add oblique decision trees and forests for classification
with :class:`tree.ObliqueDecisionTreeClassifier` and
:class:`ensemble.ObliqueRandomForestClassifier`. :pr:`22754` by
`Adam Li <adam2392>`.

- |Enhancement| :func:`tree.plot_tree`, :func:`tree.export_graphviz` now uses
a lower case `x[i]` to represent feature `i`. :pr:`23480` by `Thomas Fan`_.
