ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67

SamuelCarliles3 · 2024-05-30T23:52:33Z

Reference Issues/PRs

What does this implement/fix? Explain your changes.

Adds splitter injection and refactors DepthFirstTreeBuilder.build.

Any other comments?

github-actions · 2024-05-30T23:53:45Z

❌ Linting issues

This PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling pre-commit hooks. Instructions to enable them can be found here.

You can see the details of the linting issues under the lint job here

`cython-lint`

cython-lint detected issues. Please fix them locally and push the changes. Here you can see the detected issues. Note that the installed cython-lint version is cython-lint=0.16.2.


/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:22:26: 'uintptr_t' imported but unused
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:23:33: 'free' imported but unused
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:76:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:99:24: E261 at least two spaces before inline comment
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:120:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:148:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:152:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:176:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:180:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:206:53: E703 statement ends with a semicolon
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:292:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:376:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:380:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:385:5: E303 too many blank lines (2)
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:806:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:812:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:835:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:838:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_splitter.pyx:842:1: W293 blank line contains whitespace
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:273:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:274:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:275:25: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:294:29: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:295:29: E128 continuation line under-indented for visual indent
/home/runner/work/scikit-learn/scikit-learn/sklearn/tree/_tree.pyx:388:1: W293 blank line contains whitespace

Generated for commit: f225658. Link to the linter CI: here

PSSF23

Refactoring of streaming code looks good to me. But these two errors occurred in checks:

FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[squared_error] - AssertionError
FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[friedman_mse] - AssertionError

adam2392

I've left a few questions and requests for explanation, or improved documentation. A few general comments:

Can you copy/paste results of benchmarking into the PR description somewhere, so we can have this documented?
Can we describe in the PR what is being changed and why? If any of this explanation would also be useful as in-line comments, feel free to add it there.

adam2392 · 2024-06-10T14:38:59Z

sklearn/tree/_tree.pxd

+# A record on the stack for depth-first tree growing
+cdef struct StackRecord:
+    intp_t start
+    intp_t end
+    intp_t depth
+    intp_t parent
+    bint is_left
+    float64_t impurity
+    intp_t n_constant_features
+    float64_t lower_bound
+    float64_t upper_bound


Is this moved for any particular reason? Just so I'm aware.

It is required by BuildEnv, which is defined in this file.

adam2392 · 2024-06-10T14:39:36Z

sklearn/tree/_tree.pxd

+cdef extern from "<stack>" namespace "std" nogil:
+    cdef cppclass stack[T]:
+        ctypedef T value_type
+        stack() except +
+        bint empty()
+        void pop()
+        void push(T&) except +  # Raise c++ exception for bad_alloc -> MemoryError
+        T& top()


I think this can be cimported from Cython directly

xref: https://github.com/cython/cython/blob/aa5e8668a9ef3dc047c305fa4971129849d0ab19/Cython/Includes/libcpp/stack.pxd

scikit-learn#29228

This bit of code was moved from _tree.pyx so that BuildEnv would have the definition of stack. If the definition can be simplified, we should do that.

adam2392 · 2024-06-10T14:42:34Z

sklearn/tree/_tree.pxd

+        void push(T&) except +  # Raise c++ exception for bad_alloc -> MemoryError
+        T& top()
+
+cdef struct BuildEnv:


Is there any technical reason we need the struct here besides encapsulation of all the parameters?

Here as opposed to _tree.pyx? I put BuildEnv here because it will likely need to be visible to event handlers slated for addition, and _tree.pxd seemed like the place to put the "interface" to the module. It can go anywhere that can be made available to external event handlers.

Sorry I meant, why do we need the BuildEnv in the first place.

Or do you mean why is the struct required at all? It is just to encapsulate the function state; we could just as easily pass all the variables/references in as function args, but this seemed cleaner and definitely more expedient during development. More importantly, it's a pattern that will be required for the forthcoming tree build event handling, and might need to be added to the splitter injection.

In general we have this algorithm which remains broadly the same shape but with an arbitrary (and growing) number of variations we'd like to add without foreknowledge and without perpetual updates to the algorithm code itself. Those future additions will require differing degrees of visibility into the algorithm state. So IMO it seemed cleanest (and most performant) to simply encapsulate the algorithm state in a struct whose address we can pass around.
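As a rough illustration of the pattern described above, here is a plain-Python sketch (not the PR's Cython) of a depth-first build loop whose whole state lives in one object that can be handed to external hooks. `BuildEnv` and `StackRecord` echo names from the diff, but the fields chosen here, the `build` function, the `handlers` argument, and the midpoint "split" are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StackRecord:
    # A subset of the StackRecord fields shown in the diff.
    start: int
    end: int
    depth: int


@dataclass
class BuildEnv:
    # Hypothetical fields: everything the build loop reads or writes.
    max_depth: int
    min_samples_split: int
    stack: List[StackRecord] = field(default_factory=list)
    n_nodes: int = 0
    n_leaves: int = 0


Handler = Callable[[BuildEnv, StackRecord], None]


def build(env: BuildEnv, n_samples: int, handlers: List[Handler]) -> None:
    """Depth-first growth; handlers see the full state through env."""
    env.stack.append(StackRecord(0, n_samples, 0))
    while env.stack:
        rec = env.stack.pop()
        env.n_nodes += 1
        for handler in handlers:
            handler(env, rec)
        n = rec.end - rec.start
        if rec.depth >= env.max_depth or n < env.min_samples_split:
            env.n_leaves += 1
            continue
        mid = rec.start + n // 2  # stand-in for a real split search
        # Push right first so the left child is processed next.
        env.stack.append(StackRecord(mid, rec.end, rec.depth + 1))
        env.stack.append(StackRecord(rec.start, mid, rec.depth + 1))


visited = []
env = BuildEnv(max_depth=2, min_samples_split=2)
build(env, 8, [lambda e, r: visited.append((r.start, r.end))])
```

The point of the sketch is only that new hooks can be added without changing the signature of `build`, since each hook receives the same `env` object.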

adam2392 · 2024-06-10T14:53:52Z

sklearn/tree/_splitter.pxd

+# NICE IDEAS THAT DON'T APPEAR POSSIBLE
+# - accessing elements of a memory view of cython extension types in a nogil block/function
+# - storing cython extension types in cpp vectors


Suggested change

-# NICE IDEAS THAT DON'T APPEAR POSSIBLE
-# - accessing elements of a memory view of cython extension types in a nogil block/function
-# - storing cython extension types in cpp vectors
+# NICE IDEAS THAT DON'T APPEAR POSSIBLE (Samuel)
+# 1. accessing elements of a memory view of cython extension types in a nogil block/function
+# 2. storing cython extension types in cpp vectors

It would be great to also comment on what these nice ideas are trying to accomplish. I.e. what's the problem for a new developer coming in and reading this?

Here we're simply trying to add a way of injecting functionality whose implementation details are TBD. We just want a way of saying "here's a candidate split, let me check it against any arbitrary validity constraints you may want to impose at some future date as of the time of this writing". So we want to accept a list, a memoryview, array, vector, whatever, of instantiated split constraints. Ideally the interface is a simple python one-liner, so at runtime I can just define an inline python list of constraints. But that list of constraints then needs to be executable performantly in a cython nogil block.

I understand. Just hoping to document all the thoughts in a clean manner, so we don't lose these trains of thought when new developers come through.
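For readers new to the thread, the "simple python one-liner" interface being discussed could look roughly like the following plain-Python sketch. It only shows the desired call shape; it does not address the hard part raised above, which is running such a list efficiently inside a Cython nogil block. `CandidateSplit`, `min_samples_leaf`, and `check_conditions` are hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class CandidateSplit:
    # Hypothetical minimal description of a candidate split.
    feature: int
    threshold: float
    n_left: int
    n_right: int


SplitCondition = Callable[[CandidateSplit], bool]


def min_samples_leaf(min_samples: int) -> SplitCondition:
    """Build a condition that rejects splits with a too-small child."""
    def check(split: CandidateSplit) -> bool:
        return split.n_left >= min_samples and split.n_right >= min_samples
    return check


def check_conditions(split: CandidateSplit,
                     conditions: Sequence[SplitCondition]) -> bool:
    """A candidate split is accepted only if every condition accepts it."""
    return all(condition(split) for condition in conditions)


# The "inline python list of constraints" from the discussion.
conditions = [min_samples_leaf(5)]
accepted = check_conditions(CandidateSplit(0, 0.5, 10, 7), conditions)
rejected = check_conditions(CandidateSplit(0, 0.5, 10, 2), conditions)
```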

adam2392 · 2024-06-10T14:54:12Z

sklearn/tree/_splitter.pxd

+# despite the fact that we can access scalar extension type properties in such a context,
+# as for instance node_split_best does with Criterion and Partition,
+# and we can access the elements of a memory view of primitive types in such a context


I can't follow what you mean here. Is this related to the "nice ideas" listed above?

adam2392 · 2024-06-10T14:58:36Z

sklearn/tree/_splitter.pyx

+    intp_t n_missing,
+    bint missing_go_to_left,
+    float64_t lower_bound,
+    float64_t upper_bound,


These are part of SplitRecord

If I read the existing code correctly, those values are set in current_split only after the split has passed the pre- and post-split conditions and yielded a greater impurity improvement than best_split. So the n_missing, missing_go_to_left, lower_bound, and upper_bound values in current_split are potentially garbage at the time these split rejection conditions are tested.

Perhaps a better alternative to passing current_split into these split rejection conditions would be to simply pass the candidate feature dimension and split point.

Yeah I would say either:

i) pass in only SplitRecord and let the function implementation worry about what is garbage vs. not, because you shouldn't use garbage values anyway
ii) only pass in parameters that are necessary

It shouldn't pass in both the split record and the parameters explicitly.

What it will likely ultimately end up looking like is something similar to the BuildEnv struct pattern added to DepthFirstTreeBuilder.build... I haven't yet thought hard about what the final form of this signature should look like, but it would need to contain all the splitter state information necessary for arbitrary split constraints to decide thumbs up or down. The platonic ideal would make that part of a deliberately curated interface.

adam2392 · 2024-06-10T14:58:41Z

sklearn/tree/_splitter.pyx

+    intp_t n_missing,
+    bint missing_go_to_left,
+    float64_t lower_bound,
+    float64_t upper_bound,


These are part of SplitRecord

#67 (comment)

adam2392 · 2024-06-10T14:59:04Z

sklearn/tree/_splitter.pyx

+# cdef struct HasDataEnv:
+#     int min_samples
+
+# cdef bint has_data_condition(
+#     Splitter splitter,
+#     SplitRecord* current_split,
+#     intp_t n_missing,
+#     bint missing_go_to_left,
+#     float64_t lower_bound,
+#     float64_t upper_bound,
+#     SplitConditionEnv split_condition_env
+# ) noexcept nogil:
+#     cdef HasDataEnv* e = <HasDataEnv*>split_condition_env
+#     return splitter.n_samples >= e.min_samples
+
+# cdef class HasDataCondition(SplitCondition):
+#     def __cinit__(self, int min_samples):
+#         self.c.f = has_data_condition
+#         self.c.e = malloc(sizeof(HasDataEnv))
+#         (<HasDataEnv*>self.c.e).min_samples = min_samples
+
+#     def __dealloc__(self):
+#         if self.c.e is not NULL:
+#             free(self.c.e)
+
+#         super.__dealloc__(self)
+
+# cdef struct AlphaRegularityEnv:
+#     float64_t alpha
+
+# cdef bint alpha_regularity_condition(
+#     Splitter splitter,
+#     SplitRecord* current_split,
+#     intp_t n_missing,
+#     bint missing_go_to_left,
+#     float64_t lower_bound,
+#     float64_t upper_bound,
+#     SplitConditionEnv split_condition_env
+# ) noexcept nogil:
+#     cdef AlphaRegularityEnv* e = <AlphaRegularityEnv*>split_condition_env
+
+#     return True
+
+# cdef class AlphaRegularityCondition(SplitCondition):
+#     def __cinit__(self, float64_t alpha):
+#         self.c.f = alpha_regularity_condition
+#         self.c.e = malloc(sizeof(AlphaRegularityEnv))
+#         (<AlphaRegularityEnv*>self.c.e).alpha = alpha
+
+#     def __dealloc__(self):
+#         if self.c.e is not NULL:
+#             free(self.c.e)
+
+#         super.__dealloc__(self)
+
+
+# from ._tree cimport Tree
+# cdef class FooTree(Tree):
+#     cdef Splitter splitter
+
+#     def __init__(self):
+#         self.splitter = Splitter(
+#             presplit_conditions = [HasDataCondition(10)],
+#             postsplit_conditions = [AlphaRegularityCondition(0.1)],
+#         )


Are the lines above outdated fluff we can remove?

Yes. I mainly left them in as a demonstration of the need for the SplitConditionEnv; none of the currently existing SplitConditions require an env because their env is built into the legacy pattern of the omniscient Splitter. So for example if we wanted to do alpha regularity, alpha would be a hyperparameter that would ideally go into a closure. This is a cython implementation of a closure pattern, specifically one that avoids extension types due to the field lookup overhead.

adam2392 · 2024-06-10T15:00:17Z

sklearn/tree/_splitter.pxd

@@ -59,6 +107,8 @@ cdef class BaseSplitter:

    cdef const float64_t[:] sample_weight

+    cdef SplitRecordFactoryClosure split_record_factory


Why is it named Closure?

Because it is a Cython implementation of a closure. C doesn't support closures as a language-level feature, but a struct that binds a function pointer to a struct of variable values functions the same way.
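A minimal plain-Python analogy of that "function pointer plus environment struct" pairing is sketched below. The field names `f` and `e` and the names `HasDataEnv` and `has_data_condition` are borrowed from the commented-out code quoted earlier in this review; the `Closure` class, the `call` helper, and the argument order are assumptions made for the example, not the PR's actual definitions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class HasDataEnv:
    # The captured variable; in C this would be a malloc'd struct.
    min_samples: int


@dataclass
class Closure:
    f: Callable[[Any, int], bool]  # plays the role of the function pointer
    e: Any                         # plays the role of the env pointer


def has_data_condition(env: HasDataEnv, n_samples: int) -> bool:
    # A plain function: all of its "captured" state arrives through env.
    return n_samples >= env.min_samples


def call(closure: Closure, n_samples: int) -> bool:
    # Invoking the closure means passing its own env back to its function.
    return closure.f(closure.e, n_samples)


has_data = Closure(f=has_data_condition, e=HasDataEnv(min_samples=10))
enough = call(has_data, 12)
too_few = call(has_data, 3)
```

Python has real closures, so this indirection is unnecessary there; the sketch only mirrors what the struct has to do in C, where the function cannot capture `min_samples` on its own.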

adam2392 · 2024-06-10T15:00:53Z

sklearn/tree/_splitter.pyx

@@ -485,6 +679,8 @@ cdef inline intp_t node_split_best(
    # n_total_constants = n_known_constants + n_found_constants
    cdef intp_t n_total_constants = n_known_constants

+    cdef bint conditions_hold = True


Suggested change

-    cdef bint conditions_hold = True
+    cdef bint split_is_valid = True

Seems like a more explicit name to me

adam2392 · 2024-06-12T15:48:48Z

Another dumb question: why does DepthFirstTreeBuilder need to change, but not BestFirstTreeBuilder? @SamuelCarliles3

SamuelCarliles3 · 2024-06-12T17:34:31Z

Another dumb question: why does DepthFirstTreeBuilder need to change, but not BestFirstTreeBuilder? @SamuelCarliles3

BFTB will most certainly need to change as well; I'm just starting with DFTB and have not yet gotten to BFTB. IIRC the update functionality had not been added to BFTB(?), and so it did not require an analogous refactor.

SamuelCarliles3 and others added 30 commits February 16, 2024 13:36

init split condition injection

8c09f7f

wip

ecfc9b1

wip

0c3d5c0

wip

5fd12a2

injection progress

b593ee0

injection progress

180fac3

split injection refactoring

c207c3e

added condition parameter passthrough prototype

7cc71c1

some tidying

2470d49

more tidying

ee3399f

splitter injection refactoring

a079e4f

cython injection due diligence, converted min_sample and monotonic_cs…

5397b66

…t to injections

tree tests pass huzzah!

44f1d57

added some splitconditions to header

4f19d53

commented out some sample code that was substantially increasing peak…

cb71be0

… memory utilization in asv

added vector resize

e34be5c

wip

aac802e

Merge branch 'submodulev3' into scarliles/splitter-injection-redux

c12f2fd

settling injection memory management for now

a7f5e92

added regression forest benchmark

7a70a0b

Merge pull request #2 from ssec-jhu/scarliles/regression-benchmark

d9ad68a

added regression forest benchmark

ran black for linting check

893d588

Merge branch 'submodulev3' of github.com:ssec-jhu/scikit-learn into s…

548493c

…ubmodulev3

Merge branch 'submodulev3' into scarliles/regression-benchmark

e4b53ff

Merge branch 'neurodata:submodulev3' into submodulev3

089d901

Merge branch 'submodulev3' of github.com:ssec-jhu/scikit-learn into s…

3ba5f74

…ubmodulev3

Merge branch 'scarliles/splitter-injection-redux' into scarliles/regr…

cf285c1

…ession-benchmark

Merge pull request #3 from ssec-jhu/scarliles/regression-benchmark

ffc6328

upstream changes

initial pass at refactoring DepthFirstTreeBuilder.build

87c90fd

some renaming to make closure pattern more obvious

51da586

SamuelCarliles3 added 8 commits May 28, 2024 15:52

added SplitRecordFactory

6c117a2

Merge branch 'scarliles/update-node-refactor2' into scarliles/update-…

c7b675b

…node-refactor3

SplitRecordFactory progress

9e7b131

build loop refactor

a017669

add_or_update tweak

4325b0a

reverted to back out build body refactor

78c3a1b

refactor baby step

b8cc636

update node refactor more baby steps

f225658

adam2392 requested review from adam2392, SUKI-O, PSSF23 and sampan501 June 6, 2024 15:41

PSSF23 reviewed Jun 6, 2024

adam2392 reviewed Jun 10, 2024

adam2392 changed the title ~~Scarliles/update node refactor3~~ ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism Jun 12, 2024



