-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67
base: submodulev3
Are you sure you want to change the base?
ENH Splitter Injection and Refactoring of DepthFirstTreeBuilder's building mechanism #67
Conversation
… memory utilization in asv
added regression forest benchmark
upstream changes
❌ Linting issuesThis PR is introducing linting issues. Here's a summary of the issues. Note that you can avoid having linting issues by enabling You can see the details of the linting issues under the
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Refactoring of streaming code looks good to me. But these two errors occurred in checks:
FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[squared_error] - AssertionError
FAILED tree/tests/test_tree.py::test_missing_values_on_equal_nodes_no_missing[friedman_mse] - AssertionError
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've left a few questions and requests for explanation, or improved documentation. A few general comments:
- Can you copy/paste results of benchmarking into the PR description somewhere, so we can have this documented?
- Can we describe in the PR what is being changed and why? If any of this explanation is good to include also as in-line comments, then feel free to do so.
# A record on the stack for depth-first tree growing | ||
cdef struct StackRecord: | ||
intp_t start | ||
intp_t end | ||
intp_t depth | ||
intp_t parent | ||
bint is_left | ||
float64_t impurity | ||
intp_t n_constant_features | ||
float64_t lower_bound | ||
float64_t upper_bound |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this moved for any particular reason? Just so I'm aware.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is required by BuildEnv, which is defined in this file.
cdef extern from "<stack>" namespace "std" nogil: | ||
cdef cppclass stack[T]: | ||
ctypedef T value_type | ||
stack() except + | ||
bint empty() | ||
void pop() | ||
void push(T&) except + # Raise c++ exception for bad_alloc -> MemoryError | ||
T& top() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this can be cimported from Cython directly
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This bit of code was moved from _tree.pyx so that BuildEnv would have the definition of stack. If the definition can be simplified, we should do that.
void push(T&) except + # Raise c++ exception for bad_alloc -> MemoryError | ||
T& top() | ||
|
||
cdef struct BuildEnv: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any technical reason we need the struct here besides encapsulation of all the parameters?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here as opposed to _tree.pyx? I put BuildEnv here because it will likely need to be visible to event handlers slated for addition, and _tree.pxd seemed like the place to put the "interface" to the module. It can go anywhere that can be made available to external event handlers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry I meant, why do we need the BuildEnv
in the first place.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or do you mean why is the struct required at all? It is just to encapsulate the function state; we could just as easily pass all the variables/references in as function args, but this seemed cleaner and definitely more expedient during development. More importantly, it's a pattern that will be required for the forthcoming tree build event handling, and might need to be added to the splitter injection.
In general we have this algorithm which remains broadly the same shape but with an arbitrary (and growing) number of variations we'd like to add without foreknowledge and without perpetual updates to the algorithm code itself. Those future additions will require differing degrees of visibility into the algorithm state. So IMO it seemed cleanest (and most performant) to simply encapsulate the algorithm state in a struct whose address we can pass around.
# NICE IDEAS THAT DON'T APPEAR POSSIBLE | ||
# - accessing elements of a memory view of cython extension types in a nogil block/function | ||
# - storing cython extension types in cpp vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# NICE IDEAS THAT DON'T APPEAR POSSIBLE | |
# - accessing elements of a memory view of cython extension types in a nogil block/function | |
# - storing cython extension types in cpp vectors | |
# NICE IDEAS THAT DON'T APPEAR POSSIBLE (Samuel) | |
# 1. accessing elements of a memory view of cython extension types in a nogil block/function | |
# 2. storing cython extension types in cpp vectors |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would be great to also comment on what these nice ideas are trying to accomplish. I.e. what's the problem for a new developer coming in and reading this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're simply trying to add a way of injecting functionality whose implementation details are TBD. We just want a way of saying "here's a candidate split, let me check it against any arbitrary validity constraints you may want to impose at some future date as of the time of this writing". So we want to accept a list, a memoryview, array, vector, whatever, of instantiated split constraints. Ideally the interface is a simple python one-liner, so at runtime I can just define an inline python list of constraints. But that list of constraints then needs to be executable performantly in a cython nogil block.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I understand. Just hoping to document all the thoughts in a clean manner, so we don't lose this trains of thoughts when new developers come thru.
# despite the fact that we can access scalar extension type properties in such a context, | ||
# as for instance node_split_best does with Criterion and Partition, | ||
# and we can access the elements of a memory view of primitive types in such a context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't follow what you mean here. Is this related to the "nice ideas" listed above?
intp_t n_missing, | ||
bint missing_go_to_left, | ||
float64_t lower_bound, | ||
float64_t upper_bound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are part of SplitRecord
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I read the existing code correctly, those values are set in current_split
only after it is accepted as passing pre-and-post split conditions, and yielding a greater impurity improvement than best_split
, so that the n_missing
, missing_go_to_left
, lower_bound
, and upper_bound
values in current_split
are potentially garbage values at the time these split rejection conditions are tested.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps a better alternative to passing current_split
into these split rejection conditions would be to simply pass the candidate feature dimension and split point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah I would say either:
i) pass in only SplitRecord and let the function implementation worry about what is garbage vs not, cuz you shouldn't use garbage values anyways
ii) only pass in parameters that are necessary
It shouldn't pass in split record and parameters explicitly.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What it will likely ultimately end up looking like is something similar to the BuildEnv struct pattern added to DepthFirstTreeBuilder.build
... I haven't yet thought hard about what the final form of this signature should look like, but it would need to contain all the splitter state information necessary for arbitrary split constraints to decide thumbs up or down. The platonic ideal would make that part of a deliberately curated interface.
intp_t n_missing, | ||
bint missing_go_to_left, | ||
float64_t lower_bound, | ||
float64_t upper_bound, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are part of SplitRecord
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
# cdef struct HasDataEnv: | ||
# int min_samples | ||
|
||
# cdef bint has_data_condition( | ||
# Splitter splitter, | ||
# SplitRecord* current_split, | ||
# intp_t n_missing, | ||
# bint missing_go_to_left, | ||
# float64_t lower_bound, | ||
# float64_t upper_bound, | ||
# SplitConditionEnv split_condition_env | ||
# ) noexcept nogil: | ||
# cdef HasDataEnv* e = <HasDataEnv*>split_condition_env | ||
# return splitter.n_samples >= e.min_samples | ||
|
||
# cdef class HasDataCondition(SplitCondition): | ||
# def __cinit__(self, int min_samples): | ||
# self.c.f = has_data_condition | ||
# self.c.e = malloc(sizeof(HasDataEnv)) | ||
# (<HasDataEnv*>self.c.e).min_samples = min_samples | ||
|
||
# def __dealloc__(self): | ||
# if self.c.e is not NULL: | ||
# free(self.c.e) | ||
|
||
# super.__dealloc__(self) | ||
|
||
# cdef struct AlphaRegularityEnv: | ||
# float64_t alpha | ||
|
||
# cdef bint alpha_regularity_condition( | ||
# Splitter splitter, | ||
# SplitRecord* current_split, | ||
# intp_t n_missing, | ||
# bint missing_go_to_left, | ||
# float64_t lower_bound, | ||
# float64_t upper_bound, | ||
# SplitConditionEnv split_condition_env | ||
# ) noexcept nogil: | ||
# cdef AlphaRegularityEnv* e = <AlphaRegularityEnv*>split_condition_env | ||
|
||
# return True | ||
|
||
# cdef class AlphaRegularityCondition(SplitCondition): | ||
# def __cinit__(self, float64_t alpha): | ||
# self.c.f = alpha_regularity_condition | ||
# self.c.e = malloc(sizeof(AlphaRegularityEnv)) | ||
# (<AlphaRegularityEnv*>self.c.e).alpha = alpha | ||
|
||
# def __dealloc__(self): | ||
# if self.c.e is not NULL: | ||
# free(self.c.e) | ||
|
||
# super.__dealloc__(self) | ||
|
||
|
||
# from ._tree cimport Tree | ||
# cdef class FooTree(Tree): | ||
# cdef Splitter splitter | ||
|
||
# def __init__(self): | ||
# self.splitter = Splitter( | ||
# presplit_conditions = [HasDataCondition(10)], | ||
# postsplit_conditions = [AlphaRegularityCondition(0.1)], | ||
# ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are the lines above outdated fluff we can remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. I mainly left them in as a demonstration of the need for the SplitConditionEnv; none of the currently existing SplitConditions require an env because their env is built into the legacy pattern of the omniscient Splitter. So for example if we wanted to do alpha regularity, alpha would be a hyperparameter that would ideally go into a closure. This is a cython implementation of a closure pattern, specifically one that avoids extension types due to the field lookup overhead.
@@ -59,6 +107,8 @@ cdef class BaseSplitter: | |||
|
|||
cdef const float64_t[:] sample_weight | |||
|
|||
cdef SplitRecordFactoryClosure split_record_factory |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is it named Closure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because it is a cython implementation of a closure. C doesn't support closures as a language level feature, but a struct of a function pointer bound with a struct of variable values functions the same.
@@ -485,6 +679,8 @@ cdef inline intp_t node_split_best( | |||
# n_total_constants = n_known_constants + n_found_constants | |||
cdef intp_t n_total_constants = n_known_constants | |||
|
|||
cdef bint conditions_hold = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cdef bint conditions_hold = True | |
cdef bint split_is_valid = True |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like a more explicit name to me
Another dumb question: why is depthfirsttreebuilder need to change, but not bestfirsttreebuilder? @SamuelCarliles3 |
BFTB will most certainly need to change as well, I'm just starting with DFTB, and have not yet gotten to BFTB. IIRC the update functionality had not been added to BFTB(?), and so it did not require an analogous refactor. |
Reference Issues/PRs
What does this implement/fix? Explain your changes.
Includes splitter injection and adds refactor of DepthFirstTreeBuilder.build
Any other comments?