Add example of a sklearn like pipeline #294
Conversation
thanks @cbourjau, much appreciated!
Thank you very much for the review! I addressed your comments and have a new question about
Before I give my opinion, just to understand better - could you clarify what you mean by lazy scalar please? How would it behave? Could you give an example of how you would use it?
I am coming to the data frame API with a point of view that it can be implemented on top of an array-api implementation. In such an implementation, there may be lazy tensors. A lazy tensor of rank 0 is what I mean when I say "lazy scalar". An example of its usage can be seen in this example under the assumption that
exactly what I was just thinking about 😄 In that case, I don't think we even need scalars. Your thoughts on #297 would be much appreciated if possible
for column_name in df.column_names:
    if column_name not in self.column_names:
        continue
    column = df.get_column_by_name(column_name) / self.scalings_[column_name]
@kkraus14 just to clarify about the scalar issue - should this be written as `column = df.get_column_by_name(column_name) / float(self.scalings_[column_name])` instead, so that `__float__` is called and each implementation knows what to do (e.g. trigger compute, raise, return eager value, ...)?
No, that would force materialization unnecessarily. `self.scalings_` items get defined via `scalings.get_column_by_name(column_name).get_value(0)`, which could return a ducktyped lazy scalar, and then this division could be lazy as well.
I.e. how this works in cuDF today is that the `get_value(0)` call returns a `cudf.Scalar` object which lives on the GPU, and then the `__div__` operation with the column can use that on-GPU value directly. Adding `float(...)` requires returning a Python float, which then prevents doing things like the above.
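To illustrate the difference being described, here is a minimal sketch. None of these names come from the standard or from cuDF: `DeviceScalar` and `Column` are made up for illustration, and a numpy array stands in for an on-GPU or lazy value.

```python
import numpy as np


class DeviceScalar:
    """Stand-in for a backend-native scalar, e.g. a value living on the GPU."""

    def __init__(self, value):
        self._value = np.asarray(value)  # stays "on the backend"

    def __float__(self):
        # Only this call forces conversion to a plain Python float.
        return float(self._value)


class Column:
    """Stand-in column that can divide by a backend-native scalar directly."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __truediv__(self, other):
        if isinstance(other, DeviceScalar):
            # Use the backend value directly; no float() round-trip needed.
            return Column(self._data / other._value)
        return Column(self._data / other)


col = Column([2.0, 4.0, 6.0])
scale = DeviceScalar(2.0)

backend_friendly = col / scale   # stays backend-native / lazy
forced = col / float(scale)      # float() forces materialization first
```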
thanks for explaining, OK with not using `float` here
Sorry to belabour the point, but the "lazy scalar" here is the part which I'm just not getting
A fairly common pattern I used to work with when I did data science was:
- train model in some environment (the `fit` part, potentially expensive)
- deploy model somewhere (the `transform`/`predict` part, usually much cheaper)

E.g.: train a model somewhere, and save some weights. Then, say on a device (like a smart watch), make predictions using that model (and the saved weights)
By the time you're doing inference (`.transform`), the training data doesn't exist any more - at least, not in the environment you're doing inference on
So how can something calculated during the `fit` phase stay lazy?
I think this is what @cbourjau was getting at as well, saying that for `fit`, they need to materialise values
I think that's an implementation detail of the library in question? The output of `fit` is effectively `self.scalings_`, which is a dictionary of ducktyped scalars. Those ducktyped scalars could effectively be an expression, and then the `__div__` in `.transform` just continues to build expressions until something triggers materialization, whether that's a call to something like `__float__` or `__bool__` or something else.
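A rough sketch of what such an expression-building scalar could look like. All names here are illustrative, not part of the standard:

```python
class ExprScalar:
    """Hypothetical ducktyped scalar: operations keep building an expression,
    and only __float__ / __bool__ trigger materialization."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value

    def __rtruediv__(self, other):
        # e.g. dividing something by this scalar just grows the expression
        return ExprScalar(lambda: other / self._compute())

    def __float__(self):
        return float(self._compute())  # materialization point

    def __bool__(self):
        return bool(self._compute())   # another materialization point


scaling = ExprScalar(lambda: 2.0)  # e.g. what get_value(0) could return
expr = 10.0 / scaling              # still lazy, just a larger expression
print(float(expr))                 # 5.0 -- compute only happens here
```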
The issue I have is that I would typically do:
- `fit` in one environment
- `predict` in a different environment (e.g. on a smartwatch)

If the "something triggers materialization" happens during `predict`, then the expression is going to fail, because the dataframe which is meant to be used for training isn't available any more
What I want to do is:
- `fit` in one environment, and force materialisation so I can export the model weights
- `predict` in a different environment, using the model weights calculated during the `fit` phase

The way the code is written in this PR achieves this by doing

    if hasattr(scalings, 'collect'):
        scalings = scalings.collect()

during the `fit` phase. And we may have to be OK with this if we can't agree on a solution which can be part of the standard
I don't think that's the only option. I think we want a `.may_execute()` type method (ignore the exact method name here), which in the example above (and also @cbourjau's ONNX library which has lazy arrays) can simply be do-nothing, while Polars could alias that to `.collect()`. It would address the concern about implicit materialization, making it explicit.
It wouldn't be super ergonomic to write it as `.may_execute().to_array()`, but probably better than throwing - which isn't a solution really.
This solution is kinda circling back to a small part of Marco's lazy/eager design, without the separate classes or expression objects.
Other thought on Polars there: it could also choose to do an optimization pass, where e.g. it could figure out for a block of code with multiple `may_execute`'s which ones have to collect and which ones don't. That'd actually be generically useful outside of dataframe API support, because the manual "must execute" imperative nature of `.collect` is not so easy to write optimally (as we already discovered in another discussion on this repo).
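A rough sketch of the idea, taking the method name from the comment above. Nothing here is a settled API: the wrapper classes are hypothetical, while `polars.LazyFrame.collect()` and `DataFrame.lazy()` are existing Polars calls.

```python
import polars as pl


class OnnxBackedDataFrame:
    """Fully lazy backend: the hint is a no-op and graph building continues."""

    def may_execute(self):
        return self


class PolarsBackedDataFrame:
    """Lazy Polars backend: alias the hint to an actual collect()."""

    def __init__(self, lf: pl.LazyFrame):
        self._lf = lf

    def may_execute(self):
        # Materialize here, then keep exposing a lazy frame so subsequent
        # dataframe-API calls continue to compose as before.
        return PolarsBackedDataFrame(self._lf.collect().lazy())
```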
`may_execute` sounds fine, thanks
btw yesterday I saw a colleague write this: "Has anyone seen hvplot causing 3 identical computes when working off of a dask dataframe? (version 0.9.0) I dont know if I should be calling something differently or if this is a bug?". If I was -1 on automated collection before, now I'm more like -10 😄
Note that in `skrub.DatetimeEncoder.fit`, even Dask would throw an error telling the user to call `.compute`
I've added an example of this in #307, as well as a proposal to add `maybe_execute` as a hint rather than as a directive
> Or is it rather a predefined relationship where any dataframe implementation specifies exactly one array API implementation?

I think it's this - each implementation can choose which array library to defer to when someone calls `to_array`. For cudf it would probably be cupy, for pandas/polars numpy
If we have a 1:1 relationship between the dataframe and the array API standard then we have successfully kicked the can down the road (i.e. tech-stack) into the array API standard. Calling `to_array` will always trigger materialization if the associated array-API implementation is eager. If the associated array-API implementation is lazy, then it will remain lazy. Would this be acceptable for the Pandas/Polars-backed implementations? I'm afraid I don't quite see the intention behind `may_execute`?
The above-described semantics would work well with ONNX as far as the dataframe API is concerned. We are left with the same question of explicit materialization on the array API level, though.
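For concreteness, a sketch of the 1:1 idea being proposed here. The class names are made up; `Series.to_numpy()` is the real pandas call, the ONNX-style lazy tensor is a placeholder.

```python
import pandas as pd


class PandasBackedColumn:
    """Defers to numpy: to_array materializes, because numpy is eager."""

    def __init__(self, series: pd.Series):
        self._series = series

    def to_array(self):
        return self._series.to_numpy()


class OnnxBackedColumn:
    """Defers to a lazy array implementation: to_array stays lazy and
    tracing simply continues on the array-API side."""

    def __init__(self, lazy_tensor):
        self._tensor = lazy_tensor

    def to_array(self):
        return self._tensor
```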
> Calling to_array will always trigger materialization if the associated array-API implementation is eager. [...] Would this be acceptable for the Pandas/Polars-backed implementations?

I'm afraid not, sorry: #307 (comment)
I think we can update based on recent scalars / `persist` discussions. But first, just wanted to check something
Realistically, would you want the output of `.fit` to be lazy?
I think you'd typically want it to be guaranteed to not be lazy. In my experience in data science, what was typical was:
- fit model once (this can be expensive)
- predict / transform multiple times, on multiple datasets (this is typically cheaper)

So, to be honest, I would want `fit` to evaluate everything it needs to, so that `predict` / `transform` can be cheap
Does this match your experience?
If so, maybe we should write

    self.scalings_ = {
        column_name: float(scalings.get_column_by_name(column_name).get_value(0))
        for column_name in self.column_names
    }

in which case we'd need to add `__float__` to the Scalar class. Assuming you even want a float
Adding `persist` inside `fit` isn't really ideal, because it means that `scalings.col(column_name).get_value(0)` will be repeatedly evaluated inside each `transform` call
Could you do
I've pushed some commits to update this based on other recent API changes, anyone have objections?
approved during the call, thanks @cbourjau!
This is an example of how a (possibly) lazy data frame may be used in a sklearn-like pipeline.
The example is motivated by the prospect of a fully lazy, ONNX-based data frame implementation. The concept is that calls to `fit` are eager. They compute some state that is later meant to be transferred into the ONNX graph/model. That transfer happens when calling `transform`. The logic within the `transform` methods is "traced" lazily and the resulting lazy object is then exported to ONNX.
Related to #281 and #249 (comment)
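For readers skimming this thread later, here is a condensed sketch of the shape of such a pipeline step. It is not the exact code from this PR: `get_column_by_name`, `get_value`, and `column_names` follow the API used above, `std()` and `assign()` are assumed spellings of the reduction and the column write-back, and the `float(...)` materialization inside `fit` is exactly the point debated in the review.

```python
class StandardScaler:
    """Sklearn-like step on top of the dataframe API: eager fit, lazy transform."""

    def __init__(self, column_names):
        self.column_names = column_names

    def fit(self, df):
        scalings = df.std()  # assumed reduction returning a one-row dataframe
        # Eager step: store plain Python floats so that transform no longer
        # needs the training data (or its lazy graph) at all.
        self.scalings_ = {
            name: float(scalings.get_column_by_name(name).get_value(0))
            for name in self.column_names
        }
        return self

    def transform(self, df):
        # Lazy step: only dataframe-API calls and the stored floats are used,
        # so an ONNX-backed implementation can trace this into its graph.
        for name in df.column_names:
            if name not in self.column_names:
                continue
            column = df.get_column_by_name(name) / self.scalings_[name]
            df = df.assign(column)  # assumed write-back call; exact API may differ
        return df
```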