Add example of a sklearn like pipeline #294
Conversation
thanks @cbourjau, much appreciated!
Thank you very much for the review! I addressed your comments and have a new question about
Before I give my opinion, just to understand better - could you clarify what you mean by lazy scalar please? How would it behave? Could you give an example of how you would use it?
I am coming to the data frame API with a point of view that it can be implemented on top of an array-api implementation. In such an implementation, there may be lazy tensors. A lazy tensor of rank 0 is what I mean when I say "lazy scalar". An example of its usage can be seen in this example under the assumption that
exactly what I was just thinking about 😄 In that case, I don't think we even need scalars. Your thoughts on #297 would be much appreciated if possible
for column_name in df.column_names:
    if column_name not in self.column_names:
        continue
    column = df.get_column_by_name(column_name) / self.scalings_[column_name]
@kkraus14 just to clarify about the scalar issue - should this be written as `column = df.get_column_by_name(column_name) / float(self.scalings_[column_name])` instead, so that `__float__` is called and each implementation knows what to do (e.g. trigger compute, raise, return eager value, ...)?
No, that would force materialization unnecessarily. `self.scalings_` items get defined via `scalings.get_column_by_name(column_name).get_value(0)`, which could return a ducktyped lazy scalar, and then this division could be lazy as well.
I.e. how this works in cuDF today is that the `get_value(0)` call returns a `cudf.Scalar` object which lives on the GPU, and then the `__div__` operation with the column can use that on-GPU value directly. Adding `float(...)` requires returning a Python float, which then prevents doing things like the above.
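To illustrate the difference being described, here is a minimal sketch. None of these names come from the standard or from cuDF: `DeviceScalar` and `Column` are made up for illustration, and a numpy array stands in for an on-GPU or lazy value.

```python
import numpy as np


class DeviceScalar:
    """Stand-in for a backend-native scalar, e.g. a value living on the GPU."""

    def __init__(self, value):
        self._value = np.asarray(value)  # stays "on the backend"

    def __float__(self):
        # Only this call forces conversion to a plain Python float.
        return float(self._value)


class Column:
    """Stand-in column that can divide by a backend-native scalar directly."""

    def __init__(self, data):
        self._data = np.asarray(data)

    def __truediv__(self, other):
        if isinstance(other, DeviceScalar):
            # Use the backend value directly; no float() round-trip needed.
            return Column(self._data / other._value)
        return Column(self._data / other)


col = Column([2.0, 4.0, 6.0])
scale = DeviceScalar(2.0)

backend_friendly = col / scale   # stays backend-native / lazy
forced = col / float(scale)      # float() forces materialization first
```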
thanks for explaining, OK with not using `float` here
Sorry to belabour the point, but the "lazy scalar" here is the part which I'm just not getting
A fairly common pattern I used to work with when I did data science was:
- train model in some environment (the `fit` part, potentially expensive)
- deploy model somewhere (the `transform`/`predict` part, usually much cheaper)

E.g.: train a model somewhere, and save some weights. Then, say on a device (like a smart watch), make predictions using that model (and the saved weights)
By the time you're doing inference (`.transform`), the training data doesn't exist any more - at least, not in the environment you're doing inference on
So how can something calculated during the `fit` phase stay lazy?
I think this is what @cbourjau was getting at as well, saying that for `fit`, they need to materialise values
I think that's an implementation detail of the library in question? The output of `fit` is effectively `self.scalings_`, which is a dictionary of ducktyped scalars. Those ducktyped scalars could effectively be an expression, and then the `__div__` in `.transform` just continues to build expressions until something triggers materialization, whether that's a call to something like `__float__` or `__bool__` or something else.
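A rough sketch of what such an expression-building scalar could look like. All names here are illustrative, not part of the standard:

```python
class ExprScalar:
    """Hypothetical ducktyped scalar: operations keep building an expression,
    and only __float__ / __bool__ trigger materialization."""

    def __init__(self, compute):
        self._compute = compute  # zero-argument callable producing the value

    def __rtruediv__(self, other):
        # e.g. dividing something by this scalar just grows the expression
        return ExprScalar(lambda: other / self._compute())

    def __float__(self):
        return float(self._compute())  # materialization point

    def __bool__(self):
        return bool(self._compute())   # another materialization point


scaling = ExprScalar(lambda: 2.0)  # e.g. what get_value(0) could return
expr = 10.0 / scaling              # still lazy, just a larger expression
print(float(expr))                 # 5.0 -- compute only happens here
```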
The issue I have is that I would typically do:
- `fit` in one environment
- `predict` in a different environment (e.g. on a smartwatch)

If the "something triggers materialization" happens during `predict`, then the expression is going to fail, because the dataframe which is meant to be used for training isn't available any more
What I want to do is:
- `fit` in one environment, and force materialisation so I can export the model weights
- `predict` in a different environment, using the model weights calculated during the `fit` phase

The way the code is written in this PR achieves this by doing

    if hasattr(scalings, 'collect'):
        scalings = scalings.collect()

during the `fit` phase. And we may have to be OK with this if we can't agree on a solution which can be part of the standard
I don't think that's the only option. I think we want a `.may_execute()` type method (ignore the exact method name here), which in the example above (and also @cbourjau's ONNX library which has lazy arrays) can simply be do-nothing, while Polars could alias that to `.collect()`. It would address the concern about implicit materialization, making it explicit.
It wouldn't be super ergonomic to write it as `.may_execute().to_array()`, but probably better than throwing - which isn't a solution really.
This solution is kinda circling back to a small part of Marco's lazy/eager design, without the separate classes or expression objects.
Other thought on Polars there: it could also choose to do an optimization pass, where e.g. it could figure out for a block of code with multiple `may_execute`'s which ones have to collect and which ones don't. That'd actually be generically useful outside of dataframe API support, because the manual "must execute" imperative nature of `.collect` is not so easy to write optimally (as we already discovered in another discussion on this repo).
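A rough sketch of the idea, taking the method name from the comment above. Nothing here is a settled API: the wrapper classes are hypothetical, while `polars.LazyFrame.collect()` and `DataFrame.lazy()` are existing Polars calls.

```python
import polars as pl


class OnnxBackedDataFrame:
    """Fully lazy backend: the hint is a no-op and graph building continues."""

    def may_execute(self):
        return self


class PolarsBackedDataFrame:
    """Lazy Polars backend: alias the hint to an actual collect()."""

    def __init__(self, lf: pl.LazyFrame):
        self._lf = lf

    def may_execute(self):
        # Materialize here, then keep exposing a lazy frame so subsequent
        # dataframe-API calls continue to compose as before.
        return PolarsBackedDataFrame(self._lf.collect().lazy())
```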
`may_execute` sounds fine, thanks
btw yesterday I saw a colleague write this: "Has anyone seen hvplot causing 3 identical computes when working off of a dask dataframe? (version 0.9.0) I dont know if I should be calling something differently or if this is a bug?". If I was -1 on automated collection before, now I'm more like -10 😄
Note that in `skrub.DatetimeEncoder.fit`, even Dask would throw an error telling the user to call `.compute`
I've added an example of this in #307, as well as a proposal to add `maybe_execute` as a hint rather than as a directive
> Or is it rather a predefined relationship where any dataframe implementation specifies exactly one array API implementation?

I think it's this - each implementation can choose which array library to defer to when someone calls `to_array`. For cudf it would probably be cupy, for pandas/polars numpy
If we have a 1:1 relationship between the dataframe and the array API standard then we have successfully kicked the can down the road (i.e. tech-stack) into the array API standard. Calling `to_array` will always trigger materialization if the associated array-API implementation is eager. If the associated array-API implementation is lazy, then it will remain lazy. Would this be acceptable for the Pandas/Polars-backed implementations? I'm afraid I don't quite see the intention behind `may_execute`?
The above-described semantics would work well with ONNX as far as the dataframe API is concerned. We are left with the same question of explicit materialization on the array API level, though.
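For concreteness, a sketch of the 1:1 idea being proposed here. The class names are made up; `Series.to_numpy()` is the real pandas call, the ONNX-style lazy tensor is a placeholder.

```python
import pandas as pd


class PandasBackedColumn:
    """Defers to numpy: to_array materializes, because numpy is eager."""

    def __init__(self, series: pd.Series):
        self._series = series

    def to_array(self):
        return self._series.to_numpy()


class OnnxBackedColumn:
    """Defers to a lazy array implementation: to_array stays lazy and
    tracing simply continues on the array-API side."""

    def __init__(self, lazy_tensor):
        self._tensor = lazy_tensor

    def to_array(self):
        return self._tensor
```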
> Calling to_array will always trigger materialization if the associated array-API implementation is eager. [...] Would this be acceptable for the Pandas/Polars-backed implementations?

I'm afraid not, sorry: #307 (comment)
I think we can update based on recent scalars / `persist` discussions. But first, just wanted to check something
Realistically, would you want the output of `.fit` to be lazy?
I think you'd typically want it to be guaranteed to not be lazy. In my experience in data science, what was typical was:
- fit model once (this can be expensive)
- predict / transform multiple times, on multiple datasets (this is typically cheaper)

So, to be honest, I would want `fit` to evaluate everything it needs to, so that `predict` / `transform` can be cheap
Does this match your experience?
If so, maybe we should write

    self.scalings_ = {
        column_name: float(scalings.get_column_by_name(column_name).get_value(0))
        for column_name in self.column_names
    }

in which case we'd need to add `__float__` to the Scalar class. Assuming you even want a float
Adding `persist` inside `fit` isn't really ideal, because it means that `scalings.col(column_name).get_value(0)` will be repeatedly evaluated inside each `transform` call
Could you do
I've pushed some commits to update this based on other recent API changes, anyone have objections?
approved during the call, thanks @cbourjau!
This is an example of how a (possibly) lazy data frame may be used in a sklearn-like pipeline.
The example is motivated by the prospect of a fully lazy, ONNX-based data frame implementation. The concept is that calls to `fit` are eager. They compute some state that is later meant to be transferred into the ONNX graph/model. That transfer happens when calling `transform`. The logic within the `transform` methods is "traced" lazily and the resulting lazy object is then exported to ONNX.
Related to #281 and #249 (comment)
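For readers skimming this thread later, here is a condensed sketch of the shape of such a pipeline step. It is not the exact code from this PR: `get_column_by_name`, `get_value`, and `column_names` follow the API used above, `std()` and `assign()` are assumed spellings of the reduction and the column write-back, and the `float(...)` materialization inside `fit` is exactly the point debated in the review.

```python
class StandardScaler:
    """Sklearn-like step on top of the dataframe API: eager fit, lazy transform."""

    def __init__(self, column_names):
        self.column_names = column_names

    def fit(self, df):
        scalings = df.std()  # assumed reduction returning a one-row dataframe
        # Eager step: store plain Python floats so that transform no longer
        # needs the training data (or its lazy graph) at all.
        self.scalings_ = {
            name: float(scalings.get_column_by_name(name).get_value(0))
            for name in self.column_names
        }
        return self

    def transform(self, df):
        # Lazy step: only dataframe-API calls and the stored floats are used,
        # so an ONNX-backed implementation can trace this into its graph.
        for name in df.column_names:
            if name not in self.column_names:
                continue
            column = df.get_column_by_name(name) / self.scalings_[name]
            df = df.assign(column)  # assumed write-back call; exact API may differ
        return df
```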