propose solution to remove hidden feature type conversion in Univariate Drift #398

Duncan-Hunter · 2024-06-17T19:53:59Z

Converting a continuous feature to a categorical one, even when it has few unique values, may lead to problems. I think this conversion shouldn't happen without a warning, and ideally shouldn't happen at all.

If a user wants a continuous column with few unique values to be treated as categorical, they can set it as a categorical column themselves. Happy to discuss it, work through it, be convinced I'm wrong etc.

Related to #397.

nnansters · 2024-06-19T18:24:40Z

Hey Duncan, quickly wanted to chime in here.

I agree that having this conversion happen more explicitly would be better. We're always walking a fine line between making stuff easy and making it explicit. This should be explicit.

One thing holding us back here, however, is our current implementation, which runs a single calculator on a whole set of columns for a whole list of univariate drift methods. The changes you proposed definitely work on the Method level, but we don't expose that interface directly. Method instances are created within the calculator. So the calculator would have to take in a parameter stating for each column whether it is continuous or categorical. As we might deal with potentially hundreds of columns, this might get cumbersome.

We would like to split everything up so you can just create and use Method instances directly. That way it would be a lot easier to specify what columns should be calculated by each. This change is linked to a big refactor we're designing, but we don't have a timeline for it yet.

Maybe for now we can make the conversion a bit more explicit, or at least not hidden away, by issuing a warning, as you did in your PR?
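A minimal sketch of what such a warning could look like. The function name, the `'cont'`/`'cat'` labels and the threshold of 50 unique values are assumptions made for illustration only; this is not the library's actual code or heuristic.

```python
import warnings
from typing import Sequence

# Hypothetical cutoff, chosen only for illustration.
UNIQUE_VALUE_THRESHOLD = 50


def resolve_treat_as_type(name: str, values: Sequence[float], declared_type: str) -> str:
    """Return 'cat' or 'cont', warning when a continuous column is reinterpreted."""
    n_unique = len(set(values))
    if declared_type == "cont" and n_unique < UNIQUE_VALUE_THRESHOLD:
        # Make the conversion visible instead of silently switching types.
        warnings.warn(
            f"Column '{name}' was declared continuous but has only "
            f"{n_unique} unique values; treating it as categorical.",
            UserWarning,
        )
        return "cat"
    return declared_type
```

With this shape the conversion still happens, but the user sees it and can override it by declaring the column type differently.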

What do you think?

Duncan-Hunter · 2024-06-20T09:13:24Z

I like the idea of opening up the Method instance. If a method is created that isn't possible / works poorly, a warning or error should be raised. The way I think I would want it as a user is for there to be a few levels of granularity.

At the top you have the UnivariateDriftCalculator, which you can fit on your dataset without specifying data types or methods. It figures out which datatypes you have (which you already do to an extent), and if it should be categorical or continuous, then applies the default method.
Then you can specify categorical / continuous column names, and method names, as is currently done. This skips the automatic detection, and is explicit. Again, the Method would give warnings/raise errors if the user has done something they shouldn't. If the user hasn't specified a column, it's detected automatically.
Most granular, the user provides a dict of {str: List[Method]} that they've created. If they don't provide a method for a column, it's detected automatically.
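The three levels above could be resolved roughly as in the following sketch. Everything here is hypothetical: `plan_methods`, `detect_type` and the placeholder default method names are invented for illustration, and plain strings stand in for the Method instances mentioned above.

```python
from typing import Dict, List, Optional, Sequence

# Placeholder defaults per feature type; not the library's real defaults.
DEFAULT_METHODS = {
    "cont": ["default_continuous_method"],
    "cat": ["default_categorical_method"],
}


def detect_type(values: Sequence[object]) -> str:
    """Stand-in auto-detection: any non-numeric value makes the column categorical."""
    numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "cont" if numeric else "cat"


def plan_methods(
    data: Dict[str, Sequence[object]],
    continuous: Optional[List[str]] = None,
    categorical: Optional[List[str]] = None,
    methods: Optional[Dict[str, List[str]]] = None,
) -> Dict[str, List[str]]:
    """Pick methods per column, preferring the most explicit user input."""
    cont_names = set(continuous or [])
    cat_names = set(categorical or [])
    explicit = methods or {}
    plan: Dict[str, List[str]] = {}
    for name, values in data.items():
        if name in explicit:  # most granular: user-supplied methods
            plan[name] = list(explicit[name])
            continue
        if name in cont_names:  # middle level: user-declared type
            col_type = "cont"
        elif name in cat_names:
            col_type = "cat"
        else:  # top level: automatic detection
            col_type = detect_type(values)
        plan[name] = list(DEFAULT_METHODS[col_type])
    return plan
```

Columns the user says nothing about fall through to detection, so the more explicit levels only need to cover the exceptions.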

"So the calculator would have to take in a parameter stating for each column if it is continuous or categorical." I think the solution here is to have a good way of finding a default dtype + method without user input, which you're already most of the way there.

For now, a warning is fine. It'd also be nice if the UnivariateDriftCalculator could update its continuous/categorical columns post fitting: iterate over columns and, if the method is JS or Hellinger, check the _treat_as_type attribute and update accordingly. I'm on holiday from tomorrow and all of next week, so I might not update this PR soon.
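A rough sketch of that post-fit update, under stated assumptions: `FittedMethod` is a stand-in class, the method name strings and the `'cat'`/`'cont'` values are guesses, and only the `_treat_as_type` attribute name comes from the discussion above.

```python
from typing import Dict, List, Tuple


class FittedMethod:
    """Stand-in for a fitted drift method; not the library's class."""

    def __init__(self, name: str, treat_as_type: str) -> None:
        self.name = name
        self._treat_as_type = treat_as_type  # 'cat' or 'cont', set during fitting


def sync_column_types(
    fitted: Dict[str, List[FittedMethod]],
    continuous: List[str],
    categorical: List[str],
) -> Tuple[List[str], List[str]]:
    """Move columns that a method ended up treating as categorical."""
    continuous = list(continuous)  # copy so the caller's lists stay untouched
    categorical = list(categorical)
    for column, methods in fitted.items():
        for method in methods:
            # Only these methods are assumed to reinterpret the column type.
            if method.name not in ("jensen_shannon", "hellinger"):
                continue
            if method._treat_as_type == "cat" and column in continuous:
                continuous.remove(column)
                categorical.append(column)
    return continuous, categorical
```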

In my own work now, I'll implement a check to see if this behaviour is there, then just round the floats to their nearest bin.
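One way to read "round the floats to their nearest bin" is snapping each value to the nearest multiple of a fixed bin width. The sketch below assumes that reading; `snap_to_bins` and the fixed-width bins are illustrative choices, not something specified in this thread.

```python
from typing import List, Sequence


def snap_to_bins(values: Sequence[float], bin_width: float) -> List[float]:
    """Snap each value to the nearest multiple of bin_width."""
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    # round() picks the nearest integer number of bin widths.
    return [round(v / bin_width) * bin_width for v in values]


snapped = snap_to_bins([0.26, 1.1, 0.2], bin_width=0.5)
```

After snapping, the column genuinely has few distinct values, so treating it as categorical no longer changes its meaning.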

nnansters · 2024-07-05T09:24:29Z

Hey @Duncan-Hunter, I opened up a "replacement" PR for this. If you're good with it, we can close this one.

propose solution to remove hidden feature type conversion

6bd4795

Duncan-Hunter requested review from nnansters and nikml as code owners June 17, 2024 19:53

nikml mentioned this pull request Jun 19, 2024

change assumed treat_as_categorical #397

Closed

Duncan-Hunter closed this Jun 19, 2024

nnansters reopened this Jun 19, 2024

nnansters mentioned this pull request Jul 5, 2024

Explicit data type conversion for univariate drift #404

Merged

Duncan-Hunter closed this Jul 5, 2024
