Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

propose solution to remove hidden feature type conversion in Univariate Drift #398

Closed

Conversation

Duncan-Hunter
Copy link
Contributor

Converting a feature that is continuous to a categorical feature, even if there are few unique values, may lead to problems. I think it's better to not do this conversion without a warning, and better to not do it at all.

If a user wants a continuous column with few unique values, they can set it as a categorical column. Happy to discuss it, work through it, be convinced I'm wrong etc.

Related to #397 .

@nnansters
Copy link
Contributor

nnansters commented Jun 19, 2024

Hey Duncan, quickly wanted to chime in here.

I agree that having this conversion happen more explicitly would be better. We're always walking a fine line between making stuff easy and making it explicit. This should be explicit.

One thing holding us back here however is our current implementation that runs this single calculator on a whole set of columns, for a whole list of univariate drift methods. The changes you proposed definitely work on the Method level, but we don't expose that interface directly. Method instance are created within the calculator. So the calculator would have to take in a parameter stating for each column if it is continuous or categorical. As we might deal with potentially hundreds of columns, this might get cumbersome.

We would like to split everything up so you can just create and use Method instances directly. That way it would be a lot easier to specify what columns should be calculated by each. This change is linked to a big refactor we're designing, but we don't have a timeline for it yet.

Maybe for now we can make the conversion a bit more explicit, or at least not hidden away, by issuing a warning, as you did in your PR?

What do you think?

@nnansters nnansters reopened this Jun 19, 2024
@Duncan-Hunter
Copy link
Contributor Author

I like the idea of opening up the Method instance. If a method is created that isn't possible / works poorly, a warning or error should be raised. The way I think I would want it as a user is for there to be a few levels of granularity.

  • At the top you have the UnivariateDriftCalculator, which you can fit on your dataset without specifying data types or methods. It figures out which datatypes you have (which you already do to an extent), and if it should be categorical or continuous, then applies the default method.
  • Then you can specify categorical / continuous column names, and method names, as is currently done. This skips the automatic detection, and is explicit. Again, the Method would give warnings/raise errors if the user has done something they shouldn't. If the user hasn't specified a column, it's detected automatically.
  • Most granular, the user provides a dict of {str: List[Method]} that they've created. If they don't provide a method for a column, it's detected automatically.

"So the calculator would have to take in a parameter stating for each column if it is continuous or categorical." I think the solution here is to have a good way of finding a default dtype + method without user input, which you're already most of the way there.

For now, a warning is fine. It'd also be nice if the UnivariateDriftCalculator could update it's continuous/categorical columns post fitting, so iterate over columns, if its JS or Hellinger, check the _treat_as_type attribute and update accordingly. I'm on holiday from tomorrow and all of next week, so I might not update this PR soon.

In my own work now, I'll implement a check to see if this behaviour is there, then just round the floats to their nearest bin.

@nnansters
Copy link
Contributor

Hey @Duncan-Hunter , opened up a "replacement" PR for this. If you're good with it, we can close this one.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants