change assumed treat_as_categorical
#397
Comments
Hello Duncan, thank you for taking the time to report this issue. We introduced this treatment because, in our testing, some numerical features with a low number of unique values yielded suboptimal results with the continuous univariate drift methods. We saw that even if a variable is strictly continuous, if for some reason it contains only a few unique values, the categorical univariate drift methods described the observed drift more accurately. However, as with many things in data, this is situational and can be suboptimal in other cases. From your description, it looks like your dataset may be one of them. Could you share more about your dataset and how you used it (for example, which drift method yielded large drift scores), or write code that creates a similar synthetic, reproducible example? That would help us see whether our criterion for treating some variables as categorical could be updated to accommodate your use case, or whether it fails there completely. After that, it would be easier to consider how to update the library. I doubt, though, that we would want to completely remove the current behavior, as you recommend in #398.
That's an interesting example, we'll take a peek into that.
When using JS distance with UnivariateDriftCalculator with a small number of unique values in a continuous column, currently the library decides to treat it as categorical, which I can sort of understand? However, if the user knows that a feature is continuous and wants it to be treated as such, there is no option. The problem I'm seeing is that a small change in values in analysis leads to a large drift score, because these floats aren't equal to the categories.
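The effect described above can be reproduced without the library. The following is a minimal sketch in plain Python, not NannyML's implementation: the helper names (`js_distance`, `categorical_freqs`, `binned_freqs`), the synthetic data, and the fixed bin edges are all assumptions made for illustration. It compares the Jensen-Shannon distance when each distinct float is treated as its own category against the distance when the same values are binned as a continuous feature.

```python
import bisect
import math
from collections import Counter


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete
    distributions given as {key: probability} dicts."""
    keys = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2
    m = {k: 0.5 * (p.get(k, 0.0) + q.get(k, 0.0)) for k in keys}

    def kl(a):
        # Kullback-Leibler divergence KL(a || m), skipping zero-probability keys
        return sum(a[k] * math.log2(a[k] / m[k]) for k in a if a[k] > 0)

    return math.sqrt(max(0.0, 0.5 * kl(p) + 0.5 * kl(q)))


def categorical_freqs(values):
    """Relative frequency of each distinct value (categorical treatment)."""
    n = len(values)
    return {k: c / n for k, c in Counter(values).items()}


def binned_freqs(values, edges):
    """Relative frequency per bin (continuous treatment, hypothetical fixed edges)."""
    n = len(values)
    bins = Counter(bisect.bisect_right(edges, v) - 1 for v in values)
    return {k: c / n for k, c in bins.items()}


# Synthetic data: a continuous feature that happens to take only three values.
reference = [0.1, 0.2, 0.3] * 100
# The analysis period shifts every value by a tiny amount.
analysis = [v + 0.001 for v in reference]

# Categorical treatment: no analysis value equals a reference category,
# so the two distributions share no support.
d_categorical = js_distance(categorical_freqs(reference), categorical_freqs(analysis))

# Continuous treatment: the shifted values land in the same bins.
edges = [0.05, 0.15, 0.25, 0.35]
d_binned = js_distance(binned_freqs(reference, edges), binned_freqs(analysis, edges))

print(f"categorical: {d_categorical:.3f}, binned: {d_binned:.3f}")
```

With disjoint supports the categorical distance reaches its maximum, while the binned distance stays at zero because the shift is much smaller than the bin width.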
Describe the solution you'd like
If a column is not listed in `treat_as_categorical`, then don't assume it's categorical. Let the user decide. Maybe add something to the docs or a warning if the number of unique values is low.
Describe alternatives you've considered
Add a parameter to `UnivariateDriftCalculator` called `treat_as_continuous: List[str]` or `convert_continuous: bool` or something else, that when set is passed to each `Method`.
If an `np.number` type is passed to a method with `treat_as_categorical` set to True, round the data to the closest categories.
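The rounding alternative could look roughly like the sketch below. The function `snap_to_categories` is a hypothetical helper written for this illustration, not part of the library, and it assumes the reference categories are the distinct values seen in the reference data.

```python
def snap_to_categories(values, categories):
    """Map each numeric value to the nearest reference category.

    Ties go to the smaller category because the candidates are
    scanned in ascending order and min() keeps the first minimum.
    """
    cats = sorted(categories)
    return [min(cats, key=lambda c: abs(v - c)) for v in values]


reference_categories = [0.1, 0.2, 0.3]
analysis_values = [0.101, 0.199, 0.31]

snapped = snap_to_categories(analysis_values, reference_categories)
print(snapped)
```

One trade-off of this approach: a genuinely new value far from every category is silently mapped onto an existing one, so real drift could be hidden unless a maximum snapping distance is also enforced.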