change assumed treat_as_categorical #397

Closed
Duncan-Hunter opened this issue Jun 17, 2024 · 4 comments
Labels
enhancement New feature or request

Comments

@Duncan-Hunter
Contributor

When using JS distance with UnivariateDriftCalculator on a continuous column that has a small number of unique values, the library currently decides to treat it as categorical, which I can sort of understand. However, if the user knows that a feature is continuous and wants it treated as such, there is no option to do so. The problem I'm seeing is that a small change in the analysis values leads to a large drift score, because the shifted floats no longer match any of the reference categories.

Describe the solution you'd like

  • Simply don't do this conversion from continuous to categorical - if a column isn't in treat_as_categorical, then don't assume it's categorical. Let the user decide. Maybe add something to the docs or a warning if the number of unique values is low.

Describe alternatives you've considered

  • Add an argument to UnivariateDriftCalculator called treat_as_continuous: List[str], or convert_continuous: bool, or something else, that when set is passed to each Method.
  • If a np.number type is passed to a method with treat_as_categorical set to True, round the data to the closest categories.
@nikml
Contributor

nikml commented Jun 19, 2024

Hello Duncan,

Thank you for taking the time to report this issue. We introduced this treatment because, in our testing, some numerical features with a low number of unique values yielded suboptimal results when used with the continuous univariate drift methods. We saw that even if a variable is strictly continuous, if, for some reason, the number of unique values actually present in that variable is low, then categorical univariate drift methods describe the observed drift more accurately. However, as with many things in data, this is situational and could be suboptimal in other cases. From your description, it looks like you may have such a situation with your dataset. I wonder if you can share more about your dataset and how you used it (for example, which drift method yielded large drift scores), or write code that produces a similar synthetic, reproducible example. It would help us see whether our criterion for treating some variables as categorical could be updated to accommodate your use case, or whether it fails completely there. After that, it would be easier to consider how to update the library. I doubt, though, that we would want to completely remove the current behavior as you recommend in #398.

@Duncan-Hunter
Contributor Author

Hi, thanks for getting back to me.

That's a good reason for doing that, and yeah the user should consider using a categorical method. In this case (small number of unique values), does JS as a categorical method work well enough? Should the user be informed that other methods might be more appropriate?

I 100% think that there should be at least a warning during fitting that the feature is being treated as categorical.
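
A minimal sketch of what such a warning could look like. The helper name `warn_if_treated_as_categorical` and the `max_unique` cutoff are hypothetical and not part of NannyML; the library's actual criterion for switching to categorical may well differ.

```python
import warnings

import numpy as np


def warn_if_treated_as_categorical(values, column_name, max_unique=50):
    """Hypothetical helper: warn when a numeric column has few unique values.

    `max_unique` is an illustrative cutoff, not NannyML's actual criterion.
    Returns True when the warning was emitted, False otherwise.
    """
    n_unique = len(np.unique(np.asarray(values, dtype=float)))
    if n_unique <= max_unique:
        warnings.warn(
            f"Column '{column_name}' has only {n_unique} unique values and "
            "would be treated as categorical.",
            UserWarning,
            stacklevel=2,
        )
        return True
    return False
```

Calling something like this once per continuous column at fit time would at least tell the user which columns were switched, without changing any drift results.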

A potential fix in this scenario is to round the incoming floats to their closest bin value. It's not really ideal, but it can be done.
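
The rounding idea could be sketched roughly as follows. `snap_to_categories` is a hypothetical helper written for illustration, not an existing NannyML function, and it assumes the reference categories are known up front.

```python
import numpy as np


def snap_to_categories(values, categories):
    """Hypothetical helper: map each value to the nearest reference category."""
    values = np.asarray(values, dtype=float)
    categories = np.sort(np.asarray(categories, dtype=float))
    # Distance from every value to every category: shape (n_values, n_categories).
    distances = np.abs(values[:, None] - categories[None, :])
    # Pick, for each value, the category with the smallest distance.
    return categories[distances.argmin(axis=1)]


reference_categories = [5.0, 6.0, 7.0]
shifted = np.array([5.01, 6.01, 7.01, 5.01])
snapped = snap_to_categories(shifted, reference_categories)
print(snapped)
```

One reason this is not ideal: every incoming value is mapped onto some reference category, however far away it is, so values well outside the reference range would be silently absorbed as well.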

Here's a use case where I run into problems. I have a reference dataset with a small number of floating point values, and for the sake of argument, they've all been shifted by a tiny amount in analysis. The drift calculator then returns 1 for every chunk despite the change being very small. The change can be even smaller, of course, and still yield this result.

```python
from nannyml.drift import UnivariateDriftCalculator
import numpy as np
import pandas as pd

# Reference and analysis data: integers 5, 6, 7 drawn uniformly at random.
reference_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=10_000)})

analysis_data = pd.DataFrame(data={
    "x": np.random.randint(low=5, high=8, size=6_000)})

# Cast to float, then shift every analysis value by a tiny amount (0.01).
reference_data["x"] = reference_data["x"].astype(float)
analysis_data["x"] = np.clip(analysis_data["x"].astype(float) + 0.01, a_min=5, a_max=8)

calculator = UnivariateDriftCalculator(
    column_names=["x"],
    continuous_methods=['jensen_shannon'],
    chunk_size=1_000
)
calculator = calculator.fit(reference_data)
results = calculator.calculate(analysis_data)

# "x" is listed as continuous, yet the method internally treats it as categorical.
print("Continuous column names: ", calculator.continuous_column_names)
print(calculator._column_to_models_mapping['x'][0]._treat_as_type)
results.filter(period='analysis').to_df(multilevel=True)
```

Output:

```
Continuous column names:  ['x']
cat
```
image
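
To illustrate why the categorical treatment saturates here, the sketch below computes the Jensen-Shannon distance (base 2) by hand with NumPy, independently of NannyML. The `js_distance` helper, the fixed seed, and the hand-picked bin edges are assumptions made for illustration; how NannyML bins continuous data internally may differ. The `np.clip` call from the example above is omitted because it leaves these values unchanged.

```python
import numpy as np


def js_distance(p, q):
    """Jensen-Shannon distance (base 2) between two discrete distributions."""
    p = np.asarray(p, dtype=float) / np.sum(p)
    q = np.asarray(q, dtype=float) / np.sum(q)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0  # terms with a == 0 contribute nothing
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    divergence = 0.5 * kl(p, m) + 0.5 * kl(q, m)
    # Guard against tiny negative values caused by floating-point rounding.
    return float(np.sqrt(max(divergence, 0.0)))


rng = np.random.default_rng(0)
reference = rng.integers(5, 8, size=10_000).astype(float)
analysis = rng.integers(5, 8, size=6_000).astype(float) + 0.01

# Categorical view: every distinct float is its own category, so the
# reference values (5, 6, 7) and the shifted values (5.01, 6.01, 7.01)
# share no category at all.
labels = np.union1d(reference, analysis)
ref_counts = np.array([(reference == v).sum() for v in labels])
ana_counts = np.array([(analysis == v).sum() for v in labels])
categorical_js = js_distance(ref_counts, ana_counts)

# Continuous view: shared histogram bins (hand-picked here) absorb the shift.
edges = np.array([4.5, 5.5, 6.5, 7.5])
ref_hist, _ = np.histogram(reference, bins=edges)
ana_hist, _ = np.histogram(analysis, bins=edges)
binned_js = js_distance(ref_hist, ana_hist)

print("categorical:", categorical_js)
print("binned:", binned_js)
```

Because the two sets of categories are disjoint, the categorical distance sits at its maximum of 1 no matter how small the shift is, while the binned distance only reflects sampling noise between the two draws.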

@nnansters
Contributor

That's an interesting example, we'll take a peek into that.

@Duncan-Hunter
Contributor Author

#404

3 participants