Recommendations for categorical variables with many distinct values? #9
Comments
Thanks for giving us your feedback! This |
My current value for |
The code does not generate a big table for all categorical attributes. Instead, it generates a conditional probability table for each parents-child attribute pair. Assume all categorical attributes in your dataset have the same domain size 200. When there are two parents for an attribute, its conditional probability table will cost So |
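To make the arithmetic in the reply above concrete, here is a minimal sketch of how the size of one conditional probability table grows with the number of parents. The function names and the 8-bytes-per-entry figure are assumptions made for illustration; the thread does not say how the library actually stores its tables.

```python
from math import prod


def cpt_entries(child_domain_size, parent_domain_sizes):
    """Count the cells in a table for P(child | parents).

    There is one cell per combination of child value and parent values,
    so the count is the product of all the domain sizes involved.
    """
    return child_domain_size * prod(parent_domain_sizes)


def cpt_bytes(child_domain_size, parent_domain_sizes, bytes_per_entry=8):
    """Rough memory estimate, assuming one 8-byte float per cell (an assumption)."""
    return cpt_entries(child_domain_size, parent_domain_sizes) * bytes_per_entry


# The scenario from the reply: every attribute has domain size 200,
# and the child attribute has two parents.
entries = cpt_entries(200, [200, 200])
print(entries)                               # 8000000
print(cpt_bytes(200, [200, 200]) / 1e6)      # 64.0 (megabytes)
```

Each extra parent multiplies the table size by that parent's domain size, so under these assumptions a third parent of size 200 would raise the estimate from about 64 MB to about 12.8 GB.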
Thanks! I'll try it, but it turns out that for some categorical columns the number of distinct values is almost equal to the number of rows. I'm testing on 10k rows, but in production the full data is 3M rows. |
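One way to see which columns are in that situation is to compare each column's distinct-value count to the row count. The sketch below uses plain Python on a list of row dictionaries; the helper names, the sample rows, and the 0.9 threshold are all hypothetical and not part of the library.

```python
def distinct_ratio(rows, column):
    """Fraction of rows that carry a distinct value in `column` (1.0 = all unique)."""
    values = [row[column] for row in rows]
    if not values:
        return 0.0
    return len(set(values)) / len(values)


def near_unique_columns(rows, threshold=0.9):
    """Return the columns whose distinct-value ratio is at or above `threshold`."""
    if not rows:
        return []
    return [col for col in rows[0] if distinct_ratio(rows, col) >= threshold]


# Hypothetical sample data, using column names mentioned in this issue.
rows = [
    {"PlaceOfBirth": "Lyon, France", "Citizenship": "FR"},
    {"PlaceOfBirth": "Paris, France", "Citizenship": "FR"},
    {"PlaceOfBirth": "Turin, Italy", "Citizenship": "IT"},
    {"PlaceOfBirth": "Milan, Italy", "Citizenship": "IT"},
]

print(distinct_ratio(rows, "PlaceOfBirth"))  # 1.0
print(distinct_ratio(rows, "Citizenship"))   # 0.5
print(near_unique_columns(rows))             # ['PlaceOfBirth']
```

A ratio close to 1.0 means the column's domain size grows with the data, so a check on a 10k-row sample may understate the domain size at 3M rows.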
I have data with a dozen or so columns that contain categorical variables with many distinct values. When I try to run in correlated attribute mode, line 162 in PrivBayes.py results in MemoryError: I run out of RAM. Here's the line:
Without knowing much about the workings of the code, it looks like it's taking the Cartesian product over all the columns in my data so that DataDescriber can learn the joint distributions of the subsets of (categorical) columns selected by the Bayesian network finding routine. Some of those columns have thousands of distinct values, so the line above tries to generate massive tables that max out RAM. That's just a guess.
Other than running in independent attribute mode, do you have any recommendations for how to proceed when data contains many categorical variables with many distinct values? It's often not possible to merge values by binning them into a smaller number of distinct values. For example, my PlaceOfBirth column contains (city, country) combinations that result in a huge number of values. It's not meaningful to separate the city and country information or to bin the data, and I'd like to retain the relationships to other columns, such as Passport or Citizenship, because they are meaningful. This is just one example of a general problem that I've encountered; one might often encounter data like this.
Is there a technical workaround, or some best practice that you can recommend for dealing with this situation? How can I use your software optimally? Is anything documented?