Numpy Datatypes in Conditional Distributions of Description File #44

Open
CodingDepot opened this issue Dec 17, 2024 · 0 comments

  • DataSynthesizer version: 0.1.13
  • Python version: 3.9
  • Operating System: Windows 11

Description

Creating a synthetic dataset from the Kaggle adult census dataset (with the fnlwgt column removed) in correlated attribute mode fails: the generator cannot parse the description file.

The cause seems to be L281 of PrivBayes.py:

parents_key = str([parents_instance]) if len(parents) == 1 else str(list(parents_instance))

When an attribute has more than one parent, this writes integer values into the key as np.int64(0) instead of plain 0. This in turn causes L99 of DataGenerator.py to fail, because the name np is not defined in that module when the key is passed to eval:

parents_instance = list(eval(parents_instance))

I was able to fix it locally by adding import numpy as np to DataGenerator.py, but it would probably be cleaner to write plain Python ints into the description file in the first place.
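A minimal sketch of what writing plain ints could look like, based on the key construction quoted above. The helper name `make_parents_key` and its `num_parents` argument are hypothetical and not part of DataSynthesizer, and the sketch assumes parent values are always integer-encoded, as they are in the description file excerpt. `int()` accepts NumPy integer scalars as well as built-in ints, so the resulting string no longer depends on how NumPy prints its scalars.

```python
def make_parents_key(parents_instance, num_parents):
    """Build a description-file key such as "[0]" or "[0, 1]".

    Hypothetical helper, not part of DataSynthesizer. Assumes every parent
    value is an integer (built-in int or a NumPy integer scalar).
    """
    # Mirror the original branching: a single parent arrives as a scalar,
    # several parents arrive as an iterable of values.
    values = [parents_instance] if num_parents == 1 else list(parents_instance)
    # Convert to built-in ints so str() emits "0" rather than "np.int64(0)".
    return str([int(value) for value in values])
```

With this, a two-parent instance would produce a key like "[0, 1]", which the existing eval call can read without any numpy import.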

The relevant section of the description file:

"conditional_probabilities": {
        "income": [
            0.6269945618560558,
            0.37300543814394416
        ],
        "relationship": {
            "[0]": [
                0.31958572087575393,
                0.26864155111683646,
                0.062246949021475276,
                0.17143605132431283,
                0.1260161383716099,
                0.05207358929001161
            ],
            "[1]": [
                0.4276133198945046,
                0.16299606959384128,
                0.027753228447322167,
                0.17927266942607956,
                0.12404103847621967,
                0.07832367416203281
            ]
        },
        "sex": {
            "[np.int64(0), np.int64(0)]": [
                0.11899038829847323,
                0.8810096117015268
            ],
            "[np.int64(0), np.int64(1)]": [
                0.1370384306577154,
                0.8629615693422846
            ],
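For description files that already contain keys like the ones in the excerpt above, the reading side could be made tolerant instead. The following is only a sketch under assumptions: `parse_parents_key` is a hypothetical helper, not DataSynthesizer code, and it assumes each NumPy wrapper has the form np.<type>(<literal>) with no nested parentheses. It strips the wrapper with a regular expression and then uses `ast.literal_eval`, which only accepts Python literals, in place of `eval`.

```python
import ast
import re

# Matches NumPy scalar reprs such as np.int64(3) and captures the inner literal.
_NP_SCALAR = re.compile(r"np\.\w+\(([^()]*)\)")


def parse_parents_key(key):
    """Turn a key like "[np.int64(0), np.int64(1)]" or "[0, 1]" into a list.

    Hypothetical helper, not part of DataSynthesizer.
    """
    # Replace each np.<type>(<literal>) wrapper with the literal it contains.
    plain = _NP_SCALAR.sub(r"\1", key)
    # literal_eval parses literals only, so no names such as np are looked up.
    return list(ast.literal_eval(plain))
```

This handles both key styles from the excerpt, so old and new description files would parse the same way.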

What I Did

Python script:

import os.path

import pandas as pd
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

from generators.generator import Generator

class PrivBayesGenerator(Generator):
    def generate(self, rows: int=None):
        input_data = str(self.real_data_path)
        description_file = str(self.real_data_path.parent / 'description.json')
        synthetic_data = self.synthetic_data_path

        epsilon = 0.1
        if rows is None:
            rows = pd.read_csv(input_data).shape[0]
        threshold_value = 50
        num_tuples_to_generate = rows

        # Describe Dataset
        if not os.path.exists(description_file):
            describer = DataDescriber(category_threshold=threshold_value)
            describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon=epsilon)
            describer.save_dataset_description_to_file(description_file)

        # Generate Synthetic Data
        generator = DataGenerator()
        generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
        generator.save_synthetic_data(synthetic_data)

Traceback:

Traceback (most recent call last):
  File "D:\...\helpers\generate_main.py", line 27, in <module>
    main()
  File "D:\...\helpers\generate_main.py", line 21, in main
    generator.generate(rows)
  File "D:\...\generators\priv_bayes_generator.py", line 35, in generate
    generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 66, in generate_dataset_in_correlated_attribute_mode
    self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 100, in generate_encoded_dataset
    parents_instance = list(eval(parents_instance))
  File "<string>", line 1, in <module>
NameError: name 'np' is not defined