Numpy Datatypes in Conditional Distributions of Description File #44

CodingDepot · 2024-12-17T14:45:26Z

DataSynthesizer version: 0.1.13
Python version: 3.9
Operating System: Windows 11

Description

Trying to create a synthetic dataset from the Kaggle adult census dataset (with the fnlwgt column removed) in the correlated attribute mode results in the generator failing to parse the description file.

The cause seems to be line 281 of PrivBayes.py:

parents_key = str([parents_instance]) if len(parents) == 1 else str(list(parents_instance))

This writes integer parent values as np.int64(0) instead of plain 0 when an attribute has more than one parent. That in turn causes line 99 of DataGenerator.py to fail, because the name np is not defined in that module:

parents_instance = list(eval(parents_instance))

I was able to fix it locally by adding import numpy as np to DataGenerator.py, but it would probably be cleaner to write plain Python ints into the description file in the first place.
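As a rough sketch of that cleaner option: str() of a list uses repr() of each element, and NumPy 2.0 and later print scalars as np.int64(0), so converting NumPy scalars to native Python values before building the key should avoid the problem. The helper names to_native and make_parents_key are hypothetical and not part of DataSynthesizer; the branching only mirrors the quoted line from PrivBayes.py.

```python
import numpy as np


def to_native(value):
    """Return a plain Python scalar for NumPy scalars; leave other values unchanged."""
    # np.generic is the base class of all NumPy scalar types (np.int64, np.float64, ...).
    return value.item() if isinstance(value, np.generic) else value


def make_parents_key(parents_instance, num_parents: int) -> str:
    """Hypothetical replacement for the key-building line quoted above."""
    if num_parents == 1:
        # Single parent: the instance is one scalar, so wrap it in a list.
        return str([to_native(parents_instance)])
    # Several parents: the instance is a sequence of scalars.
    return str([to_native(v) for v in parents_instance])


key = make_parents_key((np.int64(0), np.int64(1)), num_parents=2)
print(key)
```

With keys written this way, the description file would contain "[0, 1]" regardless of the installed NumPy version, and the generator would not need np in scope to read it back.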

The relevant section of the description file:

"conditional_probabilities": {
        "income": [
            0.6269945618560558,
            0.37300543814394416
        ],
        "relationship": {
            "[0]": [
                0.31958572087575393,
                0.26864155111683646,
                0.062246949021475276,
                0.17143605132431283,
                0.1260161383716099,
                0.05207358929001161
            ],
            "[1]": [
                0.4276133198945046,
                0.16299606959384128,
                0.027753228447322167,
                0.17927266942607956,
                0.12404103847621967,
                0.07832367416203281
            ]
        },
        "sex": {
            "[np.int64(0), np.int64(0)]": [
                0.11899038829847323,
                0.8810096117015268
            ],
            "[np.int64(0), np.int64(1)]": [
                0.1370384306577154,
                0.8629615693422846
            ],

What I Did

Python script:

import os.path

import pandas as pd
from DataSynthesizer.DataDescriber import DataDescriber
from DataSynthesizer.DataGenerator import DataGenerator

from generators.generator import Generator

class PrivBayesGenerator(Generator):
    def generate(self, rows: int=None):
        input_data = str(self.real_data_path)
        description_file = str(self.real_data_path.parent / 'description.json')
        synthetic_data = self.synthetic_data_path

        epsilon = 0.1
        if rows is None:
            rows = pd.read_csv(input_data).shape[0]
        threshold_value = 50
        num_tuples_to_generate = rows

        # Describe Dataset
        if not os.path.exists(description_file):
            describer = DataDescriber(category_threshold=threshold_value)
            describer.describe_dataset_in_correlated_attribute_mode(input_data, epsilon=epsilon)
            describer.save_dataset_description_to_file(description_file)

        # Generate Synthetic Data
        generator = DataGenerator()
        generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
        generator.save_synthetic_data(synthetic_data)

Traceback:

Traceback (most recent call last):
  File "D:\...\helpers\generate_main.py", line 27, in <module>
    main()
  File "D:\...\helpers\generate_main.py", line 21, in main
    generator.generate(rows)
  File "D:\...\generators\priv_bayes_generator.py", line 35, in generate
    generator.generate_dataset_in_correlated_attribute_mode(num_tuples_to_generate, description_file)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 66, in generate_dataset_in_correlated_attribute_mode
    self.encoded_dataset = DataGenerator.generate_encoded_dataset(self.n, self.description)
  File "D:\...\venv3.9\lib\site-packages\DataSynthesizer\DataGenerator.py", line 100, in generate_encoded_dataset
    parents_instance = list(eval(parents_instance))
  File "<string>", line 1, in <module>
NameError: name 'np' is not defined
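For description files that already contain keys such as "[np.int64(0), np.int64(1)]", one possible reader-side workaround is to strip the NumPy wrappers and parse the result with ast.literal_eval instead of eval. This is only a sketch, not what DataSynthesizer does: parse_parents_key is a hypothetical helper, and the regular expression assumes simple scalar arguments without nested parentheses.

```python
import ast
import re

# Matches wrappers like np.int64(0) or np.float64(1.5) and captures the inner literal.
_NUMPY_SCALAR = re.compile(r"np\.\w+\(([^()]*)\)")


def parse_parents_key(key: str) -> list:
    """Turn a key such as '[np.int64(0), np.int64(1)]' into [0, 1]."""
    # Replace each np.<type>(<literal>) with just <literal>.
    plain = _NUMPY_SCALAR.sub(r"\1", key)
    # literal_eval only accepts Python literals, so it never executes arbitrary code.
    return list(ast.literal_eval(plain))


print(parse_parents_key("[np.int64(0), np.int64(1)]"))
print(parse_parents_key("[0]"))
```

Keys that were already written with plain ints pass through unchanged, so the same helper would handle both old and new description files.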
