PrivSyn: Differentially Private Data Synthesis

This code has been published as an implementation of PrivSyn: Differentially Private Data Synthesis for a course project.

To run the code, one has to perform two major steps,

Dataset Preparation
Notebook Preparation

Dataset Preparation

This task is not covered within the code and has to be performed manually specific to each dataset beforehand. There are five datasets used within the code 'Polish', 'UCI Adult', 'Covid Symptoms', 'Diabetes' and 'US Accidental Drug Deaths'. These may be referenced for formatting. Essentially, each dataset needs the corresponding csv file, and three configuration related files. These configuration files are necessary for data loading and postprocessing after the algorithm is run. Each of the configuration files is used in the code as follows,

loading_data.json is used for an overview of each attribute
data.yaml is used to handle identifier attributes, numeric attributes (binning) and also grouping attributes (which is implemented but commented out as it seldom leads to errors)
column_info.json is used for datatypes and nan-value handling

Notebook Preparation

Again, the sample notebooks for three datasets have been attached. In those files, primarily the Dataloader class call and Anonymiser class call need to be modified in accordance with the appropriate file paths and epsilon/delta for differential privacy.

Code Structure

The flow of code is as follows,

Dataset-related files are passed to dataloading which first converts them to an appropriate form, such as dealing with missing, categorial and continuous values/attributes, and then calculates the indifference metric for all pairs of attributes (and all inherently included pre-processing such as one-way and two-way marginals).
The next function (Anonymiser) applies noising to the dataset, which includes one-way marginal noising by the Gaussian Mechanism, two-way marginal selection (DenseMarg), marginal combination and two/multi-way marginal noising by the Gaussian Mechanism
The next call is to Consistenter which is used to consist the marginals so as to satisfy the requirements of a probability distribution using Norm-Sub for l_2 consistency
Thereafter, a call is made to the Gradually Update Method (GUM) which also has a subfolder. The subfolder consists of View.py (which is a class defined for ease of processing), ViewConsistenter.py (which is defined for a second iteration of marginal consistenting before GUM is called) and RecordSynthesiser.py which is the main call to the GUM method
Finally, PostProcessor.py converts the pandas dataset back to the same format in which the data was input (such as unbinning continuous attributes or decoding categorical attributes) Other files by the name of tempAnonymisation.py and main.py are present as initial versions of the code which were then removed in favour of Anonymisation.py and using separate .ipynb notebooks for each dataset (as opposed to a command-line interface).

Requirements

The following libraries are required for the functioning of the code,

Pyyaml
Numpy
Pandas
Networkx
Scikit-Learn

The code was run a MacOS with osx-arm64 compiler

Acknowledgements

A code implementation of DPSyn, an algorithm similar to PrivSyn and developed by the same authors, is also available on Github at this link

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
AccidentalDrugDeaths		AccidentalDrugDeaths
CovidSymptoms		CovidSymptoms
Diabetes		Diabetes
GraduallyUpdateMethod		GraduallyUpdateMethod
Polish		Polish
UCIMLAdult		UCIMLAdult
Unused		Unused
.DS_Store		.DS_Store
AccidentalDrugDeaths.ipynb		AccidentalDrugDeaths.ipynb
Anonymisation.py		Anonymisation.py
Consistenter.py		Consistenter.py
CovidSymptoms.ipynb		CovidSymptoms.ipynb
DataLoader.py		DataLoader.py
Diabetes.ipynb		Diabetes.ipynb
GUM.py		GUM.py
Polish.ipynb		Polish.ipynb
PostProcessor.py		PostProcessor.py
PrivSyn Report.pdf		PrivSyn Report.pdf
README.md		README.md
UCIMLAdult.ipynb		UCIMLAdult.ipynb

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PrivSyn: Differentially Private Data Synthesis

Dataset Preparation

Notebook Preparation

Code Structure

Requirements

Acknowledgements

About

Releases

Packages

Languages

sukjingitsit/PrivSyn

Folders and files

Latest commit

History

Repository files navigation

PrivSyn: Differentially Private Data Synthesis

Dataset Preparation

Notebook Preparation

Code Structure

Requirements

Acknowledgements

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages