Python pipeline to perform data synthesis, aggregation and evaluation.
To run the pipeline, you will need to build the python bindings for the sds core library. To do so, please follow the steps described here.
Make sure you are in the root of the sds python package (packages/python-pipeline) and run the following command to install the required python dependencies:
> pip install -r requirements.txt
The pipeline is controlled via a json config file containing a variety of parameters, as shown below with illustrative values:
{
"sensitive_microdata_path": "./secret_vices.csv",
"sensitive_microdata_delimiter": ",",
"use_columns": [],
"record_limit": -1,
"sensitive_zeros": [],
"reporting_resolution": 10,
"reporting_length": 5,
"seeded": true,
"parallel_jobs": 8,
"memory_limit_pct": 95,
"cache_max_size": 100000,
"output_dir": "./vices_output",
"prefix": "vices",
"report_title": "Secret Vices Dataset",
"report_visuals": {
"vices": ["chocolate:1", "beer:1", "napping:1", ...],
...
},
"report_pages": {
"By age/gender": ["age", "gender", "vices"],
"By city/job": ["city", "job", "vices"],
...
}
}
Use of each of these parameters is described in the following sections.
The data at sensitive_microdata_path
should be in comma separated values (.csv) or tab separated values (.tsv) format, with the sensitive_microdata_delimiter
set accordingly (e.g., ","
or "\t"
).
The pipeline assumes deidentified microdata as input, i.e., a table in which each row contains all data (but no personally-identifiable information, PII) relating to an individual.
Single-valued attributes (e.g., gender) are represented as columns of categorical variables whose values are shared by multiple rows. Any continuous numeric variables (e.g., age) should be quantized in advance (e.g., into age ranges) to ensure there are sufficient instances of each value.
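For example, a continuous age column could be binned into ranges before running the pipeline. The following is a minimal sketch using pandas; the column name, bin edges, and file names are illustrative assumptions, not part of the pipeline itself:
import pandas as pd

# Hypothetical pre-processing step: bin a continuous "age" column into
# decade ranges so that each categorical value occurs in many rows.
df = pd.read_csv("secret_vices.csv")
bins = [0, 20, 30, 40, 50, 60, 120]
labels = ["0-19", "20-29", "30-39", "40-49", "50-59", "60+"]
df["age"] = pd.cut(df["age"], bins=bins, labels=labels, right=False)
df.to_csv("secret_vices_binned.csv", index=False)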
Multi-valued attributes (e.g., interests) are represented as multiple columns of binary variables (integer values of 0
and 1
) indicating the different values of that attribute (e.g., food, sports, politics).
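If the source data stores such an attribute as a single delimited column, it can be expanded into binary indicator columns beforehand. This sketch assumes a hypothetical "interests" column delimited by semicolons; adapt the names and delimiter to your data:
import pandas as pd

# Hypothetical pre-processing step: expand a multi-valued "interests"
# column (e.g., "food;sports") into one 0/1 column per value.
df = pd.read_csv("secret_vices.csv")
indicators = df["interests"].str.get_dummies(sep=";")  # columns: food, sports, politics, ...
df = pd.concat([df.drop(columns=["interests"]), indicators], axis=1)
df.to_csv("secret_vices_expanded.csv", index=False)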
The use_columns
parameter may be used to specify which data columns at sensitive_microdata_path
should be included in the output. An empty list []
indicates that all columns should be used.
Similarly, record_limit
may be used to limit data synthesis to the specified number of records, taken from the start of the sensitive data. A value of -1
indicates that all sensitive records should be modelled and synthesized.
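For example, to model only three columns and the first 1000 records, the config might contain the following (illustrative column names and values):
"use_columns": ["age", "gender", "city"],
"record_limit": 1000,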
The pipeline distinguishes 'positive' attribute values that indicate the presence of specific sensitive data from 'negative' attribute values that indicate the absence of such data. By default, the integer zero (0
) and the empty string (""
) are not taken into account when creating and counting attribute combinations. Any columns where zero values are of interest (and thus sensitive) should be listed in sensitive_zeros
. This pipeline treats such sensitive_zeros
in the same way as positive values.
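For example, if a hypothetical employed column used 0 to indicate unemployment and that fact were itself sensitive, the config would list it as follows (illustrative column name):
"sensitive_zeros": ["employed"]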
To complement the synthetic microdata, the pipeline also precomputes reportable counts of sensitive records containing all short combinations of attributes. The privacy risk with such aggregate data is that small aggregate counts may identify specific groups of individuals, while precise counts may allow the detection of small differences over time. The pipeline thus protects the reported aggregate counts by rounding counts down to the closest multiple of the specified reporting_resolution
. The reporting_resolution
therefore acts as both the minimum threshold for reporting and the minimum difference between reported counts.
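A minimal sketch of this rounding behaviour is shown below. It is illustrative only, not the library's actual code, but it captures the rule that counts below the resolution are suppressed and all others are rounded down:
# Counts are rounded down to the nearest multiple of reporting_resolution;
# anything below the resolution is not reported at all.
def report_count(count, reporting_resolution=10):
    rounded = (count // reporting_resolution) * reporting_resolution
    return rounded if rounded >= reporting_resolution else None  # None => suppressed

# report_count(27) -> 20, report_count(30) -> 30, report_count(9) -> None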
The reporting_length
determines the maximum length of attribute combination for which aggregate counts are precomputed and reported. In the user interface, this value determines how many attribute value selections a user may make while retaining the ability to compare estimated (synthetic) vs actual values. The number of selections is always one less than the reporting_length
. Specifying a reporting_length
of -1
indicates that combinations of all lengths should be computed. This is not recommended except for small or sparse datasets as the number of attribute combinations grows rapidly with their length.
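To see why computing combinations of all lengths is discouraged, note how quickly the number of possible attribute combinations grows with length. This illustrative calculation assumes a dataset with 20 attribute columns:
from math import comb

# Number of distinct column combinations of each length, for 20 columns.
for length in range(2, 6):
    print(length, comb(20, length))  # 2 -> 190, 3 -> 1140, 4 -> 4845, 5 -> 15504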
The seeded
parameter indicates whether synthetic records should be seeded with a corresponding sensitive record (true
) or generated in an unseeded way by randomly sampling joint attribute distributions (false
). Seeded synthesis is faster and better preserves statistics for visual analytics, but unseeded synthesis creates longer records of more uniform length that may better preserve structure for machine learning.
Seeded synthesis proceeds by sampling attributes from a sensitive record until the addition of further attributes would create a rare combination based on the reporting_resolution
. These privacy-preserving subsets of sensitive records are collected for output as synthetic records. The unused attributes of each seed are also collected, with further output records synthesized from these (without replacement) until all sensitive attributes are accounted for in a synthetic record.
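The following is a highly simplified sketch of the seeded sampling idea described above; it is not the sds library's implementation, and count_matching is a hypothetical helper that returns how many sensitive records contain a given attribute combination:
def sample_from_seed(seed_attrs, count_matching, reporting_resolution):
    """Keep adding the seed's attributes while the combination stays non-rare."""
    kept = []
    for attr in seed_attrs:
        if count_matching(kept + [attr]) >= reporting_resolution:
            kept.append(attr)
    unused = [a for a in seed_attrs if a not in kept]
    return kept, unused  # unused attributes seed further synthetic records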
Since precise attribute counts create a privacy risk, it is advisable to create some uncertainty over the actual counts by adding noise to the synthetic data. The same reporting_resolution
used to create aggregate counts is used again here to suppress attributes or synthesize additional records such that synthetic attribute counts are equal to the (already imprecise) reported count.
Synthetic records are sorted by number of non-empty attribute values prior to output in a way that intermixes partly-suppressed and wholly-synthetic records.
The parallel_jobs
parameter specifies the degree of parallel processing. For local processing, this should be set to the number of available CPU cores. For faster processing of larger and more complex datasets, use of a virtual machine with multiple cores is recommended.
For unseeded generation, the memory_limit_pct
parameter sets the percentage utilization of system memory at which synthetic data generation will stop adding computed counts to a cache. For seeded generation, on the other hand, the cache_max_size
parameter sets the size of the cache used to store attribute combination counts during the synthesis process. The higher the value, the more memory will be used. The default value is 100000
which is generally a good balance between performance and memory usage, but this can be tuned depending on the input dataset.
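For example, on an 8-core machine where memory is constrained, these performance parameters might be tuned as follows (illustrative values only):
"parallel_jobs": 8,
"memory_limit_pct": 80,
"cache_max_size": 50000,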
Output files are saved to the output_dir
directory and prefixed with the prefix
string. The json config file used to generate the outputs is also copied to this directory as a record of the parameters used, and should therefore be stored outside output_dir
.
The showcase interface is created as a Power BI report that may be opened, explored, and shared using the free Power BI Desktop application. The Power BI report displays aggregated views of the sensitive data.
The report_title
parameter specifies the user-facing title shown within the Power BI report.
Related attributes spanning multiple columns in the sensitive dataset may be grouped together in a single visual by specifying report_visuals
as a new configuration attribute. This contains a mapping from the name of the visual to a list of column:value
pairs. The column:value
pairs are combined into a single visual with the given name. In the example below the visual name is vices
and it combines the list of column:value
pairs that follow. Up to 10 such visuals may be created.
"report_visuals": {
"vices": ["chocolate:1", "beer:1", "napping:1", ...],
...
}
The report_pages
configuration allows you to specify which visuals to display and the order in which to display them. Custom visuals created via report_visuals
can be used in the report_pages
configuration section. The report_pages
allows the specification of up to 16 attribute visuals per named page (up to 4 pages) as follows:
"report_pages": {
"By age/gender": ["age", "gender", "vices"],
"By city/job": ["city", "job", "vices"],
...
}
python showcase.py <config_path> --verbose | --v
Runs the complete pipeline using the specified json config file. To run the pipeline for individual stages, see the sections below. Use either form of the verbose flag for detailed output logs.
Multiple stages may also be specified, with any missing stage inputs resulting in prior stages being executed automatically. The complete pipeline can be executed with all stages specified as follows:
python showcase.py <config_path> --v --aggregate --generate --evaluate --navigate
It is recommended to begin with a smaller dataset (in terms of both rows and columns) and reporting length before scaling up based on performance (in terms of privacy, utility, and time).
python showcase.py <config_path> --aggregate | --agg
Generates the reportable_aggregates
tsv file containing precomputed and protected counts of all sensitive attribute combinations up to reporting_length
in length, as well as a sensitive_aggregates
tsv file storing the actual counts and a sensitive_aggregated_data
json file containing serialized information about the sensitive aggregates. These files are used in the --evaluate
pipeline stage to avoid recomputing combinations, and may be used to confirm actual values. Since these files are highly sensitive, they should be protected in the same way as the original microdata.
Additional outputs of this stage are tsv and svg summaries of sensitive_rare_by_length
– how many sensitive attribute combinations exist up to reporting_length
and what proportion of these are rare, i.e., occurring with a frequency below reporting_resolution
.
python showcase.py <config_path> --generate | --gen
Generates the synthetic_microdata
tsv file containing synthetic microdata representing the structure and statistics of data at sensitive_microdata_path
, without leaking any attribute combinations that are rare in the sensitive data.
python showcase.py <config_path> --evaluate | --eval
Compares the synthetic_microdata
to the sensitive_microdata
in terms of the proportion of sensitive combination counts that are preserved by the synthetic data (up to reporting_length
in length). Reads from the sensitive_aggregates
tsv file if available.
Outputs of this stage are tsv and svg summaries of:
synthetic_leakage_by_length – how many synthetic attribute combinations exist up to reporting_length and what proportion of these leak rare combinations from the sensitive data (by virtue of the synthesis process, this is guaranteed to be zero across all combination lengths).
synthetic_preservation_by_length – how many synthetic records are filtered on average for each combination length up to reporting_length and what proportion of the corresponding sensitive count is captured by the synthetic count on average (longer combinations are naturally more rare, so result in more loss / less preservation).
synthetic_preservation_by_count – how many synthetic attributes are needed on average to give a range of aggregate counts (shown on a log scale with labels representing the upper bound of the bin) and what proportion of the corresponding sensitive count is captured by the synthetic count on average (smaller counts are by definition more rare, so result in more loss / less preservation).
python showcase.py <config_path> --navigate | --nav
Creates the data_showcase.pbit
Power BI template file combining both synthetic and aggregate data. Open this file in Power BI Desktop and, when prompted, enter the data path as the absolute path of the output_dir
folder specified in the json config file. This will load the output data files into the template, which may then be saved and shared as a .pbix
report file.
python showme.py
Runs the complete pipeline on a single core for a small dataset (1000 rows) with many unique combinations (i.e., a very challenging dataset to share while preserving privacy).
Pipeline outputs are saved to the newly-created ./german_credit_data
directory and based on the Statlog (German Credit Data) dataset published by UCI here with source credit to:
Professor Dr. Hans Hofmann
Institut für Statistik und Ökonometrie
Universität Hamburg
FB Wirtschaftswissenschaften
Von-Melle-Park 5
2000 Hamburg 13
A sample json config file used to create this example is also saved to the current working directory for reference.