Disclosure risk metric planning / tracking #87

Open
7 of 15 tasks
jhseeman opened this issue Jul 11, 2024 · 2 comments
Comments

@jhseeman
Collaborator

jhseeman commented Jul 11, 2024

Disclosure risk metrics planning

This issue will be used to plan updates to the disclosure risk metrics in syntheval.

Confidential data baseline assessments

  • Methods for identifying existing confidential records with high disclosure risk (edit: now in disc_baseline.R)
  • Methods for identifying arbitrary records worth evaluating in holdouts (edit: deferred to 0.0.5)
    • disc_baseline_lra(conf_tables): linear reconstruction attack from a collection of count tables (link)
    • disc_baseline_make_canaries(conf_data): create artificial high-risk records ("canaries") for holdout data (link)
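
To make the canary idea concrete, here is a minimal sketch in Python rather than R. It is not syntheval's API: the function name make_canaries, the list-of-dicts data layout, and the "rarest value per column" heuristic are all assumptions chosen for illustration. The linked approach may build canaries differently.

```python
from collections import Counter


def make_canaries(records, columns, n=1):
    """Build n artificial records from the rarest observed values per column.

    Hypothetical heuristic: combining rare values tends to produce records
    that stand out from the rest of the data, which is the point of a canary.
    """
    if not records:
        raise ValueError("records must not be empty")

    # Rank each column's values from rarest to most common.
    # Ties are broken by the string form of the value so output is deterministic.
    ranked = {}
    for col in columns:
        counts = Counter(row[col] for row in records)
        ordered = sorted(counts.items(), key=lambda kv: (kv[1], str(kv[0])))
        ranked[col] = [value for value, _ in ordered]

    # The i-th canary takes the i-th rarest value in every column,
    # wrapping around when a column has fewer distinct values than n.
    return [
        {col: ranked[col][i % len(ranked[col])] for col in columns}
        for i in range(n)
    ]
```

A canary built this way may or may not already exist in the confidential data, so a real implementation would probably want to check for collisions before adding it to a holdout set.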

Membership inferences from synthetic data

  • Quasi-identifier probabilistic membership inference (edit: added in disc_qid_mi.R)
    • Partition selection probabilities from multiple replicates
    • Membership empirical intervals from multiple replicates
  • Membership inference updates for arbitrary holdouts (link)
    • disc_mit(...) updates for multiple synthetic data replicates
    • disc_mit(...) updates for disaggregated records
    • disc_mit(...) updates for mechanism adaptivity (edit: deferred to 0.0.5)
  • Linkage attacks (edit: deferred to 0.0.5)
    • disc_linkage_recon(synth_data, recon): Linkage attack from synthetic data and partial reconstruction
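
For the quasi-identifier item above, one possible reading of "partition selection probabilities from multiple replicates" is the share of synthetic replicates in which each quasi-identifier combination shows up at least once. The Python sketch below illustrates only that reading. The name qid_presence_rate and the list-of-dicts layout are assumptions, and this is not what disc_qid_mi.R necessarily computes.

```python
from collections import Counter


def qid_presence_rate(replicates, qid_cols):
    """Fraction of replicates containing each quasi-identifier combination.

    replicates: list of synthetic datasets, each a list of dict records.
    qid_cols:   column names treated as quasi-identifiers (an assumption).
    """
    if not replicates:
        raise ValueError("need at least one replicate")

    seen = Counter()
    for rep in replicates:
        # Use a set so a combination counts once per replicate,
        # no matter how many rows share it.
        keys = {tuple(row[col] for col in qid_cols) for row in rep}
        seen.update(keys)

    return {key: count / len(replicates) for key, count in seen.items()}
```

Combinations that never appear in any replicate are absent from the result, so a caller comparing against confidential records would treat a missing key as a rate of zero.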

Attribute inferences

  • disc_ait(synth_data, test_records): attribute inference for test_records using synthetic data-based models
  • disc_ait_compare(synth_data, test_records, holdout_data): attribute inference for test_records comparing differences between using synthetic and holdout data (link)
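
The comparison behind disc_ait_compare can be sketched as follows: fit the same model once on synthetic data and once on holdout data, predict the sensitive attribute for test_records with each, and look at the accuracy gap. This Python sketch uses a simple majority-vote lookup as a stand-in model. The function names, arguments (known_cols, target), and the lookup classifier are all assumptions for illustration. The planned R function would more likely accept a tidymodels workflow.

```python
from collections import Counter, defaultdict


def fit_lookup(train, known_cols, target):
    """Return a predictor mapping known attributes to the most common target."""
    by_key = defaultdict(Counter)
    overall = Counter()
    for row in train:
        key = tuple(row[col] for col in known_cols)
        by_key[key][row[target]] += 1
        overall[row[target]] += 1
    # Fall back to the overall majority class for unseen attribute combinations.
    fallback = overall.most_common(1)[0][0]

    def predict(row):
        key = tuple(row[col] for col in known_cols)
        votes = by_key.get(key)
        return votes.most_common(1)[0][0] if votes else fallback

    return predict


def accuracy(predict, test_records, target):
    """Share of test records whose target value is predicted correctly."""
    hits = sum(predict(row) == row[target] for row in test_records)
    return hits / len(test_records)


def ait_compare(synth, holdout, test_records, known_cols, target):
    """Attribute inference accuracy using synthetic vs. holdout training data."""
    acc_synth = accuracy(fit_lookup(synth, known_cols, target), test_records, target)
    acc_holdout = accuracy(fit_lookup(holdout, known_cols, target), test_records, target)
    return {"synth": acc_synth, "holdout": acc_holdout, "gap": acc_synth - acc_holdout}
```

A positive gap suggests the synthetic data reveals more about the test records' sensitive attribute than an attacker could learn from comparable data that excludes them.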
@awunderground
Contributor

Confidential data baseline assessments

I would appreciate functionality/best practices for working with continuous variables and mixed-type data.

Membership inferences from synthetic data

Can you share a little more detail about the linkage attack functionality?

  1. I can imagine major differences between methods for partially and fully synthetic data.
  2. What is the direction of the linkage?

Attribute inferences

I've done some crude work on this. Let me know how I can help! The discriminator workflow we added is pretty flexible and leverages library(tidymodels).

@jhseeman
Collaborator Author

@awunderground I updated this roadmap based on what was merged in. If you already have some crude work on attribute inferences, would you be willing to add it to a branch? I can massage it to work with the 0.0.4 updates. I think this will be pretty flexible, since it should probably take a tidymodels workflow as input.
