[DC-2692] synthetic dataset script version 1 #1369
base: develop
Commits on Aug 28, 2023
[DC-2692] script to run synthetic data only
* altering the base class to have another attribute which defaults to false
* adding True attributes to those classes we want to run on a synthetic data set
* altering the clean engine to only run those rules for synthetic datasets when synthetic is selected and the rule says it should be executed
* works with listing the queries as well
* need to work on the OpenTelemetry implementation to not throw errors when running something from the command line locally
Commit 179a026
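The opt-in behaviour described in this commit can be sketched roughly as follows. This is a minimal illustration, not the repository's actual code: the `CleaningRule` class, the `select_rules` helper, and the rule names are hypothetical, and only the `run_for_synthetic` attribute and `run_synthetic` parameter names come from the commit messages in this pull request.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class CleaningRule:
    """Hypothetical stand-in for the cleaning rule base class."""
    name: str
    # Defaults to False, so a rule is skipped for synthetic data
    # unless its class explicitly opts in.
    run_for_synthetic: bool = False


def select_rules(rules: Iterable[CleaningRule],
                 run_synthetic: bool = False) -> List[CleaningRule]:
    """Return the rules the engine should execute for this run (assumed logic)."""
    if not run_synthetic:
        # Normal runs are unaffected by the new attribute.
        return list(rules)
    # Synthetic runs keep only the rules that opted in.
    return [rule for rule in rules if rule.run_for_synthetic]


# Hypothetical rule names, for illustration only.
rules = [
    CleaningRule("drop_duplicates", run_for_synthetic=True),
    CleaningRule("remove_participants"),
]
selected = [rule.name for rule in select_rules(rules, run_synthetic=True)]
print(selected)
```

Defaulting the attribute to false means existing rules need no changes; only the rules meant for synthetic data have to be touched.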
Commit f7693b6
[DC-2692] pylint ignore redefinition
* ignoring the redefinition here.
* redefinition will be removed once all classes are base classed
Commit f3adc92
[DC-2692] rdr import modifications
* alter the import script to warn curation whenever we see a table in the bucket that we do not process
Commit 1e13113
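A warning of the kind described in this commit might look something like the sketch below. Everything here is assumed for illustration: the `PROCESSED_TABLES` set, the `find_unprocessed_tables` helper, and the file names are hypothetical and do not reflect the actual import script.

```python
import logging
from typing import Iterable, List

LOGGER = logging.getLogger(__name__)

# Hypothetical set of tables the import script knows how to process.
PROCESSED_TABLES = {"person", "observation", "survey_conduct"}


def find_unprocessed_tables(bucket_files: Iterable[str]) -> List[str]:
    """Log a warning for each bucket file whose table is not processed."""
    unprocessed = []
    for file_name in bucket_files:
        # Drop any folder prefix and the extension: "rdr/person.csv" -> "person"
        table = file_name.rsplit("/", 1)[-1].split(".", 1)[0].lower()
        if table not in PROCESSED_TABLES:
            LOGGER.warning(
                "Table '%s' was found in the bucket but is not processed", table)
            unprocessed.append(table)
    return unprocessed


# Hypothetical bucket listing, for illustration only.
skipped = find_unprocessed_tables(["rdr/person.csv", "rdr/new_table.csv"])
print(skipped)
```

Warning instead of failing lets the import continue while still surfacing new or unexpected tables to the team.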
[DC-2692] unit test alterations
* fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
Commit 19d32c8
[DC-2692] initial version of synthetic generation script
* loads data from a bucket into a raw dataset
* creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
* runs synthetic pipeline data stage on the data in the staging dataset
* TODO: add publishing guidelines to script.
Commit b9d8067
[DC-2692] add mapping table and extension tables to script
* leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
* update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
* allow extension table generation and cope survey versioning to run on synthetic data.
* TODO: "publish" data to an internal dataset.
Commit 51705eb
[DC-2692] add extra columns to the person table
* making sure person table columns are appended
Commit 12daeba
[DC-2692] removing accidentally committed file
* The txt file was not meant for inclusion.
Commit e9e515b
* some changes to the script while trying to run it initially
* adding vocab_dataset parameter
Commit 02f6be2
* changes required when running the synthetic script all the way through
* the script did finish
* more changes are expected
Commit 92f9a94
[DC-2692] adding changes based on stashed files
* sets some run_for_synthetic rules to False to avoid dropping too much test data
Commit 416c3f5
Commit 9b3dc70
* changed f-string usage to jinja2 templates
* used pre-defined variable for constant value
* removed redundant code to reuse existing dataset copy utility
* removed conflict code
Commit d20e2a5
Commit ca916f8
Commit 4fa6b40
Commit 6661f31
Commit 7b820f7
[DC-2692] synthetic script enhancements
* uses cleaning rules to clean survey_conduct table data
* removes duplicated code to create cleaned survey_conduct table data
* prepares to potentially run all rules from RDR ingest to RT clean dataset
* still only runs a subset of rules marked as run_for_synthetic
Commit 70ed906