[DC-2692] synthetic dataset script version 1 #1369

Open
wants to merge 19 commits into
base: develop

Commits on Aug 28, 2023

  1. [DC-2692] script to run synthetic data only

    * altering the base class to have another attribute which defaults to false
    * adding True attributes to those classes we want to run on a synthetic data set
    * altering the clean engine to only run those rules for synthetic datasets when synthetic is selected and the rule says it should be executed
    * works with listing the queries as well
    * need to work on the opentelemetry implementation so it does not throw errors when running something locally from the command line
    lrwb-aou committed Aug 28, 2023
    179a026
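The first commit describes a flag on the base class that defaults to false, set to true on selected rules, with the clean engine filtering on it when a synthetic run is requested. A minimal sketch of that idea follows. Only `run_for_synthetic` and `run_synthetic` appear in the commit messages; the class names and `select_rules` are hypothetical stand-ins, not the repository's actual code.

```python
class BaseCleaningRule:
    """Hypothetical stand-in for the cleaning rule base class."""

    def __init__(self, run_for_synthetic=False):
        # Defaults to False, so existing rules are skipped on synthetic runs
        # unless they opt in explicitly.
        self.run_for_synthetic = run_for_synthetic


class DropDuplicateRows(BaseCleaningRule):
    """Hypothetical rule that opts in to synthetic runs."""

    def __init__(self):
        super().__init__(run_for_synthetic=True)


class SuppressFreeText(BaseCleaningRule):
    """Hypothetical rule that keeps the default and is skipped."""


def select_rules(rules, run_synthetic=False):
    """Return the rules to execute for this run.

    A normal run keeps every rule. A synthetic run keeps only the rules
    whose run_for_synthetic attribute is True.
    """
    if not run_synthetic:
        return list(rules)
    return [rule for rule in rules if rule.run_for_synthetic]
```

Because the same filter is applied before rules are executed or listed, a query-listing mode would see the same subset as an actual run, which matches the note that listing the queries works as well.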
  2. [DC-2692] unit test fixes

    * fixing two unit tests
    lrwb-aou committed Aug 28, 2023
    f7693b6
  3. [DC-2692] pylint ignore redefinition

    * ignoring the redefinition here.
    * redefinition will be removed once all classes are base classed
    lrwb-aou committed Aug 28, 2023
    f3adc92
  4. [DC-2692] rdr import modifications

    * alter the import script to warn curation whenever we see a table in the bucket that we do not process
    lrwb-aou committed Aug 28, 2023
    1e13113
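The warning described in this commit could look roughly like the sketch below. It assumes the import compares file names found in the bucket against a list of tables it knows how to process; the function name, its arguments, and the message text are hypothetical and not taken from the import script.

```python
import logging
from pathlib import PurePosixPath

LOGGER = logging.getLogger(__name__)


def warn_on_unprocessed_tables(blob_names, known_tables):
    """Log a warning for every bucket file that maps to no known table.

    blob_names: object names in the bucket, e.g. "export/person.csv".
    known_tables: table names the import script processes.
    Returns the sorted list of table names that were not recognized.
    """
    # Derive a table name from each object name by dropping the
    # directory prefix and the file extension.
    seen = {PurePosixPath(name).stem for name in blob_names}
    unprocessed = sorted(seen - set(known_tables))
    for table in unprocessed:
        LOGGER.warning(
            "Table '%s' was found in the bucket but is not processed", table)
    return unprocessed
```

Returning the list as well as logging it keeps the check easy to unit test without capturing log output.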
  5. [DC-2692] unit test alterations

    * fixes failing unit tests impacted by the addition of the run_synthetic parameter to infer_rule()
    lrwb-aou committed Aug 28, 2023
    19d32c8
  6. [DC-2692] initial version of synthetic generation script

    * loads data from a bucket into a raw dataset
    * creates a synthetic dataset and its appropriate versions (staging, sandbox, and clean)
    * runs synthetic pipeline data stage on the data in the staging dataset
    * TODO:  add publishing guidelines to script.
    lrwb-aou committed Aug 28, 2023
    b9d8067
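As an illustration of the raw dataset and the "appropriate versions" listed in this commit, the sketch below derives one dataset name per stage from a single tag. The naming scheme and the function are assumptions made for the example; the commit messages do not state how the script actually names its datasets.

```python
def synthetic_dataset_names(release_tag):
    """Return a hypothetical dataset name for each stage of a synthetic run.

    The suffixes used here are illustrative only.
    """
    base = f"{release_tag}_synthetic"
    return {
        "raw": f"{base}_raw",          # data loaded from the bucket
        "staging": f"{base}_staging",  # where the cleaning rules run
        "sandbox": f"{base}_sandbox",  # rows set aside by the rules
        "clean": base,                 # final cleaned dataset
    }
```

Computing every name in one place means later steps, such as running the synthetic data stage on the staging dataset, can look names up instead of rebuilding them.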
  7. [DC-2692] add mapping table and extension tables to script

    * leverage function in `create_combined_backup_dataset.py` to create rudimentary rdr mapping tables.
    * update the synthetic data stage to leverage the Registered Tier dataset cleaning rules
    * allow extension table generation and cope survey versioning to run on synthetic data.
    * TODO:  "publish" data to an internal dataset.
    lrwb-aou committed Aug 28, 2023
    51705eb
  8. [DC-2692] add extra columns to the person table

    * making sure person table columns are appended
    lrwb-aou committed Aug 28, 2023
    12daeba
  9. [DC-2692] removing accidentally committed file

    * The txt file was not meant for inclusion.
    lrwb-aou committed Aug 28, 2023
    e9e515b
  10. [DC-2692]

    * some changes to the script while trying to run it initially
    * adding vocab_dataset parameter
    lrwb-aou committed Aug 28, 2023
    02f6be2
  11. [DC-2692] synthetic script

    * changes required when running the synthetic script all the way through
    * the script did finish
    * more changes are expected
    lrwb-aou committed Aug 28, 2023
    92f9a94
  12. [DC-2692] adding changes based on stashed files

    * sets some run_for_synthetic rules to False to avoid dropping too much test data
    lrwb-aou committed Aug 28, 2023
    416c3f5
  13. 9b3dc70
  14. [DC-2692] more changes

    * changed f-string usage to jinja2 templates
    * used pre-defined variable for constant value
    * removed redundant code to reuse existing dataset copy utility
    * removed conflict code
    lrwb-aou committed Aug 28, 2023
    d20e2a5
  15. ca916f8
  16. 4fa6b40
  17. 6661f31
  18. [DC-2692] conflict changes

    lrwb-aou committed Aug 28, 2023
    7b820f7
  19. [DC-2692] synthetic script enhancements

    * uses cleaning rules to clean survey_conduct table data
    * removes duplicated code to create cleaned survey_conduct table data
    * prepares to potentially run all rules from RDR ingest to RT clean dataset
    * still only runs a subset of rules marked as run_for_synthetic
    lrwb-aou committed Aug 28, 2023
    70ed906