HMS 520 Autumn quarter 2021 final project public repository. This repository includes presentation information, all of the component scripts, and the description of our project.
The goal of this final project is to create a set of functions to clean and validate finished extraction datasets from "by hand" scientific literature systematic review extractions. It performs these key functions:
- Check for missingness in user-defined columns
- Check for duplicates, using user-defined expected groupings of unique observations
- Apply user-specified custom validations
- Check for potential outlier candidates using Mean Absolute Deviation
- Split dataset into multiple bundles based on user-specified criteria
- Write out a folder with diagnostic `.xlsx` and `.txt` files describing the checks that were applied
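
The missingness and duplicate checks listed above can be sketched in base R as follows. This is only an illustration under stated assumptions: the helper names `check_missing` and `check_duplicates`, their return shapes, the example columns, and the choice to treat empty strings as missing are all hypothetical and are not taken from the repository's scripts.

```r
# Hypothetical helpers; names and return shapes are assumptions,
# not the repository's actual functions. Both expect a plain data.frame.

# Count missing values (NA or empty string) in each user-defined column.
check_missing <- function(dt, vars_check) {
  n_missing <- vapply(
    vars_check,
    function(v) sum(is.na(dt[[v]]) | dt[[v]] == "", na.rm = TRUE),
    integer(1)
  )
  data.frame(variable = vars_check, n_missing = n_missing, row.names = NULL)
}

# Return every row whose combination of byvars appears more than once.
check_duplicates <- function(dt, byvars) {
  keys <- dt[, byvars, drop = FALSE]
  is_dup <- duplicated(keys) | duplicated(keys, fromLast = TRUE)
  dt[is_dup, , drop = FALSE]
}

# Toy extraction table with made-up column names.
extraction <- data.frame(
  study_id = c(1, 1, 2, 3),
  year     = c(2000, 2000, 2005, NA),
  mean     = c(0.1, 0.1, 0.4, 0.2)
)

missing_report <- check_missing(extraction, c("year", "mean"))
duplicate_rows <- check_duplicates(extraction, c("study_id", "year"))
```

Returning the offending rows, rather than dropping them, keeps the checks diagnostic: the extractor decides what to fix.
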
The only two scripts you will need to touch are:
- `config.R`
- `save_report.R`

The rest are child scripts called by `save_report.R`.
`config.R`: generates `config.RDS`, which is the config file of a list of all your inputs that is fed into the parent script. Modify the arguments in `config.R` to fit your use case. Arguments:
- `source_dir` <- string; directory where your repository is located and from which you source your functions
- `config_dir` <- string; directory where you save `config.RDS`
- `data_path` <- string; full path to where you have your data, must be `.xlsx`
- `ages_path` <- string; full path to where you have your GBD age data table (for age-standardizing outliering), must be `.xlsx`
- `output_root` <- string; directory where you save your output diagnostic files. Creates `output_dir`, a date-versioned folder in `output_root`
- `vars_check` <- character vector; vector of variable names that you want to check for missingness in `missing_check.R`
- `byvars` <- character vector; vector of variable names which define what you expect to be a unique group in `duplicate_check.R` and also used in `outlier_check.R`
- `validation_criteria` <- character list; list of criteria used for custom validation in `validation_check.R`
- `n` <- integer; number of deviations to flag for potential outliering, default is 3
- `flag_zeros` <- boolean; whether to flag zeros for potential outliering (generally T for common causes, F for rare causes)
- `bundle_args` <- data.table; has three columns where each row corresponds to a bundle that you are splitting out of the parent dataset into its own `.csv` in `bundle_split.R`.
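
As a rough illustration of the arguments above, a config script might collect them into a list and save it with `saveRDS()`. Everything here is a placeholder: the paths, the column names, the format of `validation_criteria` (assumed to be R expressions stored as strings), and the three `bundle_args` column names are assumptions, since this README does not specify them. A `data.frame` stands in for the documented data.table to keep the sketch dependency-free.

```r
# Sketch of a config script. Every path and value is a placeholder.
source_dir  <- "path/to/repository"
config_dir  <- tempdir()  # placeholder; use a persistent folder in practice
data_path   <- "path/to/extraction.xlsx"
ages_path   <- "path/to/gbd_ages.xlsx"
output_root <- "path/to/output"

# Hypothetical column names from an extraction sheet.
vars_check <- c("year_start", "year_end", "mean")
byvars     <- c("study_id", "location", "year_start")

# Assumed format: one logical R expression per criterion, as a string.
validation_criteria <- list("mean >= 0", "lower <= upper")

n          <- 3L    # number of deviations before a value is flagged
flag_zeros <- TRUE  # generally TRUE for common causes, FALSE for rare causes

# Three columns, one row per bundle. Column names are hypothetical.
# Convert with data.table::as.data.table() to match the documented type.
bundle_args <- data.frame(
  bundle_name   = c("bundle_a", "bundle_b"),
  filter_column = c("cause", "cause"),
  filter_value  = c("a", "b"),
  stringsAsFactors = FALSE
)

config <- list(
  source_dir = source_dir, config_dir = config_dir,
  data_path = data_path, ages_path = ages_path, output_root = output_root,
  vars_check = vars_check, byvars = byvars,
  validation_criteria = validation_criteria,
  n = n, flag_zeros = flag_zeros, bundle_args = bundle_args
)

saveRDS(config, file.path(config_dir, "config.RDS"))
```
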
`save_report.R`: This is the parent file. You will need to update the `source_dir` and the `config_dir` to reflect your directories. All other inputs are read from `config.RDS`. It runs the child scripts and outputs a set of diagnostic files to your output directory.
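
The parent-script flow described here (read the config, create the date-versioned output folder, source the child scripts) could look roughly like the sketch below. The function name `load_pipeline`, the `%Y-%m-%d` folder format, and the return value are assumptions made for illustration; only the script names and their order come from this README.

```r
# Child scripts in the launch order given in this README.
child_scripts <- c(
  "missing_check.R", "duplicate_check.R", "outlier_check.R",
  "validation_check.R", "write_outputs.R", "bundle_split.R"
)

# Hypothetical helper: loads the config, creates the output folder,
# and sources each child script so its functions become available.
load_pipeline <- function(source_dir, config_dir) {
  config <- readRDS(file.path(config_dir, "config.RDS"))

  # Date-versioned folder inside output_root; the date format is an assumption.
  output_dir <- file.path(config$output_root, format(Sys.Date(), "%Y-%m-%d"))
  dir.create(output_dir, recursive = TRUE, showWarnings = FALSE)

  for (script in child_scripts) {
    source(file.path(source_dir, script))
  }

  list(config = config, output_dir = output_dir)
}
```

A call such as `load_pipeline("path/to/repository", "path/to/config")` would then return the loaded config together with the folder that the diagnostics are written to.
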
The child scripts are launched from `save_report.R` in this order:
1. `missing_check.R`
2. `duplicate_check.R`
3. `outlier_check.R`
4. `validation_check.R`
5. `write_outputs.R`
6. `bundle_split.R`

- Their functions correspond to the key functions described in the "about" section.
- Additional information on inputs and outputs for each function is available in the function headers.
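
To make the outlier step concrete, here is a minimal sketch of flagging values by mean absolute deviation, using the `n` and `flag_zeros` arguments from the config. It is an interpretation, not the code in `outlier_check.R`: the function name `flag_outliers` is hypothetical, grouping by `byvars` and age standardization are omitted, and the handling of zeros (always flagged when `flag_zeros` is `TRUE`, never flagged otherwise) is an assumption.

```r
# Hypothetical helper: flags values more than n mean absolute deviations
# away from the mean. Returns a logical vector the same length as x.
flag_outliers <- function(x, n = 3, flag_zeros = FALSE) {
  center   <- mean(x, na.rm = TRUE)
  mean_dev <- mean(abs(x - center), na.rm = TRUE)  # mean absolute deviation

  flagged <- !is.na(x) & abs(x - center) > n * mean_dev
  is_zero <- !is.na(x) & x == 0

  # Assumed zero handling: always flag zeros, or never flag them.
  if (flag_zeros) flagged | is_zero else flagged & !is_zero
}

# Toy data: one extreme value and one zero.
values <- c(10, 11, 9, 10, 12, 10, 9, 11, 10, 60, 0)

flags_default    <- flag_outliers(values)
flags_with_zeros <- flag_outliers(values, flag_zeros = TRUE)
```

Base R's `mad()` computes the median absolute deviation, which is a different statistic, so the mean-based deviation is computed by hand here.
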