diff --git a/clean-data.md b/clean-data.md index c451c77e..b6fd1d13 100644 --- a/clean-data.md +++ b/clean-data.md @@ -316,7 +316,7 @@ This approach simplifies the data cleaning process, ensuring that categorical da In epidemiological data analysis it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis. The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the `span()` function to compute the time elapsed since the date of sample for the case identified - until the date this document was generated (2024-07-09). + until the date this document was generated (2024-08-01). ``` r @@ -343,9 +343,9 @@ utils::head(sim_ebola_data) 1 9 3 2 10 6 3 9 4 -4 9 6 -5 7 8 -6 8 5 +4 9 7 +5 7 9 +6 8 6 ``` After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months. @@ -399,7 +399,7 @@ individual cleansing steps within the broader data cleansing process. You can view the report using `cleanepi::print_report()` function. -![Example of data cleaning report generated by `{cleanepi}`](fig/report_demo.png) +![Example of data cleaning report generated by `{cleanepi}`.](fig/report_demo.png) ## Validating and tagging case data In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, @@ -439,6 +439,116 @@ utils::head(data, 7) // tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age ``` +The resulting `linelist` object resembles a data frame but offers richer features +and functionalities. 
Packages that are linelist-aware can leverage these +features. For example, you can extract a data frame containing only the tagged columns +using the `linelist::tags_df()` function, as shown below: + +``` r +head(linelist::tags_df(data), 5) +``` + +``` output + id date_onset date_reporting gender age +1 14905 2015-03-15 2015-04-06 male 90 +2 13043 2014-01-03 female 25 +3 14364 2014-02-09 2015-03-03 female 54 +4 14675 2014-10-19 2014-12-31 90 +5 12648 2014-06-08 2016-10-10 female 74 +``` + +Safeguards are built into linelist objects. If you try to drop any of the tagged +columns, you will receive a warning or an error message, as shown in the example below. + + +``` r +new_df <- data |> + dplyr::select(linelist::has_tag(c("id", "age"))) +``` + +``` warning +Warning: The following tags have lost their variable: + date_onset:date_onset, date_reporting:date_sample, gender:gender +``` + +By default, a linelist object issues a warning when tags are lost. However, this can be changed to an error using `lost_tags_action()`. + +::::::::::::::::::::::::::::::::::::: challenge + +- Set the action for lost tags in a linelist to error as follows: + + + ``` r + linelist::lost_tags_action(action = "error") + ``` +and re-run the above code segment. +- What do you learn from the resulting message? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +The `{linelist}` package supplies tags for common epidemiological variables +and specifies the data types accepted for each of them. 
You can view these by running the +following command: + +``` r +linelist::tags_types() +``` + +``` output +$id +[1] "numeric" "integer" "character" + +$date_onset +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$date_reporting +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$date_admission +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$date_discharge +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$date_outcome +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$date_death +[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt" + +$gender +[1] "character" "factor" + +$age +[1] "numeric" "integer" + +$location +[1] "character" "factor" + +$occupation +[1] "character" "factor" + +$hcw +[1] "logical" "integer" "character" "factor" + +$outcome +[1] "character" "factor" +``` +To ensure that all tagged variables are standardized and have the correct data +types, use the `linelist::validate_tags()` and `linelist::validate_types()` functions, respectively, as +shown in the example below: + +```r +linelist::validate_tags(data, + allow_extra = FALSE +) +linelist::validate_types(data, + ref_types = linelist::tags_types() +) +``` +If your dataset contains a non-default tag, set the argument +`allow_extra = TRUE` when creating the linelist object. 
+ ::::::::::::::::::::::::::::::::::::: keypoints diff --git a/md5sum.txt b/md5sum.txt index 5b911ce5..cf9be076 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -5,7 +5,7 @@ "index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-07-02" "links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-07-02" "episodes/read-cases.Rmd" "b7aef81b60501065599814c0db15f512" "site/built/read-cases.md" "2024-07-02" -"episodes/clean-data.Rmd" "f945fe9d7dd34d01c0d02805a358a872" "site/built/clean-data.md" "2024-07-09" +"episodes/clean-data.Rmd" "2ef69b0a12062590eff29949b7102041" "site/built/clean-data.md" "2024-08-01" "episodes/describe-cases.Rmd" "cd9cb1c9d43eb3618e7a8a51b3748e55" "site/built/describe-cases.md" "2024-07-02" "instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2024-07-02" "learners/reference.md" "106717912e909a7c8d9e3e8fea48e17d" "site/built/reference.md" "2024-07-02"