differences for PR #95
actions-user committed Aug 1, 2024
1 parent 041a3cb commit fef4f2f
Showing 2 changed files with 116 additions and 6 deletions.
120 changes: 115 additions & 5 deletions clean-data.md
@@ -316,7 +316,7 @@ This approach simplifies the data cleaning process, ensuring that categorical da

In epidemiological data analysis it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis.
The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the code snippet below uses the `span()` function to compute the time elapsed from the sampling date of each case
-until the date this document was generated (2024-07-09).
+until the date this document was generated (2024-08-01).


``` r
@@ -343,9 +343,9 @@ utils::head(sim_ebola_data)
1 9 3
2 10 6
3 9 4
-4 9 6
-5 7 8
-6 8 5
+4 9 7
+5 7 9
+6 8 6
```

After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months.
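
To make the two output columns concrete, here is a minimal base-R sketch of the same idea: whole years elapsed, plus the leftover whole months. It is not how `{cleanepi}` implements `span()`; the helper name `elapsed_years_months()` and the sampling dates are hypothetical, and `{cleanepi}` may treat partial months or month-end dates differently.

```r
# Hypothetical helper (not part of {cleanepi}): whole years and leftover
# whole months between two dates.
elapsed_years_months <- function(from, to) {
  from <- as.POSIXlt(as.Date(from))
  to <- as.POSIXlt(as.Date(to))
  # Count calendar months, then drop the last one if it is not yet complete.
  total_months <- (to$year - from$year) * 12L + (to$mon - from$mon)
  total_months <- total_months - as.integer(to$mday < from$mday)
  data.frame(
    years = total_months %/% 12L,
    remainder_months = total_months %% 12L
  )
}

# Made-up sampling dates, measured up to a fixed reference date.
sampling_dates <- as.Date(c("2015-01-15", "2014-02-01"))
elapsed <- elapsed_years_months(sampling_dates, as.Date("2024-08-01"))
print(elapsed)
```

Using a fixed reference date rather than `Sys.Date()` keeps the result reproducible, which is also why the rendered lesson output changes each time the document is rebuilt.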
@@ -399,7 +399,7 @@ individual cleansing steps within the broader data cleansing process.

You can view the report using the `cleanepi::print_report()` function.

-![Example of data cleaning report generated by `{cleanepi}`](fig/report_demo.png)
+![Example of data cleaning report generated by `{cleanepi}`.](fig/report_demo.png)

## Validating and tagging case data
In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data,
@@ -439,6 +439,116 @@ utils::head(data, 7)
// tags: id:case_id, date_onset:date_onset, date_reporting:date_sample, gender:gender, age:age
```
The resulting `linelist` object resembles a data frame but offers richer features
and functionalities. Packages that are linelist-aware can leverage these
features. For example, you can extract a data frame of only the tagged columns
using the `linelist::tags_df()` function, as shown below:

``` r
head(linelist::tags_df(data), 5)
```

``` output
     id date_onset date_reporting gender age
1 14905 2015-03-15     2015-04-06   male  90
2 13043       <NA>     2014-01-03 female  25
3 14364 2014-02-09     2015-03-03 female  54
4 14675 2014-10-19     2014-12-31   <NA>  90
5 12648 2014-06-08     2016-10-10 female  74
```
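
The same idea can be tried on a small, self-contained data set. The sketch below is illustrative only: the `toy` data frame, its column names, and its values are invented, and it assumes the `{linelist}` package is installed. It tags four columns with `make_linelist()` and then extracts them with `tags_df()`.

```r
library(linelist)

# Toy data frame; column names and values are invented for this sketch.
toy <- data.frame(
  case_id = c(1L, 2L, 3L),
  date_sample = as.Date(c("2024-01-05", "2024-01-09", "2024-02-11")),
  sex = c("male", "female", NA),
  age = c(34, 51, 7),
  ward = c("A", "B", "A")
)

# Tag the epidemiological columns; `ward` is deliberately left untagged.
toy_linelist <- make_linelist(
  toy,
  id = "case_id",
  date_reporting = "date_sample",
  gender = "sex",
  age = "age"
)

# tags_df() keeps only the tagged columns and names them after their tags.
toy_tags <- tags_df(toy_linelist)
print(names(toy_tags))
```

Because the returned columns are named after the tags rather than the original variables, downstream code can rely on names such as `id` and `age` regardless of how the raw data were labelled.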

Safeguards are built into linelist objects. If you try to drop any of the tagged
columns, you will receive a warning or an error message, as shown in the example below.


``` r
new_df <- data |>
  dplyr::select(linelist::has_tag(c("id", "age")))
```

``` warning
Warning: The following tags have lost their variable:
date_onset:date_onset, date_reporting:date_sample, gender:gender
```
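
If you want to inspect the lost-tags message programmatically rather than read it in the console, one option is to capture it with `tryCatch()`. The sketch below is a minimal illustration on invented data. It assumes `{linelist}` is installed and that base `[` subsetting triggers the same safeguard as `dplyr::select()`; the `toy` data frame and the `lost_tags_message` variable are hypothetical.

```r
library(linelist)

# Invented data with two tagged columns and one untagged column.
toy <- data.frame(
  case_id = 1:3,
  age = c(34, 51, 7),
  ward = c("A", "B", "A")
)
toy_linelist <- make_linelist(toy, id = "case_id", age = "age")

# Dropping the tagged `age` column triggers the lost-tags action.
# Both handlers are supplied so the message is captured whether the
# current action is a warning or an error.
lost_tags_message <- tryCatch(
  {
    toy_linelist[, c("case_id", "ward")]
    NA_character_
  },
  warning = function(w) conditionMessage(w),
  error = function(e) conditionMessage(e)
)
print(lost_tags_message)
```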

The default action for lost tags in a linelist object is a warning. However, it can be changed to an error using `lost_tags_action()`.

::::::::::::::::::::::::::::::::::::: challenge

- Set the action for lost tags in a linelist to error as follows:


``` r
linelist::lost_tags_action(action = "error")
```
and re-run the code segment above.
- What do you learn from the resulting message?

::::::::::::::::::::::::::::::::::::::::::::::::

The `{linelist}` package supplies tags for common epidemiological variables
and specifies the data types allowed for each of them. You can view these by running the
following command:

``` r
linelist::tags_types()
```

``` output
$id
[1] "numeric" "integer" "character"
$date_onset
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$date_reporting
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$date_admission
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$date_discharge
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$date_outcome
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$date_death
[1] "integer" "numeric" "Date" "POSIXct" "POSIXlt"
$gender
[1] "character" "factor"
$age
[1] "numeric" "integer"
$location
[1] "character" "factor"
$occupation
[1] "character" "factor"
$hcw
[1] "logical" "integer" "character" "factor"
$outcome
[1] "character" "factor"
```
To ensure that all tagged variables are standardized and have the correct data
types, use the `linelist::validate_tags()` and `linelist::validate_types()` functions, respectively, as
shown in the example below:

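```r
# Check that the tagged variables are present and correctly named
linelist::validate_tags(data,
  allow_extra = FALSE
)
# Check that each tagged variable has one of the allowed data types
linelist::validate_types(data,
  ref_types = linelist::tags_types()
)
```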
If your dataset contains a non-default tag, set the argument
`allow_extra = TRUE` when creating the linelist object.
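
To see what a failed type check looks like, the sketch below stores `age` as character on purpose and captures the resulting error message. It is a minimal illustration on invented data, assuming `{linelist}` is installed; `toy`, `type_check`, and `fixed_ok` are hypothetical names. Linelist creation is wrapped in the same `tryCatch()` in case the check is raised at that stage instead.

```r
library(linelist)

# Invented data: `age` is stored as character, which is not an allowed type.
toy <- data.frame(
  case_id = 1:3,
  age = c("34", "51", "7")
)

type_check <- tryCatch(
  {
    toy_linelist <- make_linelist(toy, id = "case_id", age = "age")
    validate_types(toy_linelist, ref_types = tags_types())
    "types are valid"
  },
  error = function(e) conditionMessage(e)
)
print(type_check)

# After converting `age` to numeric, the same check should pass.
toy$age <- as.numeric(toy$age)
fixed_ok <- tryCatch(
  {
    fixed_linelist <- make_linelist(toy, id = "case_id", age = "age")
    validate_types(fixed_linelist, ref_types = tags_types())
    TRUE
  },
  error = function(e) FALSE
)
print(fixed_ok)
```

Fixing the column type in the underlying data frame before tagging, as done here, keeps the conversion explicit and easy to audit.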


::::::::::::::::::::::::::::::::::::: keypoints

2 changes: 1 addition & 1 deletion md5sum.txt
@@ -5,7 +5,7 @@
"index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-07-02"
"links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-07-02"
"episodes/read-cases.Rmd" "b7aef81b60501065599814c0db15f512" "site/built/read-cases.md" "2024-07-02"
-"episodes/clean-data.Rmd" "f945fe9d7dd34d01c0d02805a358a872" "site/built/clean-data.md" "2024-07-09"
+"episodes/clean-data.Rmd" "2ef69b0a12062590eff29949b7102041" "site/built/clean-data.md" "2024-08-01"
"episodes/describe-cases.Rmd" "cd9cb1c9d43eb3618e7a8a51b3748e55" "site/built/describe-cases.md" "2024-07-02"
"instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2024-07-02"
"learners/reference.md" "106717912e909a7c8d9e3e8fea48e17d" "site/built/reference.md" "2024-07-02"