From e53c370b3d59b61fe464bd5c6d2ace2c60eddb8a Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 15:14:18 +0100 Subject: [PATCH 01/12] replace object name for linelist data (fix #99) --- episodes/clean-data.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 890e6aa9..4f9dedcb 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -317,13 +317,13 @@ it's essential to establish an additional foundational layer to ensure the integ ```{r,warning=FALSE} library(linelist) -data <- linelist::make_linelist( +linelist_data <- linelist::make_linelist( x = cleaned_data, id = "case_id", date_onset = "date_onset", gender = "gender" ) -utils::head(data, 7) +utils::head(linelist_data, 7) ``` The `{linelist}` package supplies tags for common epidemiological variables @@ -377,7 +377,7 @@ Safeguarding is implicitly built into the linelist objects. If you try to drop a columns, you will receive an error or warning message, as shown in the example below. ```{r, warning=TRUE} -new_df <- data %>% +new_df <- linelist_data %>% dplyr::select(case_id, gender) ``` @@ -390,7 +390,7 @@ Let's test the implications of changing the **safeguarding** configuration from - First, run this code to count the frequency per category within a categorical variable: ```{r,eval=FALSE} -data %>% +linelist_data %>% dplyr::select(case_id, gender) %>% dplyr::count(gender) ``` @@ -419,7 +419,7 @@ types, use the `linelist::validate_linelist()`, as shown in the example below: ```r -linelist::validate_linelist(data) +linelist::validate_linelist(linelist_data) ``` @@ -511,7 +511,7 @@ and functionalities. Packages that are linelist-aware can leverage these features. 
For example, you can extract a dataframe of only the tagged columns using the `linelist::tags_df()` function, as shown below: ```{r, warning=FALSE} -head(linelist::tags_df(data), 5) +head(linelist::tags_df(linelist_data), 5) ``` This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode! From f9531fa6e5de06e750edd3d8a6916a6db7438ae5 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 15:18:23 +0100 Subject: [PATCH 02/12] add question connecting lost tag and validation --- episodes/clean-data.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 4f9dedcb..64237094 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -398,7 +398,7 @@ linelist_data %>% - Set behavior for lost tags in a `linelist` to "error" as follows: ```{r, eval=FALSE} -# set behavior to default "warning" +# set behavior to the default option: "warning" linelist::lost_tags_action() # set behavior to "error" @@ -457,7 +457,7 @@ cleaned_data %>% linelist::validate_linelist() ``` -Why are we getting this error message? +Why are we getting this `Error` message? Should we have a `Warning` message instead? 
:::::::::::::::::::::::::: From 17ad13bea8ab6be48372393f8d83909597cab628 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 15:25:35 +0100 Subject: [PATCH 03/12] use tibble outputs and remove head() (fix #100) --- episodes/clean-data.Rmd | 25 ++++++++++++++++--------- 1 file changed, 16 insertions(+), 9 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 64237094..39f6bdc0 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -61,19 +61,21 @@ The first step is to import the dataset following the guidelines outlined in the # e.g.: if path to file is data/simulated_ebola_2.csv then: raw_ebola_data <- rio::import( here::here("data", "simulated_ebola_2.csv") -) +) %>% + dplyr::as_tibble() # for a simple data frame output ``` ```{r,eval=TRUE,echo=FALSE,message=FALSE} # Read data raw_ebola_data <- rio::import( file.path("data", "simulated_ebola_2.csv") -) +) %>% + dplyr::as_tibble() # for a simple data frame output ``` ```{r, message=FALSE} # Return first five rows -utils::head(raw_ebola_data, 5) +raw_ebola_data ``` ## A quick inspection @@ -167,7 +169,7 @@ sim_ebola_data <- cleanepi::standardize_dates( ) ) -utils::head(sim_ebola_data) +sim_ebola_data ``` This function coverts the values in the target columns, or will automatically figure out the date columns within the dataset (if `target_columns = NULL`) and convert them into the **Ymd** format. 
@@ -180,7 +182,8 @@ The `convert_to_numeric()` function in `{cleanepi}` does such conversion as illu sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data, target_columns = "age" ) -utils::head(sim_ebola_data) + +sim_ebola_data ``` ## Epidemiology related operations @@ -229,7 +232,8 @@ sim_ebola_data <- cleanepi::clean_using_dictionary( sim_ebola_data, dictionary = test_dict ) -utils::head(sim_ebola_data) + +sim_ebola_data ``` This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis. @@ -251,7 +255,8 @@ sim_ebola_data <- cleanepi::timespan( span_column_name = "time_since_sampling_date", span_remainder_unit = "months" ) -utils::head(sim_ebola_data) + +sim_ebola_data ``` After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months. @@ -317,13 +322,15 @@ it's essential to establish an additional foundational layer to ensure the integ ```{r,warning=FALSE} library(linelist) + linelist_data <- linelist::make_linelist( x = cleaned_data, id = "case_id", date_onset = "date_onset", gender = "gender" ) -utils::head(linelist_data, 7) + +linelist_data ``` The `{linelist}` package supplies tags for common epidemiological variables @@ -511,7 +518,7 @@ and functionalities. Packages that are linelist-aware can leverage these features. For example, you can extract a dataframe of only the tagged columns using the `linelist::tags_df()` function, as shown below: ```{r, warning=FALSE} -head(linelist::tags_df(linelist_data), 5) +linelist::tags_df(linelist_data) ``` This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode! 
From 653c83f4464964c11913a2d9b3abd2c33857d9f6 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 18:25:27 +0100 Subject: [PATCH 04/12] remove trailing white spaces --- episodes/clean-data.Rmd | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 39f6bdc0..b0f6f0e3 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -61,7 +61,7 @@ The first step is to import the dataset following the guidelines outlined in the # e.g.: if path to file is data/simulated_ebola_2.csv then: raw_ebola_data <- rio::import( here::here("data", "simulated_ebola_2.csv") -) %>% +) %>% dplyr::as_tibble() # for a simple data frame output ``` @@ -69,7 +69,7 @@ raw_ebola_data <- rio::import( # Read data raw_ebola_data <- rio::import( file.path("data", "simulated_ebola_2.csv") -) %>% +) %>% dplyr::as_tibble() # for a simple data frame output ``` From 5a21c1de102c592a62e99c47210c896588147128 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 19:20:15 +0100 Subject: [PATCH 05/12] add hint solution to validate challenge (fix #120) --- episodes/clean-data.Rmd | 80 ++++++++++++++++++++++++++--------------- 1 file changed, 51 insertions(+), 29 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index b0f6f0e3..6a7dc4c0 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -436,57 +436,75 @@ linelist::validate_linelist(linelist_data) ::::::::::::::::::::::::: challenge -Let's **validate** tagged variables. Let's simulate that in an ongoing outbreak; the next day, your data has a new set of entries but one variable change of data types. Describe how `linelist::validate_linelist()` reacts when input data has a different variable data type. +Let's **validate** tagged variables. Let's simulate that in an ongoing outbreak; the next day, your data has a new set of entries (i.e., rows or observations) but one variable change of data type. 
-Try to: +For example, the variable: +- `age` changes of type from a double (`<dbl>`) variable to character (`<chr>`), -- **Change** a variable data type, -- **Tag** the linelist, and then -- **Validate** it +To simulate it: -Identify the correlation between the error messages and the output of `linelist::tags_types()`. +- **Change** the variable data type, +- **Tag** the variable into a linelist, and then +- **Validate** it. + +Describe how `linelist::validate_linelist()` reacts when input data has a different variable data type. + +:::::::::::::::::::::::::: hint + +We can use `dplyr::mutate()` to change the variable type before tagging for validation. For example: + +```{r} +cleaned_data %>% + # simulate a change of data type in one variable + dplyr::mutate(age = as.character(age)) %>% + # tag one variable + linelist::... %>% + # validate the linelist + linelist::... +``` + +:::::::::::::::::::::::::: :::::::::::::::::::::::::: hint -### Example +> Please run the code line by line, focusing only on the parts before the pipe (`%>%`). After each step, observe the output before moving to the next line. -If we change the `age` variable from numeric to character: +If the `age` variable changes from double (`<dbl>`) to character (`<chr>`) we get the following: ```{r} cleaned_data %>% - # simulate a change of data type - dplyr::mutate(age_character = as.character(age)) %>% - # tag + # simulate a change of data type in one variable + dplyr::mutate(age = as.character(age)) %>% + # tag one variable linelist::make_linelist( - age = "age_character" + age = "age" ) %>% - # validate + # validate the linelist linelist::validate_linelist() ``` -Why are we getting this `Error` message? Should we have a `Warning` message instead? +Why are we getting an `Error` message? +Should we have a `Warning` message instead? Explain why. 
-:::::::::::::::::::::::::: +Now, try these additional changes to variables: +- `date_onset` changes from a `<date>` variable to character (`<chr>`), +- `gender` changes from a character (`<chr>`) variable to integer (`<int>`). -::::::::::::::::::::::::: hint +Then tag them into a linelist for validation. Does the `Error` message propose to us the solution? -### More examples - -Other frequent changes can be having: - -- a date variable like `date_onset` changed to a character, or -- a factor variable like `gender` changed to an integer. +:::::::::::::::::::::::::: -Run these examples and answer: Why are we getting an error message? +::::::::::::::::::::::::: solution ```{r,eval=FALSE} -# example 2 +# Change 2 +# Run this code line by line to identify changes cleaned_data %>% # simulate a change of data type - dplyr::mutate(date_onset_character = as.character(date_onset)) %>% + dplyr::mutate(date_onset = as.character(date_onset)) %>% # tag linelist::make_linelist( - date_onset = "date_onset_character" + date_onset = "date_onset" ) %>% # validate linelist::validate_linelist() @@ -494,19 +512,23 @@ cleaned_data %>% ```{r,eval=FALSE} -# example 3 +# Change 3 +# Run this code line by line to identify changes cleaned_data %>% # simulate a change of data type dplyr::mutate(gender = as.factor(gender)) %>% - dplyr::mutate(gender_integer = as.integer(gender)) %>% + dplyr::mutate(gender = as.integer(gender)) %>% # tag linelist::make_linelist( - gender = "gender_integer" + gender = "gender" ) %>% # validate linelist::validate_linelist() ``` +We get `Error` messages because of the mismatch between the predefined tag type (from `linelist::tags_types()`) and the tagged variable class in the linelist. + +The `Error` message inform us that in order to **validate** our linelist, we must fix the input variable type to fit the expected tag type. In a data analysis script, we can do this by adding one cleaning step into the pipeline. 
::::::::::::::::::::::::: From 56bb08d3b5ad79fe155ff2d3d457f0d06429731f Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 19:45:31 +0100 Subject: [PATCH 06/12] add solution to error warning challenge (fix #115) --- episodes/clean-data.Rmd | 32 ++++++++++++++++++++++++++++---- 1 file changed, 28 insertions(+), 4 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 6a7dc4c0..4d13ba03 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -405,9 +405,6 @@ linelist_data %>% - Set behavior for lost tags in a `linelist` to "error" as follows: ```{r, eval=FALSE} -# set behavior to the default option: "warning" -linelist::lost_tags_action() - # set behavior to "error" linelist::lost_tags_action(action = "error") ``` @@ -418,8 +415,20 @@ Identify: - What is the difference in the output between a `Warning` and an `Error`? - What could be the implications of this change for your daily data analysis pipeline during an outbreak response? -:::::::::::::::::::::::::::::::::::::::::::::::: +:::::::::::::::::::::::: solution + +Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. 
+ +Before you continue, set the configuration back again to the default option of `Warning`: + +```{r} +# set behavior to the default option: "warning" +linelist::lost_tags_action() +``` +:::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::: To ensure that all tagged variables are standardized and have the correct data types, use the `linelist::validate_linelist()`, as @@ -534,6 +543,21 @@ The `Error` message inform us that in order to **validate** our linelist, we mus ::::::::::::::::::::::::: +::::::::::::::::::::::::: discussion + +Have you ever experienced an unexpected change of variable type when running a lengthy analysis during an emergency response? + +What actions did you take to overcome this inconvenience? + +::::::::::::::::::::::::: + +:::::::::::::::::::::::::: instructor + +If learners do not have an experience to share, we as instructors can share one. + +An scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results. + +:::::::::::::::::::::::::: A `linelist` object resembles a data frame but offers richer features and functionalities. 
Packages that are linelist-aware can leverage these From 03b3c22c21232f03ad514d7290be081b5cbead31 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 19:52:34 +0100 Subject: [PATCH 07/12] add extra paragraph to solution about safeguarding --- episodes/clean-data.Rmd | 4 +++- 1 file changed, 3 insertions(+), 1 deletion(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 4d13ba03..f9b1ec82 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -417,7 +417,9 @@ Identify: :::::::::::::::::::::::: solution -Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. +Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. + +A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs. 
Before you continue, set the configuration back again to the default option of `Warning`: From 73fba27e5f70d365b52b0c53e726bc61554c2338 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 19:54:59 +0100 Subject: [PATCH 08/12] reorder tag, validate, safeguard before tags_df() --- episodes/clean-data.Rmd | 108 ++++++++++++++++++++-------------------- 1 file changed, 54 insertions(+), 54 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index f9b1ec82..b8672921 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -379,59 +379,6 @@ How these additional tags are visible in the output? :::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::: - -Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged -columns, you will receive an error or warning message, as shown in the example below. - -```{r, warning=TRUE} -new_df <- linelist_data %>% - dplyr::select(case_id, gender) -``` - -This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using `linelist::lost_tags_action()`. - -::::::::::::::::::::::::::::::::::::: challenge - -Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message. - -- First, run this code to count the frequency per category within a categorical variable: - -```{r,eval=FALSE} -linelist_data %>% - dplyr::select(case_id, gender) %>% - dplyr::count(gender) -``` - -- Set behavior for lost tags in a `linelist` to "error" as follows: - -```{r, eval=FALSE} -# set behavior to "error" -linelist::lost_tags_action(action = "error") -``` -- Now, re-run the above code segment with `dplyr::count()`. - -Identify: - -- What is the difference in the output between a `Warning` and an `Error`? -- What could be the implications of this change for your daily data analysis pipeline during an outbreak response? 
- -:::::::::::::::::::::::: solution - -Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. - -A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs. - -Before you continue, set the configuration back again to the default option of `Warning`: - -```{r} -# set behavior to the default option: "warning" -linelist::lost_tags_action() -``` - -:::::::::::::::::::::::: - -:::::::::::::::::::::::::::::::::::::::::::::::: - To ensure that all tagged variables are standardized and have the correct data types, use the `linelist::validate_linelist()`, as shown in the example below: @@ -561,10 +508,63 @@ An scenario like this usually happens when the institution doing the analysis is :::::::::::::::::::::::::: +Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged +columns, you will receive an error or warning message, as shown in the example below. + +```{r, warning=TRUE} +new_df <- linelist_data %>% + dplyr::select(case_id, gender) +``` + +This `Warning` message above is the default output option when we lose tags in a `linelist` object. However, it can be changed to an `Error` message using `linelist::lost_tags_action()`. + +::::::::::::::::::::::::::::::::::::: challenge + +Let's test the implications of changing the **safeguarding** configuration from a `Warning` to an `Error` message. 
+ +- First, run this code to count the frequency per category within a categorical variable: + +```{r,eval=FALSE} +linelist_data %>% + dplyr::select(case_id, gender) %>% + dplyr::count(gender) +``` + +- Set behavior for lost tags in a `linelist` to "error" as follows: + +```{r, eval=FALSE} +# set behavior to "error" +linelist::lost_tags_action(action = "error") +``` +- Now, re-run the above code segment with `dplyr::count()`. + +Identify: + +- What is the difference in the output between a `Warning` and an `Error`? +- What could be the implications of this change for your daily data analysis pipeline during an outbreak response? + +:::::::::::::::::::::::: solution + +Deciding between `Warning` or `Error` message will depend on the level of attention or flexibility you need when losing tags. One will alert you about a change but will continue running the code downstream. The other will stop your analysis pipeline and the rest will not be executed. + +A data reading, cleaning and validation script may require a more stable or fixed pipeline. An exploratory data analysis may require a more flexible approach. These two processes can be isolated in different scripts or repositories to adjust the safeguarding according to your needs. + +Before you continue, set the configuration back again to the default option of `Warning`: + +```{r} +# set behavior to the default option: "warning" +linelist::lost_tags_action() +``` + +:::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::::::::::::::: + A `linelist` object resembles a data frame but offers richer features and functionalities. Packages that are linelist-aware can leverage these -features. For example, you can extract a dataframe of only the tagged columns +features. 
For example, you can extract a data frame of only the tagged columns using the `linelist::tags_df()` function, as shown below: + ```{r, warning=FALSE} linelist::tags_df(linelist_data) ``` From cb659dbc0fde1018815dc8755d7166920260fbfa Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 19:59:38 +0100 Subject: [PATCH 09/12] add variable loss in the discussion box --- episodes/clean-data.Rmd | 15 +++++++++++++++ 1 file changed, 15 insertions(+) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index b8672921..2ec40943 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -498,6 +498,8 @@ Have you ever experienced an unexpected change of variable type when running a l What actions did you take to overcome this inconvenience? +Imagine you automated your analysis to read your date directly from source, but they remove a variable you where using. What step would be sensible to this action? + ::::::::::::::::::::::::: :::::::::::::::::::::::::: instructor @@ -506,6 +508,19 @@ If learners do not have an experience to share, we as instructors can share one. An scenario like this usually happens when the institution doing the analysis is not the same as the institution collecting the data. The later can make decisions about the data structure that can affect downstream processes, impacting the time or the accuracy of the analysis results. +About losing variables, you can suggest learners to simulate this scenario: + +```{r} +cleaned_data %>% + # simulate a change of data type in one variable + select(-age) %>% + # tag one variable + linelist::make_linelist( + age = "age" + ) +``` + + :::::::::::::::::::::::::: Safeguarding is implicitly built into the linelist objects. 
If you try to drop any of the tagged From 3c979927f1a87d8d3d2e1700b64033dcf790d968 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 20:13:35 +0100 Subject: [PATCH 10/12] fix writing of solutions and discussions --- episodes/clean-data.Rmd | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 2ec40943..99d8c0f9 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -411,7 +411,7 @@ Describe how `linelist::validate_linelist()` reacts when input data has a differ We can use `dplyr::mutate()` to change the variable type before tagging for validation. For example: -```{r} +```{r,eval=FALSE} cleaned_data %>% # simulate a change of data type in one variable dplyr::mutate(age = as.character(age)) %>% @@ -442,9 +442,11 @@ cleaned_data %>% ``` Why are we getting an `Error` message? -Should we have a `Warning` message instead? Explain why. -Now, try these additional changes to variables: + + +Explore other situations to understand this behavior. Let's try these additional changes to variables: + - `date_onset` changes from a `<date>` variable to character (`<chr>`), - `gender` changes from a character (`<chr>`) variable to integer (`<int>`). @@ -494,11 +496,9 @@ The `Error` message inform us that in order to **validate** our linelist, we mus ::::::::::::::::::::::::: discussion -Have you ever experienced an unexpected change of variable type when running a lengthy analysis during an emergency response? - -What actions did you take to overcome this inconvenience? +Have you ever experienced an unexpected change of variable type when running a lengthy analysis during an emergency response? What actions did you take to overcome this inconvenience? -Imagine you automated your analysis to read your date directly from source, but they remove a variable you where using. What step would be sensible to this action? 
+Imagine you automated your analysis to read your date directly from source, but the people in charge of the data collection decided to remove a variable you found useful. What step along the `{linelist}` workflow of tagging and validating would response to the absence of a variable? ::::::::::::::::::::::::: From 53cb9089b9da24a649acac6c1bcee79243fa705a Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 20:17:19 +0100 Subject: [PATCH 11/12] clean spaces between lines and chunks + typos --- episodes/clean-data.Rmd | 23 +++++++++++++---------- 1 file changed, 13 insertions(+), 10 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index 99d8c0f9..ffb44464 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -311,14 +311,15 @@ You can view the report using `cleanepi::print_report()` function. ## Validating and tagging case data + In outbreak analysis, once you have completed the initial steps of reading and cleaning the case data, -it's essential to establish an additional foundational layer to ensure the integrity and reliability of subsequent - analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within - your dataset, a process commonly referred to as "tagging." Additionally, it's crucial to implement measures to - validate that these tagged columns are not inadvertently deleted during further data processing steps. +it's essential to establish an additional foundation layer to ensure the integrity and reliability of subsequent +analyses. Specifically, this involves verifying the presence and correct data type of certain input columns within +your dataset, a process commonly referred to as "tagging." Additionally, it's crucial to implement measures to +validate that these tagged columns are not inadvertently deleted during further data processing steps. 
- This is achieved by converting the cleaned case data into a `linelist` object using `{linelist}` package, see the - below code chunk. +This is achieved by converting the cleaned case data into a `linelist` object using `{linelist}` package, see the +below code chunk. ```{r,warning=FALSE} library(linelist) @@ -339,6 +340,7 @@ and their acceptable data types for each using `linelist::tags_types()`. ::::::::::::::::::::::::::::::::::::: challenge + Let's **tag** more variables. In new datasets, it will be frequent to have variable names different to the available tag names. However, we can associate them based on how variables were defined for data collection. Now: @@ -350,6 +352,7 @@ Now: :::::::::::::::::::: hint Your can get access to the list of available tag names in {linelist} using: + ```{r, eval=FALSE} # Get a list of available tags by name and data types linelist::tags_types() @@ -357,7 +360,9 @@ linelist::tags_types() # Get a list of names only linelist::tags_names() ``` + ::::::::::::::::::::::: + ::::::::::::::::: solution ```{r,eval=FALSE} @@ -371,7 +376,6 @@ linelist::make_linelist( ) ``` - How these additional tags are visible in the output? @@ -386,6 +390,7 @@ shown in the example below: ```r linelist::validate_linelist(linelist_data) ``` + @@ -520,7 +525,6 @@ cleaned_data %>% ) ``` - :::::::::::::::::::::::::: Safeguarding is implicitly built into the linelist objects. If you try to drop any of the tagged @@ -586,11 +590,10 @@ linelist::tags_df(linelist_data) This allows, the extraction of use tagged-only columns in downstream analysis, which will be useful for the next episode! - - :::::::::::::::::::::::::::::::::::: callout ### When I should use `{linelist}`? + Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables). 
`{linelist}` is more appropriate for this type of ongoing or long-lasting analysis. From 754395754c498048b2d26476e914218c5a7d2761 Mon Sep 17 00:00:00 2001 From: Andree Valle Campos Date: Mon, 16 Sep 2024 20:18:33 +0100 Subject: [PATCH 12/12] fix linting issues --- episodes/clean-data.Rmd | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/episodes/clean-data.Rmd b/episodes/clean-data.Rmd index ffb44464..2903d232 100644 --- a/episodes/clean-data.Rmd +++ b/episodes/clean-data.Rmd @@ -517,12 +517,12 @@ About losing variables, you can suggest learners to simulate this scenario: ```{r} cleaned_data %>% - # simulate a change of data type in one variable - select(-age) %>% - # tag one variable - linelist::make_linelist( - age = "age" - ) + # simulate a change of data type in one variable + select(-age) %>% + # tag one variable + linelist::make_linelist( + age = "age" + ) ``` ::::::::::::::::::::::::::