diff --git a/config.yaml b/config.yaml deleted file mode 100644 index 930b95e7..00000000 --- a/config.yaml +++ /dev/null @@ -1,81 +0,0 @@ -#------------------------------------------------------------ -# Values for this lesson. -#------------------------------------------------------------ - -# Which carpentry is this (swc, dc, lc, or cp)? -# swc: Software Carpentry -# dc: Data Carpentry -# lc: Library Carpentry -# cp: Carpentries (to use for instructor training for instance) -# incubator: The Carpentries Incubator -carpentry: 'epiverse-trace' - -# Overall title for pages. -title: 'Outbreak analytics with R' - -# Date the lesson was created (YYYY-MM-DD, this is empty by default) -created: - -# Comma-separated list of keywords for the lesson -keywords: 'forecasts, epidemic models, interventions' - -# Life cycle stage of the lesson -# possible values: pre-alpha, alpha, beta, stable -life_cycle: 'pre-alpha' - -# License of the lesson materials (recommended CC-BY 4.0) -license: 'CC-BY 4.0' - -# Link to the source repository for this lesson -source: 'https://github.com/epiverse-trace/tutorials' - -# Default branch of your lesson -branch: 'main' - -# Who to contact if there are any issues -contact: 'andree.valle-campos@lshtm.ac.uk' - -# Navigation ------------------------------------------------ -# -# Use the following menu items to specify the order of -# individual pages in each dropdown section. Leave blank to -# include all pages in the folder. -# -# Example ------------- -# -# episodes: -# - introduction.md -# - first-steps.md -# -# learners: -# - setup.md -# -# instructors: -# - instructor-notes.md -# -# profiles: -# - one-learner.md -# - another-learner.md - -# Order of episodes in your lesson -episodes: -# - template.Rmd -- introduction.Rmd - -# Information for Learners -learners: - -# Information for Instructors -instructors: - -# Learner Profiles -profiles: - -# Customisation --------------------------------------------- -# -# This space below is where custom yaml items (e.g. pinning -# sandpaper and varnish versions) should live - -varnish: epiverse-trace/varnish@epiversetheme -# this is carpentries/sandpaper#533 in our fork so we can keep it up to date with main -sandpaper: epiverse-trace/sandpaper@patch-renv-github-bug diff --git a/data/linelist.rds b/data/linelist.rds new file mode 100644 index 00000000..f30a8715 Binary files /dev/null and b/data/linelist.rds differ diff --git a/delays-refresher.md b/delays-refresher.md new file mode 100644 index 00000000..6ccc6c28 --- /dev/null +++ b/delays-refresher.md @@ -0,0 +1,965 @@ +--- +title: 'Introduction to outbreak analytics' +teaching: 10 +exercises: 2 +--- + +:::::::::::::::::::::::::::::::::::::: questions + +- How to calculate the _naive_ case fatality risk (CFR)? +- How to visualize transmission from line list data? +- How to calculate delays from line list data? +- How to fit a probability distribution to delays? + +:::::::::::::::::::::::::::::::::::::::::::::::: + +::::::::::::::::::::::::::::::::::::: objectives + +- Use the pipe operator `%>%` to structure sequences of data operations left-to-right. +- Count observations in each group using `count()`. +- Pivot data from wide-to-long or long-to-wide using `pivot_*`. +- Create new columns that are functions of existing variables using `mutate()`. +- Keep or drop columns by their names using `select()`. +- Keep rows that match a condition using `filter()`. +- Extract a single column using `pull()`. +- Create graphics declaratively using `{ggplot2}`. 
+ +:::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::: instructor + +Useful concepts maps to teach this episode are + +- +- +- + +:::::::::: + +:::::::::: prereq + +**Setup an RStudio project and folder** + +- Create an RStudio project. If needed, follow this [how-to guide on "Hello RStudio Projects"](https://docs.posit.co/ide/user/ide/get-started/#hello-rstudio-projects) to create one. +- Inside the RStudio project, create the `data/` folder. +- Inside the `data/` folder, save the [linelist.rds](https://epiverse-trace.github.io/tutorials/data/linelist.rds) file. + +:::::::::: + +::::::::::::::: checklist + +**RStudio projects** + +The directory of an RStudio Project named, for example `training`, should look like this: + +``` +training/ +|__ data/ +|__ training.Rproj +``` + +**RStudio Projects** allows you to use _relative file_ paths with respect to the `R` Project, +making your code more portable and less error-prone. +Avoids using `setwd()` with _absolute paths_ +like `"C:/Users/MyName/WeirdPath/training/data/file.csv"`. + +::::::::::::::: + +::::::::::::::: challenge + +Let's starts by creating `New Quarto Document`! + +1. In the RStudio IDE, go to: File > New File > Quarto Document +2. Accept the default options +3. Save the file with the name `01-report.qmd` +4. Use the `Render` button to render the file and preview the output. + + + + + +::::::::::::::: + +## Introduction + +A new Ebola Virus Disease (EVD) outbreak has been notified in a fictional country in West Africa. The Ministry of Health is coordinating the outbreak response and has contracted you as a consultant in epidemic analysis to inform the response in real-time. The available report of cases is coming from hospital admissions. + +Let's start by loading the package `{dplyr}` to manipulate data, `{tidyr}` to rearrange it, and `{here}` to write file paths within your RStudio project. We'll use the pipe `%>%` to connect some of their functions, including others from the package `{ggplot2}`, so let's call to the package `{tidyverse}` that loads them all: + + +``` r +# Load packages +library(tidyverse) # loads dplyr, tidyr and ggplot2 +``` + +## Explore data + +For the purpose of this episode, we will read a pre-cleaned line list data. Following episodes will tackle how to solve cleaning tasks. + + +``` r +# Read data +# e.g.: if path to file is data/linelist.rds then: +cases <- read_rds( + here::here("data", "linelist.rds") +) +``` + + + +:::::::::::::::::::: checklist + +**Why should we use the {here} package?** + +The `{here}` package is designed to simplify file referencing in R projects by providing a reliable way to construct file paths relative to the project root. The main reason to use it is **Cross-Environment Compatibility**. + +It works across different operating systems (Windows, Mac, Linux) without needing to adjust file paths. + +- On Windows, paths are written using backslashes ( `\` ) as the separator between folder names: `"data\raw-data\file.csv"` +- On Unix based operating system such as macOS or Linux the forward slash ( `/` ) is used as the path separator: `"data/raw-data/file.csv"` + +The `{here}` package is ideal for adding one more layer of reproducibility to your work. 
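For example, the same `here::here()` call resolves to the correct file path on any operating system; the folder names in the comments below are hypothetical, for illustration only:

``` r
# Build a path relative to the project root, independent of the operating system
here::here("data", "raw-data", "file.csv")
#> On Windows this resolves to something like "C:/Users/MyName/training/data/raw-data/file.csv"
#> On macOS or Linux, to something like "/home/myname/training/data/raw-data/file.csv"
```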
If you are interested in reproducibility, we invite you to [read this tutorial to increase the openess, sustainability, and reproducibility of your epidemic analysis with R](https://epiverse-trace.github.io/research-compendium/) + +:::::::::::::::::::: + + +``` r +# Print line list data +cases +``` + +``` output +# A tibble: 166 × 11 + case_id generation date_of_infection date_of_onset date_of_hospitalisation + + 1 d1fafd 0 NA 2014-04-07 2014-04-17 + 2 53371b 1 2014-04-09 2014-04-15 2014-04-20 + 3 f5c3d8 1 2014-04-18 2014-04-21 2014-04-25 + 4 e66fa4 2 NA 2014-04-21 2014-05-06 + 5 49731d 0 2014-03-19 2014-04-25 2014-05-02 + 6 0f58c4 2 2014-04-22 2014-04-26 2014-04-29 + 7 6c286a 2 NA 2014-04-27 2014-04-27 + 8 881bd4 3 2014-04-26 2014-05-01 2014-05-05 + 9 d58402 2 2014-04-23 2014-05-01 2014-05-10 +10 f9149b 3 NA 2014-05-03 2014-05-04 +# ℹ 156 more rows +# ℹ 6 more variables: date_of_outcome , outcome , gender , +# hospital , lon , lat +``` + +:::::::::::::: discussion + +Take some time to look at the data and structure here. + +- Are the data and format similar to line lists that you have seen in the past? +- If you were part of the outbreak investigation team, what other information might you want to collect? + +:::::::::::::: + +:::::::::::: instructor + +The information to collect will depend on the questions we need to give a response. + +At the beginning of an outbreak, we need data to give a response to questions like: + +- How fast does an epidemic grow? +- What is the risk of death? +- How many cases can I expect in the coming days? + +Informative indicators are: + +- growth rate, reproduction number. +- case fatality risk, hospitalization fatality risk. +- projection or forecast of cases. + +Useful data are: + +- date of onset, date of death. +- delays from infection to onset, from onset to death. +- percentage of observations detected by surveillance system. +- subject characteristics to stratify the analysis by person, place, time. + +:::::::::::: + +You may notice that there are **missing** entries. + + + +::::::::::::: spoiler + +We can also explore missing data with summary a visualization: + + +``` r +cases %>% + visdat::vis_miss() +``` + + + +::::::::::::: + +:::::::::::: discussion + +Why do we have more missings on date of infection or date of outcome? + +:::::::::::: + +::::::::::::: instructor + +- date of infection: mostly unknown, depended on limited coverage of contact tracing or outbreak research, and sensitive to recall bias from subjects. +- date of outcome: reporting delay + +::::::::::::: + +::::::::::::::::::: checklist + +**The double-colon** + +The double-colon `::` in R let you call a specific function from a package without loading the entire package into the current environment. + +For example, `dplyr::filter(data, condition)` uses `filter()` from the `{dplyr}` package. + +This help us remember package functions and avoid namespace conflicts. + +::::::::::::::::::: + +## Calculate severity + +A frequent indicator for severity is the case fatality risk (CFR). + +CFR is defined as the conditional probability of death given confirmed diagnosis, calculated as the cumulative number of deaths from an infectious disease over the number of confirmed diagnosed cases. 
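Written as a formula, using the cumulative number of deaths $D$ and the cumulative number of recoveries $R$ among cases whose outcome is known, the _naive_ CFR we will compute in this episode is:

$$
\text{CFR}_{naive} = \frac{D}{D + R}
$$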
+ +We can use the function `dplyr::count()` to count the observations in each group of the variable `outcome`: + + +``` r +cases %>% + dplyr::count(outcome) +``` + +``` output +# A tibble: 3 × 2 + outcome n + +1 Death 65 +2 Recover 59 +3 42 +``` + +:::::::::::: discussion + +Report: + +- What to do with cases whose outcome is `NA`? + +- Should we consider missing outcomes to calculate the CFR? + +:::::::::::: + +:::::::::::: instructor + +CFR estimation is sensitive to: + +- **Right-censoring bias**. If we include observations with unknown final status we can underestimate the true CFR. + +- **Selection bias**. At the beginning of an outbreak, given that health systems collect most clinically severe cases, an early estimate of the CFR can overestimate the true CFR. + +:::::::::::: + +To calculate the CFR we can add more functions using the pipe `%>%` and structure sequences of data operations left-to-right. + +From the `cases` object we will use: + +- `dplyr::count()` to count the observations in each group of the variable `outcome`, +- `tidyr::pivot_wider()` to pivot the data long-to-wide with names from `outcome` and values from `n` columns, +- `cleanepi::standardize_column_names()` to standardize column names, +- `dplyr::mutate()` to create one new column `cases_known_outcome` as a function of existing variables `death` and `recover`. + + +``` r +# calculate the number of cases with known outcome +cases %>% + dplyr::count(outcome) %>% + tidyr::pivot_wider(names_from = outcome, values_from = n) %>% + cleanepi::standardize_column_names() %>% + dplyr::mutate(cases_known_outcome = death + recover) +``` + +``` output +# A tibble: 1 × 4 + death recover na cases_known_outcome + +1 65 59 42 124 +``` + +This way of writing almost look like writing a recipe! + +:::::::::::: challenge + +Calculate the CFR as the division of the number of **deaths** among **known outcomes**. Do this by adding one more pipe `%>%` in the last code chunk. + +Report: + +- What is the value of the CFR? + +:::::::::::: hint + +You can use the column names of variables to create one more column: + + +``` r +# calculate the naive CFR +cases %>% + count(outcome) %>% + pivot_wider(names_from = outcome, values_from = n) %>% + cleanepi::standardize_column_names() %>% + mutate(cases_known_outcome = death + recover) %>% + mutate(cfr = ... / ...) # replace the ... spaces +``` + +:::::::::::: + +:::::::::::: solution + + +``` r +# calculate the naive CFR +cases %>% + dplyr::count(outcome) %>% + tidyr::pivot_wider(names_from = outcome, values_from = n) %>% + cleanepi::standardize_column_names() %>% + dplyr::mutate(cases_known_outcome = death + recover) %>% + dplyr::mutate(cfr = death / cases_known_outcome) +``` + +``` output +# A tibble: 1 × 5 + death recover na cases_known_outcome cfr + +1 65 59 42 124 0.524 +``` + +This calculation is _naive_ because it tends to yield a biased and mostly underestimated CFR due to the time-delay from onset to death, only stabilising at the later stages of the outbreak. + +Now, as a comparison, how much a CFR estimate changes if we include unknown outcomes in the denominator? 
+ +:::::::::::: + +:::::::::::: solution + + +``` r +# underestimate the naive CFR +cases %>% + dplyr::count(outcome) %>% + tidyr::pivot_wider(names_from = outcome, values_from = n) %>% + cleanepi::standardize_column_names() %>% + dplyr::mutate(cfr = death / (death + recover + na)) +``` + +``` output +# A tibble: 1 × 4 + death recover na cfr + +1 65 59 42 0.392 +``` + +Due to **right-censoring bias**, if we include observations with unknown final status we can underestimate the true CFR. + +:::::::::::: + +:::::::::::: + +Data today will not include outcomes from patients that are still hospitalised. Then, one relevant question to ask is: In average, how much time it would take to know the outcomes of hospitalised cases? For this we can calculate **delays**! + +## Calculate delays + +The time between sequence of dated events can vary between subjects. For example, we would expect the date of infection to always be before the date of symptom onset, and the later always before the date of hospitalization. + +In a random sample of 30 observations from the `cases` data frame we observe variability between the date of hospitalization and date of outcome: + + + +We can calculate the average time from hospitalisation to outcome from the line list. + +From the `cases` object we will use: + +- `dplyr::select()` to keep columns using their names, +- `dplyr::mutate()` to create one new column `outcome_delay` as a function of existing variables `date_of_outcome` and `date_of_hospitalisation`, +- `dplyr::filter()` to keep the rows that match a condition like `outcome_delay > 0`, +- `skimr::skim()` to get useful summary statistics + + +``` r +# delay from report to outcome +cases %>% + dplyr::select(case_id, date_of_hospitalisation, date_of_outcome) %>% + dplyr::mutate(outcome_delay = date_of_outcome - date_of_hospitalisation) %>% + dplyr::filter(outcome_delay > 0) %>% + skimr::skim(outcome_delay) +``` + + +Table: Data summary + +| | | +|:------------------------|:----------| +|Name |Piped data | +|Number of rows |123 | +|Number of columns |4 | +|_______________________ | | +|Column type frequency: | | +|difftime |1 | +|________________________ | | +|Group variables |None | + + +**Variable type: difftime** + +|skim_variable | n_missing| complete_rate|min |max |median | n_unique| +|:-------------|---------:|-------------:|:------|:-------|:------|--------:| +|outcome_delay | 0| 1|1 days |55 days |7 days | 29| + +::::::::::::::::: callout + +**Inconsistencies among sequence of dated-events?** + +Wait! Is it consistent to have negative time delays from primary to secondary observations, i.e., from hospitalisation to death? + +In the next episode called **Clean data** we will learn how to check sequence of dated-events and other frequent and challenging inconsistencies! + +::::::::::::::::: + +::::::::::::::::: challenge + +To calculate a _delay-adjusted_ CFR, we need to assume a known delay from onset to death. + +Using the `cases` object: + +- Calculate the summary statistics of the delay from onset to death. + +::::::::::::: hint + +Keep the rows that match a condition like `outcome == "Death"`: + + +``` r +# delay from onset to death +cases %>% + dplyr::filter(outcome == "Death") %>% + ...() # replace ... with downstream code +``` + +Is it consistent to have negative delays from onset of symptoms to death? 
+ +::::::::::::: + +::::::::::::: solution + + +``` r +# delay from onset to death +cases %>% + dplyr::select(case_id, date_of_onset, date_of_outcome, outcome) %>% + dplyr::filter(outcome == "Death") %>% + dplyr::mutate(delay_onset_death = date_of_outcome - date_of_onset) %>% + dplyr::filter(delay_onset_death > 0) %>% + skimr::skim(delay_onset_death) +``` + + +Table: Data summary + +| | | +|:------------------------|:----------| +|Name |Piped data | +|Number of rows |46 | +|Number of columns |5 | +|_______________________ | | +|Column type frequency: | | +|difftime |1 | +|________________________ | | +|Group variables |None | + + +**Variable type: difftime** + +|skim_variable | n_missing| complete_rate|min |max |median | n_unique| +|:-----------------|---------:|-------------:|:------|:-------|:------|--------:| +|delay_onset_death | 0| 1|2 days |17 days |7 days | 15| + +Where is the source of the inconsistency? Let's say you want to keep the rows with negative delay values to investigate them. How would you do it? + +::::::::::::: + +:::::::::::: solution + +We can use `dplyr::filter()` again to identify the inconsistent observations: + + +``` r +# keep negative delays +cases %>% + dplyr::select(case_id, date_of_onset, date_of_outcome, outcome) %>% + dplyr::filter(outcome == "Death") %>% + dplyr::mutate(delay_onset_death = date_of_outcome - date_of_onset) %>% + dplyr::filter(delay_onset_death < 0) +``` + +``` output +# A tibble: 6 × 5 + case_id date_of_onset date_of_outcome outcome delay_onset_death + +1 c43190 2014-05-06 2014-04-26 Death -10 days +2 bd8c0e 2014-05-17 2014-05-09 Death -8 days +3 49d786 2014-05-20 2014-05-11 Death -9 days +4 e85785 2014-06-03 2014-06-02 Death -1 days +5 60f0c2 2014-06-07 2014-05-30 Death -8 days +6 a48f5d 2014-06-15 2014-06-10 Death -5 days +``` + +More on estimating a _delay-adjusted_ CFR on the episode about **Estimating outbreak severity**! + +:::::::::::: + +::::::::::::::::: + + +## Visualize transmission + +The first question we want to know is simply: how bad is it? The first step of the analysis is descriptive. We want to draw an epidemic curve or epicurve. This visualises the incidence over time by date of symptom onset. + +From the `cases` object we will use: + +- `ggplot()` to declare the input data frame, +- `aes()` for the variable `date_of_onset` to map to `geoms`, +- `geom_histogram()` to visualise the distribution of a single continuous variable with a `binwidth` equal to 7 days. + + +``` r +# incidence curve +cases %>% + ggplot(aes(x = date_of_onset)) + + geom_histogram(binwidth = 7) +``` + + + +:::::::::::: discussion + +The early phase of an outbreak usually growths exponentially. + +- Why exponential growth may not be observed in the most recent weeks? + +:::::::::::: + +::::::::::: instructor + +Close inspection of the line list shows that the last date of any entry (by date of hospitalization) is a bit later than the last date of symptom onset. + +From the `cases` object we can use: + +- `dplyr::summarise()` to summarise each group down to one row, +- `base::max` to calculate the maximum dates of onset and hospitalisation. 
+ + +``` r +cases %>% + dplyr::summarise( + max_onset = max(date_of_onset), + max_hospital = max(date_of_hospitalisation) + ) +``` + +``` output +# A tibble: 1 × 2 + max_onset max_hospital + +1 2014-06-27 2014-07-07 +``` + +::::::::::: + +You may want to examine how long after onset of symptoms cases are hospitalised; this may inform the **reporting delay** from this line list data: + + +``` r +# reporting delay +cases %>% + dplyr::select(case_id, date_of_onset, date_of_hospitalisation) %>% + dplyr::mutate(reporting_delay = date_of_hospitalisation - date_of_onset) %>% + ggplot(aes(x = reporting_delay)) + + geom_histogram(binwidth = 1) +``` + + + +The distribution of the reporting delay in day units is heavily skewed. Symptomatic cases may take up to **two weeks** to be reported. + +From reports (hospitalisations) in the most recent two weeks, we completed the exponential growth trend of incidence cases within the last four weeks: + + + +Given to reporting delays during this outbreak, two weeks ago it seemed that we had a three-week decay of cases. We needed to wait a couple of weeks to complete the incidence of cases on each week. + +:::::::::::::: challenge + +Report: + +- What indicator can we use to estimate transmission from the incidence curve? + +::::::::: solution + +- The growth rate! by fitting a linear model. +- The reproduction number accounting for delays from secondary observations to infection. + +More on this topic on episodes about **Aggregate and visualize** and **Quantifying transmission**. + + + + +``` output +# A tibble: 2 × 6 + count_variable term estimate std.error statistic p.value + +1 date_of_onset (Intercept) -537. 70.1 -7.67 1.76e-14 +2 date_of_onset date_index 0.233 0.0302 7.71 1.29e-14 +``` + + +::::::::: + +:::::::::::::: + +Lastly, in order to account for these _epidemiological delays_ when estimating indicators of severity or transmission, in our analysis we need to input delays as **Probability Distributions**! + +## Fit a probability distribution to delays + +::::::::::::::::::: prereq + +**Watch** one 5-minute video refresher on probability distributions: + +- StatQuest with Josh Starmer (2017) +**The Main Ideas behind Probability Distributions**, YouTube. +Available at: + + +::::::::::::::::::: + +We fit a probability distribution to data (like delays) to make inferences about it. These inferences can be useful for Public health interventions and decision making. For example, from the [incubation period](reference.md#incubation) distribution we can inform the length of active monitoring or quarantine by inferring the time by which 99% of infected individuals are expected to show symptoms. + +:::::::::::: checklist + +### Functions for the Normal distribution + +If you need it, read in detail about the [R probability functions for the normal distribution](https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture13.htm#probfunc), each of its definitions and identify in which part of a distribution they are located! + +![The four probability functions for the normal distribution ([Jack Weiss, 2012](https://sakai.unc.edu/access/content/group/3d1eb92e-7848-4f55-90c3-7c72a54e7e43/public/docs/lectures/lecture13.htm#probfunc))](fig/fig5a-normaldistribution.png) + +:::::::::::::::::::: + +If you look at `?stats::Distributions`, each type of distribution has a unique set of functions and different parameters. 
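For instance, the normal distribution comes with the quartet `dnorm()`, `pnorm()`, `qnorm()` and `rnorm()`, which all share the parameters `mean` and `sd`. The values below are arbitrary and only meant to illustrate the role of each function:

``` r
# density of a Normal(mean = 10, sd = 3) evaluated at x = 7
dnorm(x = 7, mean = 10, sd = 3)

# cumulative probability P(X <= 7)
pnorm(q = 7, mean = 10, sd = 3)

# quantile below which 99% of the distribution lies
qnorm(p = 0.99, mean = 10, sd = 3)

# five random draws from the distribution
rnorm(n = 5, mean = 10, sd = 3)
```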
To relate distribution functions and their parameters, we suggest exploring a shiny app called **The Distribution Zoo**: + +From the `cases` object we can use: + +- `dplyr::mutate()` to transform the `reporting_delay` class object from `