diff --git a/clean-data.md b/clean-data.md index 0c8902bb..da67ca3d 100644 --- a/clean-data.md +++ b/clean-data.md @@ -71,7 +71,7 @@ raw_ebola_data <- rio::import( ``` r -# Return first five rows +# Print data frame raw_ebola_data ``` @@ -92,6 +92,36 @@ raw_ebola_data # ℹ 14,990 more rows ``` +::::::::::::::::: discussion + +Let's **diagnose** the data frame. List all the characteristics in the data frame above that are problematic for data analysis. + +Are any of those characteristics familiar with any previous data analysis you performed? + +:::::::::::::::::::::::::::: + +::::::::::::::::::: instructor + +Mediate a short discussion to relate the diagnosed characteristic with required cleaning operations. + +You can use these terms to **diagnose characteristics**: + +- *Codification*, like sex and age entries using numbers, letters, and words. Also dates in different arrangement ("dd/mm/yyyy" or "yyyy/mm/dd") and formats. Less visible, but also the column names. +- *Missing*, how to interpret an entry like "" in status or "-99" in another column? do we have a data dictionary from the data collection process? +- *Inconsistencies*, like having a date of sample before the date of onset. +- *Non-plausible values*, like outlier observations with dates outside of an expected timeframe. +- *Duplicates*, are all observations unique? + +You can use these terms to relate to **cleaning operations**: + +- Standardize column name +- Standardize categorical variables like sex/gender +- Standardize date columns +- Convert from character to numeric values +- Check the sequence of dated events + +:::::::::::::::::::::::::::::: + ## A quick inspection Quick exploration and inspection of the dataset are crucial before diving into any analysis tasks. The `{cleanepi}` package simplifies this process with the `scan_data()` function. 
Let's take a look at how you can use it: @@ -142,17 +172,23 @@ names(sim_ebola_data) :::::::::::::::::::::::::::::::::::::::::::::::: -If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` parameter of the `standardize_column_names()` function. This parameter accepts a vector of column names that are intended to be kept unchanged. +If you want to maintain certain column names without subjecting them to the standardization process, you can utilize the `keep` argument of the function `cleanepi::standardize_column_names()`. This argument accepts a vector of column names that are intended to be kept unchanged. ::::::::::::::::::::::::::::::::::::: challenge -Standardize the column names of the input dataset, but keep the “V1” column as it is. +Standardize the column names of the input dataset, but keep the first column name as it is. + +::::::::::::::::: hint + +You can try `cleanepi::standardize_column_names(data = raw_ebola_data, keep = "V1")` + +:::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::::::::::::: ### Removing irregularities -Raw data may contain irregularities such as duplicated and empty rows and columns, as well as constant columns. `remove_duplicates` and `remove_constants` functions from `{cleanepi}` remove such irregularities as demonstrated in the below code chunk. +Raw data may contain irregularities such as **duplicated** rows, **empty** rows and columns, or **constant** columns (where all entries have the same value). Functions from `{cleanepi}` like `remove_duplicates()` and `remove_constants()` remove such irregularities, as demonstrated in the code chunk below. 
``` r @@ -176,7 +212,7 @@ sim_ebola_data <- cleanepi::replace_missing_values( ### Validating subject IDs -Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, containing a specific number of characters. The `{cleanepi}` package offers the `check_subject_ids` function designed precisely for this task as shown in the below code chunk. This function validates whether they are unique and meet the required criteria. +Each entry in the dataset represents a subject and should be distinguishable by a specific column formatted in a particular way, such as falling within a specified range, containing certain prefixes and/or suffixes, containing a specific number of characters. The `{cleanepi}` package offers the function `check_subject_ids()` designed precisely for this task as shown in the below code chunk. This function validates whether they are unique and meet the required criteria. @@ -200,9 +236,19 @@ Use the correct_subject_ids() function to adjust them. Note that our simulated dataset does contain duplicated subject IDS. +::::::::::::::::: spoiler + +### How to correct the subject IDs? + +Let's print a preliminary report with `cleanepi::print_report(sim_ebola_data)`. Focus on the "Unexpected subject ids" tab to identify what IDs require an extra treatment. + +After finishing this tutorial, we invite you to explore the package reference guide of `{cleanepi}` to find the function that can fix this situation. + +::::::::::::::::::::::::: + ### Standardizing dates -Certainly an epidemic dataset contains date columns for different events, such as the date of infection, date of symptoms onset, ..etc, and these dates can come in different date forms, and it good practice to unify them. 
The `{cleanepi}` package provides functionality for converting date columns in epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset: +Certainly, an epidemic dataset contains date columns for different events, such as the date of infection, date of symptoms onset, etc. These dates can come in different date forms, and it is good practice to standardize them. The `{cleanepi}` package provides functionality for converting date columns of epidemic datasets into ISO format, ensuring consistency across the different date columns. Here's how you can use it on our simulated dataset: ``` r @@ -239,7 +285,7 @@ This function coverts the values in the target columns, or will automatically fi ### Converting to numeric values In the raw dataset, some column can come with mixture of character and numerical values, and you want to covert the character values explicitly into numeric. For example, in our simulated data set, in the age column some entries are written in words. -The `convert_to_numeric()` function in `{cleanepi}` does such conversion as illustrated in the below code chunk. +In `{cleanepi}` the function `convert_to_numeric()` does such conversion as illustrated in the below code chunk. ``` r sim_ebola_data <- cleanepi::convert_to_numeric(sim_ebola_data, @@ -266,6 +312,14 @@ sim_ebola_data # ℹ 14,990 more rows ``` +::::::::::::::::: callout + +### Multiple language support + +Thanks to the `{numberize}` package, we can convert numbers written as English, French or Spanish words to positive integer values! 
+ +::::::::::::::::::::::::: + ## Epidemiology related operations In addition to common data cleansing tasks, such as those discussed in the above section, the `{cleanepi}` package offers @@ -278,7 +332,7 @@ Ensuring the correct order and sequence of dated events is crucial in epidemiolo when analyzing infectious diseases where the timing of events like symptom onset and sample collection is essential. The `{cleanepi}` package provides a helpful function called `check_date_sequence()` precisely for this purpose. -Here's an example code chunk demonstrating the usage of `check_date_sequence()` function in our simulated Ebola dataset +Here's an example code chunk demonstrating the usage of the function `check_date_sequence()` in our simulated Ebola dataset ``` r @@ -289,7 +343,16 @@ sim_ebola_data <- cleanepi::check_date_sequence( ``` This functionality is crucial for ensuring data integrity and accuracy in epidemiological analyses, as it helps identify -any inconsistencies or errors in the chronological order of events, allowing yor to address them appropriately. +any inconsistencies or errors in the chronological order of events, allowing you to address them appropriately. + +::::::::::::::::: spoiler + +### What are the incorrect date sequences? + +Let's print another preliminary report with `cleanepi::print_report(sim_ebola_data)`. Focus on the "Incorrect date sequence" tab to identify what IDs had this issue. 
+ +::::::::::::::::::::::::: + ### Dictionary-based substitution @@ -303,18 +366,22 @@ Moreover, `{cleanepi}` provides a built-in dictionary specifically tailored for ``` r test_dict <- base::readRDS( system.file("extdata", "test_dict.RDS", package = "cleanepi") -) -base::print(test_dict) +) %>% + dplyr::as_tibble() # for a simple data frame output + +test_dict ``` ``` output - options values grp orders -1 1 male gender 1 -2 2 female gender 2 -3 M male gender 3 -4 F female gender 4 -5 m male gender 5 -6 f female gender 6 +# A tibble: 6 × 4 + options values grp orders + +1 1 male gender 1 +2 2 female gender 2 +3 M male gender 3 +4 F female gender 4 +5 m male gender 5 +6 f female gender 6 ``` Now, we can use this dictionary to standardize values of the the “gender” column according to predefined categories. Below is an example code chunk demonstrating how to utilize this functionality: @@ -348,13 +415,39 @@ sim_ebola_data This approach simplifies the data cleaning process, ensuring that categorical data in epidemiological datasets is accurately categorized and ready for further analysis. -> Note that, when the column in the dataset contains values that are not in the dictionary, the clean_using_dictionary() will raise an error. Users can use the cleanepi::add_to_dictionary() function to include the missing value into the dictionary. See the corresponding section in the package [vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html) for more details. + +:::::::::::::::::::::::::: spoiler + +### How to create your own data dictionary? + +Note that, when the column in the dataset contains values that are not in the dictionary, the function `cleanepi::clean_using_dictionary()` will raise an error. + +You can use the function `cleanepi::add_to_dictionary()` to include the missing value in the dictionary. 
You can start a custom dictionary with a data frame: + +- `new_dictionary <- tibble::tibble(` +- ` options = "0",` +- ` values = "female",` +- ` grp = "sex",` +- ` orders = 1L` +- `) %>% ` +- ` cleanepi::add_to_dictionary(` +- ` option = "1",` +- ` value = "male",` +- ` grp = "sex",` +- ` order = NULL` +- ` )` + +You can read more details in the section about "Dictionary-based data substituting" in the package ["Get started" vignette](https://epiverse-trace.github.io/cleanepi/articles/cleanepi.html#dictionary-based-data-substituting). + +:::::::::::::::::::::::::: + ### Calculating time span between different date events -In epidemiological data analysis it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak or the duration between sample collection and analysis. -The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. For example, the below code snippet utilizes the `span()` function to compute the time elapsed since the date of sample for the case identified - until the date this document was generated (2024-09-24). +In epidemiological data analysis, it is also useful to track and analyze time-dependent events, such as the progression of a disease outbreak (i.e., the time difference between today and the first case reported) or the duration between sample collection and analysis (i.e., the time difference between today and the sample collection). The most common example is to calculate the age of all the subjects given their date of birth (i.e., the time difference between today and the date of birth). + +The `{cleanepi}` package offers a convenient function for calculating the time elapsed between two dated events at different time scales. 
For example, the below code snippet utilizes the function `cleanepi::timespan()` to compute the time elapsed since the date of sample for the case identified + until the date this document was generated (2024-09-30). ``` r @@ -367,28 +460,25 @@ sim_ebola_data <- cleanepi::timespan( span_remainder_unit = "months" ) -sim_ebola_data +sim_ebola_data %>% + dplyr::glimpse() ``` ``` output -# A tibble: 15,000 × 9 - v_1 case_id age gender status date_onset date_sample - - 1 1 14905 90 male confirmed 2015-03-15 2015-04-06 - 2 2 13043 25 female NA 2014-01-03 - 3 3 14364 54 female 2014-02-09 2015-03-03 - 4 4 14675 90 2014-10-19 2014-12-31 - 5 5 12648 74 female 2014-06-08 2016-10-10 - 6 6 14274 76 female NA 2016-01-23 - 7 7 14132 16 male confirmed NA 2015-10-05 - 8 8 14715 44 female confirmed NA 2016-04-24 - 9 9 13435 26 male 2014-07-09 2014-09-20 -10 10 14816 30 female 2015-06-29 2015-02-06 -# ℹ 14,990 more rows -# ℹ 2 more variables: time_since_sampling_date , remainder_months +Rows: 15,000 +Columns: 9 +$ v_1 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14… +$ case_id "14905", "13043", "14364", "14675", "12648", … +$ age 90, 25, 54, 90, 74, 76, 16, 44, 26, 30, 49, 4… +$ gender "male", "female", "female", NA, "female", "fe… +$ status "confirmed", NA, NA, NA, NA, NA, "confirmed",… +$ date_onset 2015-03-15, NA, 2014-02-09, 2014-10-19, 2014… +$ date_sample 2015-04-06, 2014-01-03, 2015-03-03, 2014-12-… +$ time_since_sampling_date 9, 10, 9, 9, 7, 8, 8, 8, 10, 9, 8, 9, 8, 9, 9… +$ remainder_months 5, 8, 6, 8, 11, 8, 11, 5, 0, 7, 3, 4, 3, 1, 8… ``` -After executing the `span()` function, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months. 
+After executing the function `cleanepi::timespan()`, two new columns named `time_since_sampling_date` and `remainder_months` are added to the **sim_ebola_data** dataset, containing the calculated time elapsed since the date of sampling for each case, measured in years, and the remaining time measured in months. ## Multiple operations at once @@ -404,19 +494,19 @@ Further more, you can combine multiple data cleaning tasks via the pipe operator cleaned_data <- raw_ebola_data %>% cleanepi::standardize_column_names() %>% cleanepi::replace_missing_values(na_strings = "") %>% - cleanepi::remove_constants(cutoff = 1.0) %>% - cleanepi::remove_duplicates(target_columns = NULL) %>% + cleanepi::remove_constants() %>% + cleanepi::remove_duplicates() %>% cleanepi::standardize_dates( - target_columns = c("date_onset", "date_sample"), - error_tolerance = 0.4, - format = NULL, - timeframe = NULL + target_columns = c("date_onset", "date_sample") ) %>% cleanepi::check_subject_ids( target_columns = "case_id", range = c(1, 15000) ) %>% cleanepi::convert_to_numeric(target_columns = "age") %>% + cleanepi::check_date_sequence( + target_columns = c("date_onset", "date_sample") + ) %>% cleanepi::clean_using_dictionary(dictionary = test_dict) ``` @@ -429,6 +519,59 @@ Warning: Detected incorrect subject ids at lines: Use the correct_subject_ids() function to adjust them. 
``` +``` warning +Warning: Detected 676 incorrect date sequences at line(s): 10, 20, 24, 26, 27, +29, 39, 44, 46, 54, 59, 60, 62, 63, 65, 70, 73, 78, 81, 85, 88, 90, 94, 99, +101, 103, 104, 105, 106, 107, 110, 113, 117, 122, 126, 127, 137, 138, 142, 152, +158, 159, 168, 174, 177, 182, 187, 191, 195, 197, 200, 204, 208, 224, 230, 242, +243, 246, 265, 270, 282, 287, 289, 290, 296, 309, 321, 330, 332, 333, 339, 343, +344, 347, 353, 355, 357, 371, 375, 378, 380, 381, 386, 388, 391, 392, 398, 399, +403, 406, 410, 412, 435, 444, 453, 454, 456, 461, 463, 466, 471, 473, 478, 483, +484, 485, 491, 492, 495, 499, 500, 507, 508, 509, 511, 527, 530, 533, 534, 537, +543, 545, 563, 565, 568, 575, 576, 586, 587, 590, 595, 600, 602, 604, 609, 613, +614, 615, 622, 630, 646, 650, 652, 657, 661, 662, 664, 672, 675, 678, 680, 683, +684, 691, 696, 701, 705, 707, 709, 710, 711, 721, 724, 731, 732, 735, 740, 746, +749, 757, 759, 764, 781, 784, 789, 792, 793, 795, 798, 803, 808, 812, 813, 814, +817, 819, 820, 821, 822, 824, 828, 833, 838, 841, 843, 844, 847, 849, 851, 864, +872, 874, 875, 878, 879, 886, 889, 895, 900, 901, 903, 910, 923, 924, 940, 942, +944, 945, 947, 952, 953, 955, 960, 961, 963, 968, 979, 982, 992, 1005, 1009, +1012, 1030, 1040, 1045, 1052, 1055, 1072, 1083, 1086, 1093, 1094, 1095, 1099, +1100, 1108, 1110, 1111, 1116, 1117, 1123, 1128, 1132, 1133, 1135, 1138, 1142, +1157, 1161, 1166, 1170, 1172, 1190, 1203, 1205, 1211, 1214, 1217, 1218, 1220, +1230, 1233, 1240, 1268, 1278, 1279, 1281, 1293, 1295, 1299, 1306, 1307, 1309, +1311, 1313, 1317, 1319, 1324, 1325, 1335, 1346, 1348, 1349, 1350, 1357, 1360, +1362, 1363, 1371, 1379, 1380, 1384, 1386, 1388, 1396, 1399, 1405, 1406, 1408, +1411, 1415, 1420, 1431, 1434, 1438, 1448, 1453, 1461, 1472, 1476, 1480, 1481, +1501, 1505, 1506, 1511, 1523, 1531, 1536, 1542, 1545, 1547, 1550, 1551, 1553, +1554, 1573, 1579, 1580, 1581, 1587, 1588, 1589, 1592, 1596, 1598, 1600, 1601, +1603, 1609, 1611, 1612, 1618, 1621, 1623, 1628, 1629, 1631, 
1632, 1635, 1638, +1642, 1643, 1648, 1650, 1655, 1659, 1660, 1663, 1667, 1670, 1672, 1676, 1679, +1681, 1686, 1687, 1689, 1690, 1691, 1695, 1698, 1699, 1707, 1711, 1712, 1713, +1714, 1720, 1727, 1729, 1748, 1750, 1751, 1772, 1776, 1784, 1795, 1799, 1803, +1805, 1807, 1809, 1810, 1812, 1821, 1827, 1829, 1830, 1837, 1844, 1846, 1860, +1878, 1879, 1880, 1891, 1897, 1899, 1915, 1917, 1926, 1927, 1931, 1936, 1945, +1946, 1949, 1951, 1954, 1957, 1959, 1962, 1966, 1970, 1972, 1986, 1989, 1992, +1996, 1998, 2023, 2025, 2028, 2029, 2035, 2036, 2037, 2039, 2045, 2047, 2052, +2054, 2059, 2060, 2064, 2074, 2075, 2078, 2081, 2082, 2084, 2087, 2094, 2095, +2100, 2112, 2114, 2122, 2128, 2131, 2133, 2137, 2146, 2158, 2167, 2171, 2178, +2181, 2188, 2189, 2194, 2195, 2209, 2210, 2226, 2231, 2237, 2240, 2242, 2250, +2254, 2257, 2261, 2263, 2264, 2265, 2271, 2272, 2277, 2278, 2282, 2287, 2297, +2299, 2311, 2313, 2316, 2317, 2328, 2329, 2330, 2333, 2337, 2338, 2343, 2347, +2350, 2353, 2362, 2370, 2371, 2373, 2374, 2379, 2382, 2383, 2391, 2394, 2399, +2405, 2409, 2412, 2413, 2421, 2427, 2428, 2430, 2432, 2433, 2441, 2444, 2445, +2449, 2451, 2452, 2453, 2461, 2462, 2479, 2480, 2482, 2483, 2491, 2496, 2515, +2519, 2526, 2532, 2535, 2538, 2540, 2542, 2547, 2548, 2550, 2558, 2562, 2563, +2572, 2583, 2594, 2595, 2604, 2616, 2617, 2627, 2640, 2641, 2645, 2647, 2655, +2666, 2680, 2686, 2690, 2695, 2700, 2704, 2705, 2711, 2720, 2721, 2723, 2735, +2741, 2742, 2745, 2746, 2749, 2765, 2767, 2768, 2788, 2792, 2797, 2801, 2807, +2811, 2828, 2829, 2830, 2838, 2839, 2847, 2848, 2849, 2855, 2863, 2865, 2869, +2872, 2889, 2890, 2900, 2901, 2906, 2907, 2921, 2922, 2923, 2926, 2927, 2932, +2936, 2939, 2940, 2942, 2944, 2945, 2954, 2955, 2961, 2962, 2965, 2973, 2975, +2984, 2988, 2992, 2994, 3001, 3002, 3009, 3012, 3013, 3015, 3020, 3021, 3035, +3036, 3041, 3046, 3053, 3058, 3066, 3069, 3071, 3073, 3076, 3077, 3078, 3080, +3087, 3093, 3096 +``` + ## Printing the clean report The `{cleanepi}` package 
generates a comprehensive report detailing the findings and actions of all data cleansing @@ -437,7 +580,7 @@ corresponds to a specific data cleansing operation, and clicking on each section that particular operation. This interactive approach enables users to efficiently review and analyze the outcomes of individual cleansing steps within the broader data cleansing process. -You can view the report using `cleanepi::print_report()` function. +You can view the report using the function `cleanepi::print_report(cleaned_data)`.

@@ -564,8 +707,7 @@ linelist::validate_linelist(linelist_data) Let's **validate** tagged variables. Let's simulate that in an ongoing outbreak; the next day, your data has a new set of entries (i.e., rows or observations) but one variable change of data type. -For example, the variable: -- `age` changes of type from a double (`<dbl>`) variable to character (`<chr>`), +For example, let's make the variable `age` change of type from a double (`<dbl>`) variable to character (`<chr>`), To simulate it: @@ -803,7 +945,7 @@ This allows, the extraction of use tagged-only columns in downstream analysis, w :::::::::::::::::::::::::::::::::::: callout -### When I should use `{linelist}`? +### When should I use `{linelist}`? Data analysis during an outbreak response or mass-gathering surveillance demands a different set of "data safeguards" if compared to usual research situations. For example, your data will change or be updated over time (e.g. new entries, new variables, renamed variables). @@ -817,7 +959,7 @@ Check the "Get started" vignette section about ::::::::::::::::::::::::::::::::::::: keypoints - Use `{cleanepi}` package to clean and standardize epidemic and outbreak data -- Use `{linelist}` to tagg, validate, and prepare case data for downstream analysis. +- Use `{linelist}` to tag, validate, and prepare case data for downstream analysis. 
:::::::::::::::::::::::::::::::::::::::::::::::: diff --git a/md5sum.txt b/md5sum.txt index 94ad6177..bcfb96e0 100644 --- a/md5sum.txt +++ b/md5sum.txt @@ -5,7 +5,7 @@ "index.md" "32bc80d6f4816435cc0e01540cb2a513" "site/built/index.md" "2024-09-24" "links.md" "fe82d0a436c46f4b07b82684ed2cceaf" "site/built/links.md" "2024-09-24" "episodes/read-cases.Rmd" "fe84511fc9f9e53a32e97eaddd50085e" "site/built/read-cases.md" "2024-09-24" -"episodes/clean-data.Rmd" "ae5437ea0dd82c262a6c7a20bb0d613d" "site/built/clean-data.md" "2024-09-24" +"episodes/clean-data.Rmd" "5a6377b6b7135ecb3e99981538ba12ab" "site/built/clean-data.md" "2024-09-30" "episodes/describe-cases.Rmd" "b4db44af62d6e22bff8775de52c93642" "site/built/describe-cases.md" "2024-09-24" "instructors/instructor-notes.md" "ca3834a1b0f9e70c4702aa7a367a6bb5" "site/built/instructor-notes.md" "2024-09-24" "learners/reference.md" "106717912e909a7c8d9e3e8fea48e17d" "site/built/reference.md" "2024-09-24"