-
Notifications
You must be signed in to change notification settings - Fork 0
/
compile_avoidable_death_codes_2020.Rmd
202 lines (154 loc) · 9.04 KB
/
compile_avoidable_death_codes_2020.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
---
title: "Compiling avoidable deaths codes (as coded by NRS), 2020"
author: "Jan Savinc"
date: '`r format(Sys.Date(), "%B %d, %Y")`'
output:
html_document:
code_folding: hide
toc: true
toc_float: true
editor_options:
chunk_output_type: console
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```
# Introduction
NRS publish an annual report on the number of avoidable/treatable deaths. The latest report is from March 2022 and reports on data from 2020, available on [the NRS website](https://www.nrscotland.gov.uk/statistics-and-data/statistics/statistics-by-theme/vital-events/deaths/avoidable-mortality).
The data used in the report can be downloaded in .csv and .xlsx format - I'll extract the code list for avoidable deaths from the excel file (.xlsx), sheet `Table A`: [https://www.nrscotland.gov.uk/files//statistics/avoidable-mortality/2020/avoid-mortality-20-all-tabs.xlsx](https://www.nrscotland.gov.uk/files//statistics/avoidable-mortality/2020/avoid-mortality-20-all-tabs.xlsx)
The main change is that the NRS has joined efforts with ONS and adopted the 'Joint OECD/Eurostat' definition of avoidable mortality, including codelists. These include Covid-19, also!
# Load packages
```{r}
library(tidyverse)
library(readxl) # for loading excel files
library(janitor) # for cleaning data
library(knitr) # for pretty tables in the report
```
# Download file
```{r}
url_avoidable_deaths_2020 <- "https://www.nrscotland.gov.uk/files//statistics/avoidable-mortality/2020/avoid-mortality-20-all-tabs.xlsx"
path_avoidable_deaths_2020 <- file.path("./downloaded_data", basename(url_avoidable_deaths_2020))
if (!file.exists(path_avoidable_deaths_2020)) download.file(url = url_avoidable_deaths_2020, destfile = path_avoidable_deaths_2020, mode = "wb")
```
# Load definitions
The excel sheet containing definitions is helpfully named 'Definitions':
```{r}
avoidable_deaths_raw <- read_excel(path = path_avoidable_deaths_2020, sheet = "Table A", range = "A4:E115") %>%
clean_names
```
## Inclusions and exclusions
The column `ICD-10 codes` lists codes in the following formats:
* most codes are listed as comma-separated values (e.g. `A35, A36, A80`) - these can be easily 'split' into several rows with one value per row
* some codes are listed as ranges (using dash/hyphen, e.g. `A00-A09`) - these need to be parsed. This can be done in a 'dumb' manner without reference to the specifics of ICD-10 where not all codes in such a range exist, because invalid codes simply won't match actual data
* some codes are listed with exclusions, which are always listed in parentheses (e.g. `A40 (excl. A40.3), A41 (excl. A41.3)`). This could be somewhat complex because we might need to track which exclusion applies to which code. However, it looks like there are only 2 entries with excepetions, and there is only ever one exception per listed code or code range, which means the exclusion can simply go in a new column.
- Note that it is possible that the same death record may contain both codes that are included under avoidable deaths and excluded. This is something to check in the actual data!
```{r}
## note: "\\p{Pd}" captures all types of hyphens/dashes
avoidable_deaths_raw %>% filter(str_detect(string = icd_10_codes, pattern = regex("excl", ignore_case = TRUE))) %>% kable(caption = "Entries with exclusions listed")
avoidable_deaths_raw %>% filter(str_detect(string = icd_10_codes, pattern = regex("\\p{Pd}", ignore_case = TRUE))) %>% kable(caption = "Entries specifying code ranges")
parse_icd_10_range <- function(code_range) {
## helper function for generating code ranges - based on the icd package, function .icd10cm_get_majors_possible()
if (is.na(code_range) | !str_detect(string = code_range, pattern="\\p{Pd}")) return(code_range)
letter_1 <- str_replace(string = code_range, pattern = "^([A-Za-z]).*$", replacement = "\\1")
letter_2 <- str_replace(string = code_range, pattern = "^.+\\p{Pd}([A-Za-z]).*$", replacement = "\\1")
letters <- LETTERS[which(LETTERS==letter_1) : which(LETTERS==letter_2)]
decimal <- str_detect(string = code_range, pattern = "\\.")
number_of_digits <- 2
if (decimal) number_of_digits <- 3
number_1 <- str_replace(string = str_remove_all(code_range, pattern="\\."), pattern = "^[A-Za-z](\\d{1,3})\\p{Pd}.*$", replacement = "\\1")
number_2 <- str_replace(string = str_remove_all(code_range, pattern="\\."), pattern = "^.+\\p{Pd}[A-Za-z](\\d{1,3})$", replacement = "\\1")
code_1 <- paste0(letter_1,number_1)
code_2 <- paste0(letter_2,number_2)
if (decimal) {
candidates <- sort(apply(
expand.grid(letters, 0:9, 0:9, 0:9),
MARGIN = 1,
FUN = paste0,
collapse = ""
))
} else {
candidates <- sort(apply(
expand.grid(letters, 0:9, 0:9),
MARGIN = 1,
FUN = paste0,
collapse = ""
))
}
parsed_codes <- candidates[seq.int(
from = which(candidates == code_1),
to = which(candidates == code_2)
)]
if (decimal) parsed_codes <- paste0(str_sub(parsed_codes, 1, 3), ".", str_sub(parsed_codes, 4, 4))
parsed_codes <- paste0(parsed_codes, collapse=",")
return(parsed_codes)
}
```
## Parsing clean data
```{r}
avoidable_deaths <-
avoidable_deaths_raw %>%
drop_na(condition_group_and_cause) %>% # remove blank rows used for blank space in the excel file
mutate(across(everything(), ~str_replace_all(string = .x, pattern = "\\p{Pd}", replacement = "-"))) %>% # replace all hyphen-likes with normal hyphen
mutate(
group = if_else(condition = is.na(icd_10_codes), true = condition_group_and_cause, false = NA_character_),
original_codes = icd_10_codes
) %>%
fill(group, .direction = "down") %>% # forward-fill group
drop_na(icd_10_codes) %>%
separate_rows(icd_10_codes, sep=",\\s*") %>% # separate comma-separated values into rows so we have one entry per row
extract(col = icd_10_codes, into = "exclusions", regex = "\\(excl\\.\\s*(.+)\\)", remove = FALSE) %>% # extract the exclusions and put them in a separate column
mutate(
icd_10_codes = str_remove_all(string = icd_10_codes, pattern = "\\s*\\(excl.+\\)"), # remove the exclusions from the code ranges
icd_10_codes = map_chr(.x = icd_10_codes, .f = ~parse_icd_10_range(.x))
) %>%
separate_rows(icd_10_codes, sep=",\\s*") %>% # separate comma-separated values again now that we've converted them from the ranges
extract(col = age, into = c("age_min", "age_max"), regex = "(\\d+)[^0-9]+(\\d+)", remove = TRUE, convert = TRUE) %>%
mutate(
treatable = case_when( # make more explicit the treatable/preventable distinction
str_detect(treatable, pattern = "50") ~ 0.5,
str_detect(treatable, pattern = "\\•") ~ 1,
TRUE ~ 0
),
preventable = case_when(
str_detect(preventable, pattern = "50") ~ 0.5,
str_detect(preventable, pattern = "\\•") ~ 1,
TRUE ~ 0
),
) %>%
relocate(group, condition_group_and_cause, original_codes)
## all codes should be treatable, preventable, or 50-50
avoidable_deaths %>% count(treatable, preventable)
```
## Checking parsing of ranges
```{r}
avoidable_deaths_raw %>%
drop_na(condition_group_and_cause) %>% # remove blank rows used for blank space in the excel file
separate_rows(icd_10_codes, sep=",\\s*") %>% # separate comma-separated values into rows so we have one entry per row
extract(col = icd_10_codes, into = "exclusions", regex = "\\(excl\\.\\s*(.+)\\)", remove = FALSE) %>% # extract the exclusions and put them in a separate column
filter(str_detect(string = icd_10_codes, pattern = "\\p{Pd}")) %>%
mutate(
icd_10_codes = str_remove_all(string = icd_10_codes, pattern = "\\s*\\(excl.+\\)"), # remove the exclusions from the code ranges
icd_10_codes_parsed = map_chr(.x = icd_10_codes, .f = ~parse_icd_10_range(.x))
) %>%
select(icd_10_codes, icd_10_codes_parsed) %>%
kable(caption = "Parsing of codes defined as ranges.")
```
## Checking parsing of age
There was previously an error with this, with some ages having a leading space, and a trailing invisible character. This is slightly overkill to check since the minimum and maximum ages are the same for all conditions, but it may change in the future! I fixed this by using `extract()` to parse the digits from the entries rather than just splitting them on the hyphen symbol.
```{r}
avoidable_deaths_raw %>%
filter(age!="0-74")
avoidable_deaths_raw %>%
count(age) %>%
kable(caption = "The different possible entries for 'age' in the raw spreadsheet. Note how there are two identical-looking entries, probably due to an invisible character.")
avoidable_deaths %>%
count(age_min, age_max) %>%
kable(caption = "The different possible entries for 'age' in the clean data after parsing.")
```
# Notes on usage
When using this with real-world data, make sure to check whether the same record contains both an inclusion and an exclusion code according to this list! What to do in this case will differ from case to case.
# Save as `.csv`
```{r}
if (!dir.exists("./avoidable_deaths")) dir.create("./avoidable_deaths")
write_csv(avoidable_deaths, file = "./avoidable_deaths/avoidable_deaths_2020_inclusions.csv")
```