-
Notifications
You must be signed in to change notification settings - Fork 0
/
community_contribution.qmd
493 lines (373 loc) · 22.3 KB
/
community_contribution.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
---
title: "Survival Analysis in R Demo with Competing Risk"
author: "Sophie Wagner (sw3767)"
echo: true
output: html
---
# Survival Analysis in R Demo
## Setup
### Load packages
```{r}
#| warning: false
rm(list = ls()) # Clearing Environment and Plots
library(dplyr) # For data manipulation
library(forcats) # For easy factor conversion
library(rstpm2) # For hazard modelling
library(cmprsk) # For competing risk analysis
library(tidycmprsk) # For competing risk analysis
library(survminer) # For plotting
library(ggplot2) # For plotting
library(ggsurvfit) # For plotting
library(readr) # Read in csv efficiently
```
### Read in data
**Data source info:**
SEER*Stat Version: 8.4.4*
*Session Type: Case Listing*
*Software: Surveillance Research Program, National Cancer Institute SEER*Stat software (www.seer.cancer.gov/seerstat) version 8.4.4.
Data: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) SEER
*SEER\*Stat Database: Incidence - SEER Research Data, 8 Registries, Nov 2023 Sub (1975-2021) - Linked To County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties, National Cancer Institute, DCCPS, Surveillance Research Program, released April 2024, based on the November 2023 submission.*
- Primary site: Breast
- Behavior: Malignant
- Sequence: First primary cancer
- Diagnosis reporting source: Not diagnosed by autopsy or death certificate
- Year of diagnosis: 2010-2015
```{r}
#| message: false
#| warning: false
d <- read_csv("seer_breast_case_listings.csv", col_types = cols(.default = col_factor()))
```
# Data cleaning
Steps taken: Convert variables to desired type Drop irrelevant columns Handle missing and unknown vals Covariate categorization (collapse categories)
### Convert variables to desired type
```{r}
# Convert some cols from factor to numeric or character
d <- d |>
mutate(`Year of diagnosis`=as.numeric(`Year of diagnosis`),
`Year of follow-up recode`=as.numeric(`Year of follow-up recode`),
`Survival months`=as.numeric(`Survival months`),
`Patient ID`=as.character(`Patient ID`))
```
We need the year and survival data to be numeric. Patient ID doesn't need to be a factor.
### Drop cols with redundant information
```{r}
d <- d |>
select(-c(`Race recode (White, Black, Other)`,
`Race and origin (recommended by SEER)`,
`ER Status Recode Breast Cancer (1990+)`,
`PR Status Recode Breast Cancer (1990+)`,
`Derived HER2 Recode (2010+)`)
)
```
Let's take a look at the data:
```{r}
head(d)
```
### Drop unknowns where necessary
```{r}
d_full <- d |>
filter(`Rural-Urban Continuum Code`!= "Unknown/missing/no match/Not 1990-2022",
`Median household income inflation adj to 2022`!="Unknown/missing/no match/Not 1990-2022",
!`Breast Subtype (2010+)` %in% c("Unknown","Recode not available"),
`SEER historic stage A (1973-2015)` != "Unstaged",
`Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)`!="Non-Hispanic Unknown Race")
```
### Collapse categorical variables
```{r}
summary(d)
```
What are these (Other) categories? Expand levels:
```{r}
levels(d$`Median household income inflation adj to 2022`)
levels(d$`Radiation recode`)
levels(d$`Histology recode - broad groupings`)
levels(d$`Age recode with <1 year olds`)
```
Explore income variable:
```{r}
# Create ordered factor income variable
d_full$income.o <- factor(
d_full$`Median household income inflation adj to 2022`,
levels = c("< $40,000", "$40,000 - $44,999", "$45,000 - $49,999",
"$50,000 - $54,999", "$55,000 - $59,999", "$60,000 - $64,999",
"$65,000 - $69,999", "$70,000 - $74,999", "$75,000 - $79,999",
"$80,000 - $84,999", "$85,000 - $89,999", "$90,000 - $94,999",
"$95,000 - $99,999", "$100,000 - $109,999", "$110,000 - $119,999",
"$120,000+"),
ordered = TRUE)
# Create numeric income variable
d_full$income.n <- as.numeric(gsub("[^0-9]", "", sub(" - .*", "",
d_full$`Median household income inflation adj to 2022`)))
```
```{r}
summary(d_full$income.n)
```
```{r}
#| fig-width: 15
#| fig-height: 10
ggplot(d_full, aes(y=income.o))+
geom_bar()
```
Now that we know a bit more about the distribution of income, we can collapse it in a way that seems intuitive. We will also collapse some other vars at the same time based on domain knowledge:
```{r}
d_full <- d_full |>
mutate(
income_quantile = factor(
fct_collapse(
`Median household income inflation adj to 2022`,
Q1 = c("< $40,000", "$40,000 - $44,999", "$45,000 - $49,999",
"$50,000 - $54,999","$55,000 - $59,999", "$60,000 - $64,999",
"$65,000 - $69,999"),
Q2 = c("$70,000 - $74,999", "$75,000 - $79,999"),
Q3 = c("$80,000 - $84,999", "$85,000 - $89,999"),
Q4 = c("$90,000 - $94,999", "$95,000 - $99,999", "$100,000 - $109,999",
"$110,000 - $119,999", "$120,000+")
),
ordered=TRUE
),
income_group = factor(
fct_collapse(
`Median household income inflation adj to 2022`,
"<60k" = c("< $40,000", "$40,000 - $44,999", "$45,000 - $49,999",
"$50,000 - $54,999","$55,000 - $59,999"),
"60-80k"=c("$60,000 - $64,999", "$65,000 - $69,999","$70,000 - $74,999",
"$75,000 - $79,999"),
"80-100k"=c("$80,000 - $84,999", "$85,000 - $89,999","$90,000 - $94,999",
"$95,000 - $99,999"),
">100k"=c("$100,000 - $109,999", "$110,000 - $119,999", "$120,000+"),
),
ordered=TRUE
),
urban_rural=factor(
fct_recode(
`Rural-Urban Continuum Code`,
"Urban large"="Counties in metropolitan areas ge 1 million pop",
"Urban med"="Counties in metropolitan areas of 250,000 to 1 million pop",
"Urban med"="Counties in metropolitan areas of lt 250 thousand pop",
"Suburban"="Nonmetropolitan counties adjacent to a metropolitan area",
"Rural"="Nonmetropolitan counties not adjacent to a metropolitan area"
),
ordered=TRUE
),
marital_status = fct_collapse(
`Marital status at diagnosis`,
"Married/Partnered"=c("Married (including common law)",
"Unmarried or Domestic Partner"),
"Divorced/Separated"=c("Divorced","Separated"),
"Single (never married)"="Single (never married)",
"Unknown"="Unknown"
)
)
```
Let's rename some vars for ease of use:
```{r}
d_full <- d_full |> rename(age_at_dx=`Age recode with <1 year olds`,
stage=`SEER historic stage A (1973-2015)`,
radiation=`Radiation recode`,
chemo=`Chemotherapy recode (yes, no/unk)`,
race_recode = `Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)`,
other_death.f=`SEER other cause of death classification`,
cancer_death.f=`SEER cause-specific death classification`,
all_death.f=`Vital status recode (study cutoff used)`,
# er_status=`ER Status Recode Breast Cancer (1990+)`,
# pr_status=`PR Status Recode Breast Cancer (1990+)`,
# her2_status=`Derived HER2 Recode (2010+)`,
breast_subtype=`Breast Subtype (2010+)`)
```
Make some categories ordinal:
```{r}
d_full$age_at_dx <- factor(d_full$age_at_dx,
levels = c("01-04 years", "05-09 years", "15-19 years",
"20-24 years", "25-29 years", "30-34 years",
"35-39 years", "40-44 years", "45-49 years",
"50-54 years", "55-59 years", "60-64 years",
"65-69 years", "70-74 years", "75-79 years",
"80-84 years", "85+ years"),
ordered = TRUE)
d_full$stage <- factor(d_full$stage,
levels = c("Localized", "Regional","Distant"),
ordered=TRUE)
```
Take a look at age distribution:
```{r}
ggplot(d_full, aes(x=age_at_dx))+
geom_bar() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
# Creating survival and competing risk variables
In survival analysis, we have a **status** variable which indicates whether an individual is still in the cohort (**censored**, meaning alive or lost to follow-up) or has experienced the event of interest (typically death, coded as 1 or higher). We define three specific status variables:
Single-event variables:
`cancer_death`: Indicates 0 if the person is alive or has died from a non-cancer cause, and 1 if they died specifically from cancer (event of interest). `other_death`: Indicates 0 if the person is alive or died from cancer, and 1 if they died from a cause other than cancer (event of interest). `all_death`: Shows 0 if alive and 1 if deceased from any cause, effectively summing `cancer_death` and `other_death`.
Generally, we use `all_death` for **overall survival** analyses because it provides a comprehensive measure of survival, accounting for all possible causes of death, thus capturing overall mortality risk. For **cancer-specific survival**, we focus on `cancer_death`, which isolates cancer as the primary event of interest.
Multi-event variables:
The Competing Risk Status (`cmp_status`) variable combines the outcomes death from cancer and death from other causes, categorizing individuals as alive, cancer-specific death (CSD), or all-cause mortality (ACM).
In **competing risks analysis**, both `cancer_death` and `other_death` are combined into a single variable, allowing us to assess the probabilities of these events within one framework rather than evaluating each separately.
We will recode all of these in both factor form and 0/1/2.. numeric. It is smart to have both factor and numeric versions for the required packages.
```{r}
d_full$other_death.n <- as.numeric(
as.character(
fct_recode(
d_full$other_death.f,
"0"= "Alive or dead due to cancer",
"1"= "Dead (attributable to causes other than this cancer dx)"
)
))
d_full$cancer_death.n <- as.numeric(
as.character(
fct_recode(
d_full$cancer_death.f,
"0"= "Alive or dead of other cause",
"1"= "Dead (attributable to this cancer dx)"
)
))
d_full$all_death.n <- as.numeric(
as.character(
fct_recode(
d_full$all_death.f,
"0"= "Alive",
"1"= "Dead"
)
))
```
Create our competing risk variable.
```{r}
d_full$cmp_status.f <- ifelse(
d_full$cancer_death.f=="Dead (attributable to this cancer dx)", "CSD",
ifelse(d_full$other_death.f=="Dead (attributable to causes other than this cancer dx)", "ACM", "Alive"))
d_full$cmp_status.n <- as.numeric(
as.character(
fct_recode(
d_full$cmp_status.f,
"0"="Alive",
"1"="CSD",
"2"="ACM"
)
))
```
# Survival analysis
To conduct a detailed survival and competing risks analysis, we use specialized packages like `survival`, `cmprsk`, `tidycmprsk`, and `ggsurvfit`. These packages allow us to estimate survival probabilities, visualize Kaplan-Meier curves, and handle competing risks with cumulative incidence functions (CIF).
survival: Core package for survival analysis in R. It includes functions to create survival objects (Surv) and fit survival models (survfit for Kaplan-Meier and coxph for Cox Proportional Hazards models). Highly customizable, but lacks some visualization features.
ggsurvfit: A wrapper around survival that enhances plot functionality by integrating with ggplot2, allowing easier addition of elements like risk tables and confidence intervals.
cmprsk: Provides functions for competing risks analysis, specifically cumulative incidence functions (CIF) via cuminc and Fine-Gray models (crr). Focus is stats, lacks ggplot2 compatibility
tidycmprsk: A tidyverse-friendly wrapper around cmprsk functions, simplifying syntax and enhancing compatibility with ggplot2 for visualization.
Using these different packages output different types of objects, which can get confusing. Let's take is step by step and see a breakdown of how the components work together, along with their benefits.
## Kaplan-Meier Survival Analysis
### Creating a survival object
Survival objects (`Surv()`) in `survival` and `ggsurvfit` represent time-to-event data.
```{r}
overall_surv_obj <- survival::Surv(time = d_full$`Survival months`,
event = d_full$all_death.n)
```
```{r}
overall_surv_obj <- ggsurvfit::Surv(time = d_full$`Survival months`,
event = d_full$all_death.n)
```
The above gives us a **survival object**. This object can then be used as a **response variable** in survival models. (We don't get that much information out of it until we use it in a survival model.)
```{r}
summary(overall_surv_obj)
```
The summary of the survival object doesn't display the survival probabilities directly, but does reveal the status and distribution of censored vs. event cases.
### Fit Kaplan Meier curve
Lets's now compute a survival curve using our censored time-to-event data.
The `ggsurvfit` package makes this easy. Below, we will use `ggsurvfit::survfit2` function which estimates survival probabilities over time. It does the same thing as `survival::survfit` but returns an environment compatible with `ggplot2`-good for plotting.
```{r}
overall_surv_fit <- survival::survfit(overall_surv_obj ~ stage, data= d_full)
```
```{r}
overall_surv_fit <- ggsurvfit::survfit2(overall_surv_obj ~ stage, data=d_full)
```
```{r}
overall_surv_fit
```
```{r}
summary(overall_surv_fit, times=c(12,24,36,48,60,72,84,96,108,120))
```
### Plotting the Kaplan-Meier curve
```{r}
ggsurvfit::ggsurvfit(overall_surv_fit) +
add_confidence_interval() +
add_risktable(risktable_stats="n.risk") +
scale_x_continuous(limits = c(0, 120), breaks = seq(0, 120, by = 12)) +
scale_y_continuous(limits = c(0, 1), labels = scales::percent_format(), breaks = seq(0, 1, by = 0.25))
```
```{r}
survminer::ggsurvplot(overall_surv_fit,
risk.table=T,
risk.table.height=0.3, #scale 0-1
pval = T,
conf.int=T,
palette = "Blues",
surv.median.line="hv",
title="Survival plot",
xlab="Months",
ylab="Survival probability",
xlim=c(0,120),
break.x.by=24)
```
return an object of class ggsurvplot which is list containing the following components:
plot: the survival plot (ggplot object)
table: the number of subjects at risk table per time (ggplot object).
cumevents: the cumulative number of events table (ggplot object).
ncensor.plot: the number of censoring (ggplot object).
data.survplot: the data used to plot the survival curves (data.frame).
data.survtable: the data used to plot the tables under the main survival curves (data.frame).
Can mix and match up to a certain point: If we create with `survival` we cannot add risk table and some other elements that ggsurvfit can do easily.
### Getting the Cox PH Hazard Ratio
To calculate hazard ratios, we fit a Cox Proportional Hazards (Cox PH) model and check for proportional hazards assumptions using residuals.
Fit the coxph model:
```{r}
# Fit model
overall_ph <- survival::coxph(overall_surv_obj ~ stage + marital_status + urban_rural + income_quantile, data=d_full) # need survival pkg
# Extract HRs
summary(overall_ph)
```
Test for proportional hazards assumption:
```{r}
overall_test <- survival::cox.zph(overall_ph)
overall_test
survminer::ggcoxzph(overall_test)
```
The ggcoxzph function from survminer provides plots for each covariate, indicating whether the proportional hazards assumption holds. *Non-horizontal lines suggest potential violations.* And we want a global p-value that is NOT significant (i.e. no significant evidence that the hazard ratio changes over time). Note if you get a significant value, look into: - stratifying by survival time (less than 5 years, greater than 5 years, e.g.) - creating an interaction term between time variable and a covariate that doesn't pass (see if significant relationship)
`survminer` gives you a wrapper around plot.cox.zph and makes plotting super easy!
## Competing risks analysis
In cases where multiple events can occur, (in this case a patient can die from either cancer OR other causes) competing risks analysis is essential. KM cannot account for competing events -- instead, we can look at cumulative incidence functions (CIFs) to estimate the probability of each event type -- while acknowledging that competing events reduce the probability of observing the primary event of interest.
There are a couple of different packages out there for competing risk analysis, and it gets tricky figuring out which one to use!
Cumulative incidence function (CIF) objects in cmprsk and tidycmprsk handle competing risks, representing the probability of specific events over time.
`cmprsk` is the original competing risk package
`tidycmprsk` provides a tidyverse wrapper which makes it easy for plotting and compatible with `ggsurvfit`
Again, we will start by fitting a model, this time with the event variable having more than 1 event. We will use our cmp_status variable, coded with 0=Alive, 1=Cancer-specific death, and 2=Other cause of death.
`cmprsk::cuminc` can take either factor or numeric. Make sure to specify which level is your censoring variable.
```{r}
cif_obj <- cmprsk::cuminc(ftime=d_full$`Survival months`,
fstatus=d_full$cmp_status.f,
cencode="Alive",
group=d_full$stage)
```
```{r}
cmprsk::plot.cuminc(cif_obj,main="Cumulative inc funciton", xlab="Months",
color=c("blue","red","green","blue","red","green"))
```
Or, we can do it much more easily using the tidycmprsk package! Note: must use tidycmprsk object to plot it
```{r}
cr_surv_obj <- tidycmprsk::Surv(d_full$`Survival months`, as.factor(d_full$cmp_status.n))
cif_ggfit <- tidycmprsk::cuminc(Surv(`Survival months`, as.factor(cmp_status.n)) ~ stage, data=d_full)
# Or pass Surv directly and model will default to tidycmprsk::Surv
cif_ggfit <- tidycmprsk::cuminc(Surv(`Survival months`, as.factor(cmp_status.n)) ~ stage, data=d_full)
cif_ggfit
```
Filename: Case Listing Session-1 Matrix-2 SEER\*Stat Version: 8.4.4 Date: November 5, 2024
Session Type: Case Listing
SUGGESTED CITATION\
Software: Surveillance Research Program, National Cancer Institute SEER*Stat software (www.seer.cancer.gov/seerstat) version 8.4.4. Data: Surveillance, Epidemiology, and End Results (SEER) Program (www.seer.cancer.gov) SEER*Stat Database: Incidence - SEER Research Data, 8 Registries, Nov 2023 Sub (1975-2021) - Linked To County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties, National Cancer Institute, DCCPS, Surveillance Research Program, released April 2024, based on the November 2023 submission.
DATA\
Database: Incidence - SEER Research Data, 8 Registries, Nov 2023 Sub (1975-2021) - Linked To County Attributes - Time Dependent (1990-2022) Income/Rurality, 1969-2022 Counties
SELECTION\
Select Only: Malignant Behavior, Known Age
Case: {Site and Morphology.Site recode ICD-O-3/WHO 2008} = 'Breast' AND {Other.Type of Reporting Source} != 'Autopsy only', 'Death certificate only' AND {Cause of Death (COD) and Follow-up.Survival months flag} != 'Not calculated because a Death Certificate Only or Autopsy Only case' AND {Cause of Death (COD) and Follow-up.Survival months} != 'Unknown' AND {Cause of Death (COD) and Follow-up.SEER other cause of death classification} != 'Dead (missing/unknown COD)', 'N/A not seq 0-59' AND {Cause of Death (COD) and Follow-up.SEER cause-specific death classification} != 'Dead (missing/unknown COD)', 'N/A not seq 0-59' AND {Race, Sex, Year Dx.Sex} = 'Female' AND {Multiple Primary Fields.First malignant primary indicator} = 'Yes' AND {Race, Sex, Year Dx.Year of diagnosis} = '2010', '2011', '2012', '2013', '2014', '2015'
OUTPUT\
Title: Breast Cancer survival data - Diagnoses 2010-2015
TABLE\
Column: Patient ID Age recode with \<1 year olds Race recode (White, Black, Other) Year of diagnosis Laterality Histology recode - broad groupings SEER historic stage A (1973-2015) Radiation recode Chemotherapy recode (yes, no/unk) ER Status Recode Breast Cancer (1990+) PR Status Recode Breast Cancer (1990+) Derived HER2 Recode (2010+) Breast Subtype (2010+) SEER cause-specific death classification SEER other cause of death classification Survival months Vital status recode (study cutoff used) Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic) Year of follow-up recode Year of death recode Marital status at diagnosis Median household income inflation adj to 2022 Rural-Urban Continuum Code Race and origin (recommended by SEER) \[Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic); PRCDA 2020\]
USER DEFINITIONS\
Race and origin (recommended by SEER) \[Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic); PRCDA 2020\] Description: These are the groupings typcailly used by SEER for reporting by race and ethnicity for analyses starting in 1990 or later. For more details see:\
https://seer.cancer.gov/seerstat/variables/seer/race_ethnicity All races/ethnicities: {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Non-Hispanic White', 'Non-Hispanic Black', 'Non-Hispanic American Indian/Alaska Native', 'Non-Hispanic Asian or Pacific Islander', 'Hispanic (All Races)', 'Non-Hispanic Unknown Race' Non-Hispanic White: {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Non-Hispanic White' Non-Hispanic Black: {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Non-Hispanic Black' Non-Hispanic American Indian/Alaska Native (PRCDA counties only): {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Non-Hispanic American Indian/Alaska Native' AND {Race, Sex, Year Dx.PRCDA 2020} = 'PRCDA' Non-Hispanic Asian or Pacific Islander: {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Non-Hispanic Asian or Pacific Islander' Hispanic (All Races): {Race, Sex, Year Dx.Race and origin recode (NHW, NHB, NHAIAN, NHAPI, Hispanic)} = 'Hispanic (All Races)'