-
Notifications
You must be signed in to change notification settings - Fork 0
/
proj3.Rmd
737 lines (503 loc) · 33.4 KB
/
proj3.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
---
title: "Tranportability Case-study and Simulation Analysis"
author: "Yu Yan"
date: "2023-11-29"
output: pdf_document
bibliography: references.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = F)
knitr::opts_chunk$set(error = F)
knitr::opts_chunk$set(warning = F)
knitr::opts_chunk$set(message = F)
library(dplyr)
library(tableone)
library(mice)
library(finalfit)
library(mvnormtest)
library(MASS)
library(kableExtra)
library(gtsummary)
```
```{r}
# read datasets
framingham_df <- read.csv('framingham_df.csv')
df_2017 <- read.csv('df_2017.csv')
```
# Abstract
This report focuses on a case_study of Transportability analysis following the methodology outlined in the reference paper titled "Transporting a Prediction Model for Use in a New Target Population" by Dr. [Steingrimsson](https://pubmed.ncbi.nlm.nih.gov/?term=Steingrimsson+JA&cauthor_id=35872598) et.al. It involved using source data from Framingham Heart Study, target data from nhanes study and a model derived from source data. The first goal of this analysis report is to evaluate the performance of the given prediction model in the target population underlying the NHANES data.
The transportability analysis evaluates model performance using the Brier Score and involves model modification on combined training data, with tailored models constructed using inverse-odds weights. By evaluting on combined test data, the results indicate high predictive accuracy in both male and female subsets, with women outperforming men.
The report also briefly touches on a simulation study, conducted to test the transportability of the CVD prediction model across different data generation mechanisms. The goal is to simulate individual level data as if only summary level data is presented. The case illustrated using nhanes data. Three mechanisms are explored, each simulating NHANES data under varying assumptions. Results reveal the impact of these mechanisms on model performance, underlining the importance of considering diverse data generation scenarios when assessing model transportability in new target populations.
# Data Processing
In the data processing part, since the Framingham data is provided as filtered and complete, we will create a new variable of source and give it a value of 1 indicate it as source population, and filter age range as 30-74, indicated by [**B. D'AgostinoSr**](https://www.ahajournals.org/doi/10.1161/CIRCULATIONAHA.107.699579# "Ralph B. D’AgostinoSr") **et.al**[@dagostino2008]**.** For the nhanes data, we will do some processing as follows: filter out observations whose age is above 30 and below 74 as the eligibility criteria that matches the setting of Framingham heart study, and then created both 'SYSBP_UT' and 'SYSBP_T' the same way in the processing of Framingham data. Lastly we added 'source' and give it a value of 0 indicating this is target data. As the model that we are evaluating is stratified by different sex, we will also divide both data by sex as subsets, and perform transportability analysis seperately.
```{r}
# data processing
framingham_df <- framingham_df %>%
# Eligibility Criteria
filter(AGE>30 , AGE<74)
framingham_df$source = 1
df_2017_el <- df_2017 %>%
# Eligibility Criteria
filter(AGE>30 , AGE<74) %>%
# new cols
mutate(SYSBP_UT = ifelse(BPMEDS == 0,
SYSBP, 0)) %>%
mutate(SYSBP_T = ifelse(BPMEDS == 1,
SYSBP, 0)) %>%
# source column
mutate(source=0)
# delete id column
df_2017_el <- df_2017_el[,-1]
# change variable format
factor_col <- c('SEX','CURSMOKE','DIABETES','BPMEDS')
# men=0,women=1
framingham_df$SEX <- ifelse(framingham_df$SEX==1,0,1)
df_2017_el$SEX <- ifelse(df_2017_el$SEX==1,0,1)
# change to factor
framingham_df <- framingham_df %>% mutate_at(factor_col,as.factor)
df_2017_el <- df_2017_el %>% mutate_at(factor_col,as.factor)
# Filter to each sex
framingham_df_men <- framingham_df %>% filter(SEX == 0)
framingham_df_women <- framingham_df %>% filter(SEX == 1)
```
```{r}
par(mfrow = c(2,2))
hist(framingham_df$TOTCHOL,main = 'Histogram of Total Cholestrol',xlab ='Total Cholestrol')
hist(framingham_df$AGE, main = 'Histogram of Age',xlab ='Age')
hist(framingham_df$SYSBP,main = 'Histogram of Systolic Blood Pressure',xlab ='Systolic Blood Pressure')
hist(framingham_df$HDLC,main = 'Histogram of High-density Lipoprotein Cholesterol',xlab ='High-density Lipoprotein Cholesterol')
par(mfrow = c(1,1))
hist(framingham_df$BMI,main = 'Histogram of BMI',xlab ='BMI')
```
From the frequency, we can see that except for age, the other continuous variables looks close to normal. We will further testify their distribution later during the simulation part.
# Methods
In order to follow the original transportability analysis, this report will follow the following general analytically design:
1. Acquiring both source data and target data. The goal is to see how a candidate model, which is build based on the source data, would perform on the target data. The evaluation used is Brier Score
2. If the model is mis-specified, modify the model on the combined training dataset of source and target to get a tailored model. Then evaluate this model on combined test data set of source and target.
The model has the following formula:
$$
\log\left(\frac{P(CVD)}{1-P(CVD)}\right)
= \log(HDLC) + \log(TOTCHOL) + \log(AGE) + \log(SYSBP_{UT}+1) + \log(SYSBP_{T}+1) + CURSMOKE + DIABETES
$$
3. For generating tailored model, first estimate the probability of source population membership using training data form the source population. Second use the estimated probability to construct inverse-odds weights for the same observations. Lastly, apply the inverse-odds weights to estimate a tailor model using all source and training dataset observation.
4. The final step is to evaluate this tailored model on the pre-seperated test data sets using the Brier score Estimator as follows:
$\hat{\psi}_{\hat{\beta}}=\frac{\sum_{i=1}^n I\left(S_i=1, D_{\text {test }, i}=1\right) \hat{o}\left(X_i\right)\left(Y_i-g_{\hat{\beta}}\left(X_i\right)\right)^2}{\sum_{i=1}^n I\left(S_i=0, D_{\text {test }, i}=1\right)}$
where $\hat{o} = \frac{\operatorname{Pr}\left[S=0 \mid X, D_{\text {test }}=1\right]}{\operatorname{Pr}\left[S=1 \mid X, D_{\text {test }}=1\right]}$
Starting form train-test split, since there's missingness in the target nhanes data, we will use mice function to impute the missing data and also incorporate training test split process. By the end of MICE stage, we will have 5 unique train and test sets from target population with 75% vs 25% proportion. Next, for each of the train data of target, we will combine it with the same proportion splitted train set of source population to get a combined training set that is complete and ready for model tailoring. We will conduct the process for both men and women splitted subsets since the model should be evaluated on through such stratification and thus also tailored on such stratification.
In conclusion, with our 5 time imputation, we will end up with two lists of brier scores, one for men and the other for women. In each list, it consists of 5 different estimators of brier scores, which represent corresponding tailored model from combined training data, evaluated on combined test data. We will use average of each list as our final results for the transportability analysis.
```{r}
missing_byvar <- apply(df_2017_el[,-1], 2, function(x) (sum(is.na(x))/nrow(df_2017_el))*100 )
missing_byvar[missing_byvar>0] %>%
kable(booktabs = TRUE, caption = "Variables with missingness",col.names = 'Missing Percentage') %>%
kable_styling(full_width = TRUE, latex_options = "hold_position")
# mice
# MICE and test train split
set.seed(2550)
ignore <- sample(c(TRUE, FALSE), size = nrow(df_2017_el), replace = TRUE, prob = c(0.25, 0.75))
nhanes.imp <- mice(df_2017_el, m = 5, ignore = ignore, print = FALSE, seed = 2550)
imp_train <- filter(nhanes.imp, !ignore)
nhanes_imp_train_long <- mice::complete(imp_train,action="long")
# Store each imputed data set (Train)
nhanes_imp_train <- vector("list",5)
for (i in 1:5){
nhanes_imp_train[[i]] <- mice::complete(imp_train,i)
}
imp.test <- filter(nhanes.imp, ignore)
# Store each imputed data set (Test) long format
nhanes_imp_test_long <- mice::complete(imp.test,action="long")
# t t split for framingham
set.seed(2550)
ignore_men <- sample(c(TRUE, FALSE), size = nrow(framingham_df_men), replace = TRUE, prob = c(0.25, 0.75))
ignore_women <- sample(c(TRUE, FALSE), size = nrow(framingham_df_women), replace = TRUE, prob = c(0.25, 0.75))
framingham_men_train <- framingham_df_men[!ignore_men,]
framingham_men_test <- framingham_df_men[ignore_men,]
framingham_women_train <- framingham_df_women[!ignore_women,]
framingham_women_test <- framingham_df_women[ignore_women,]
```
```{r}
# formulas
member_formula <- as.formula(source~log(HDLC)+log(TOTCHOL)+log(AGE)+log(SYSBP_UT+1)+
log(SYSBP_T+1)+CURSMOKE+DIABETES)
pred_formula <- as.formula(CVD~log(HDLC)+log(TOTCHOL)+log(AGE)+log(SYSBP_UT+1)+
log(SYSBP_T+1)+CURSMOKE+DIABETES)
```
```{r}
# helper function
# inverse-odds weights estimator
iv_weight <- function(fit_dat, pred_dat){
# get probability of membership
prob_mod <- glm(member_formula, data = fit_dat,
family= "binomial")
w <- (1 - predict(prob_mod,newdata = pred_dat,type = 'response')) /
(predict(prob_mod,newdata = pred_dat,type = 'response'))
return(w)
}
```
# Model Evaluation
```{r }
# Form seperate combined testing dataset for men and women
men_test <- bind_rows(nhanes_imp_test_long[nhanes_imp_test_long$SEX==0,],framingham_men_test)
women_test <- bind_rows(nhanes_imp_test_long[nhanes_imp_test_long$SEX==1,],framingham_women_test)
# brier score estimator
brier_estimator <- function(train,weight) {
w <- weight
# get tailored model
mod <- glm(pred_formula,data = train[train$source==1,], family= "binomial")
source_index = which(total_test$source==1)
w_test <- iv_weight(fit_dat = total_test,pred_dat = total_test[total_test$source==1,])
weighted_eval <-
(total_test[source_index,]$CVD -
ifelse(predict(mod,newdata = total_test[source_index,],type = 'response')>=0.5,1,0) )^2
denominator <- sum(w_test * weighted_eval)
numerator <- sum(total_test$source==0)
return(denominator/numerator)
}
```
```{r}
# Form 5 combined training set for men
nhanes_imp_train_men <- lapply(nhanes_imp_train, function(dataset) {
df <- bind_rows(dataset[dataset$SEX==0,],framingham_df)
return(as.data.frame(df))
})
# Form 5 combined training set for women
nhanes_imp_train_women <- lapply(nhanes_imp_train, function(dataset) {
df <- bind_rows(dataset[dataset$SEX==1,],framingham_df)
return(as.data.frame(df))
})
# Get brier list result for men data
brier_list_men <- c()
for (i in 1:5) {
train <- nhanes_imp_train_men[[i]]
weight <- iv_weight(fit_dat = train,pred_dat = train[train$source==1,])
# get tailored model
mod <- glm(pred_formula, weights = weight,data = train[train$source==1,], family= "binomial")
w_test <- iv_weight(fit_dat = men_test,pred_dat = men_test[men_test$source==1,])
weighted_eval <-
(men_test[men_test$source==1,]$CVD -
ifelse(predict(mod,newdata = men_test[men_test$source==1,],type = 'response')>=0.5,1,0) )^2
denominator <- sum(w_test * weighted_eval)
numerator <- sum(men_test$source==0)
brier_list_men[i] <- denominator/numerator
}
# Get brier list result for women data
brier_list_women <- c()
for (i in 1:5) {
train <- nhanes_imp_train_women[[i]]
weight <- iv_weight(fit_dat = train,pred_dat = train[train$source==1,])
# get tailored model
mod <- glm(pred_formula, weights = weight,data = train[train$source==1,], family= "binomial")
w_test <- iv_weight(fit_dat = women_test,pred_dat = women_test[women_test$source==1,])
weighted_eval <-
(women_test[women_test$source==1,]$CVD -
ifelse(predict(mod,newdata = women_test[women_test$source==1,],type = 'response')>=0.5,1,0) )^2
denominator <- sum(w_test * weighted_eval)
numerator <- sum(women_test$source==0)
brier_list_women[i] <- denominator/numerator
}
```
The Brier Score measures the accuracy of predicted probabilities for binary outcomes. It ranges from 0 to 1, with lower values indicating better predictive accuracy. We can observe relative consistency of output across imputation for both male and female. The average brier score for male is about 0.1307 and that of women is 0.0557. While both value are very close to 0, indicating good predictive accuracy. This translates to the following conclusion:
Through tranportability analysis, the prediction model derived based on Framingham data is evaluted to be also perform very well in the nhances data. Predictive accuracy of women subset is better than that of men subset.
```{r}
# Result Reporting
brier_list_men[6] <- mean(brier_list_men)
names(brier_list_men) <- c('M1','M2','M3','M4','M5','Mean')
as.data.frame(round(t(brier_list_men),4))%>%
kable(booktabs = TRUE, caption = "Brier Score results for Men") %>%
kable_styling(full_width = TRUE, latex_options = "hold_position")
brier_list_women[6] <- mean(brier_list_women)
names(brier_list_women) <- c('M1','M2','M3','M4','M5','Mean')
as.data.frame(round(t(brier_list_women),4))%>%
kable(booktabs = TRUE, caption = "Brier Score results for Women") %>%
kable_styling(full_width = TRUE, latex_options = "hold_position")
```
# Simulation
We will simulate individual level data from the summary of nhances data with insights gained from Framingham data, which is the source data.
The aim of this simulation study is to test the Tranportability of CVD-prediction model generated on the Framingham data to target population in different distributions(Similarity of target population to source population). We will use several different data generation mechanisms to generate individual level data of nhance. The different data generation mechanisms will incorporate situations where the simulated data is very close to distribution of the source population and not close to it. For each data generation mechanisms, we will run 5000 simulations using the same method from the above, and report average brier score for each scenario of data generation. The reason for choosing number of simulation to be 5,000 was based on the following reasons: referring to the objectives of the study, we want to how each estimators compare to the true value of brier score and a similar example provided in the reference paper used 10,000 number of simulations. In consideration of computation time and higher model complexity in comparison to the reference paper simulation example, we decided to use 5,000 number of simulations. We will also compare the estimators to the man and women brier score from the actual data as the true estimands. The performance measures is the respective averaged brier score from each of the different data generating process. It will be compared to non-simulated nhanes dataset which corresponds to the results from the upper section.
The following are two tables of logged summary statistics of both complete case nhances and Framingham data.
```{r}
# get summary stats of nhanes
continuous_vars <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI")
report_vars <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI","BPMEDS","DIABETES",'SEX','CURSMOKE')
framingham_df_log <- framingham_df %>% mutate_at(continuous_vars,log)
df_2017_log <- df_2017_el %>% mutate_at(continuous_vars,log)
framingham_df_log %>% select_at(report_vars) %>% tbl_summary() %>%
# convert to kableExtra
as_kable_extra(booktabs = TRUE, caption = "Summary Statistics of log tranformed Framingham data") %>%
# reduce font size to make table fit.
kableExtra::kable_styling(full_width = T, font_size = 7)
```
```{r}
df_2017_log %>% select_at(report_vars) %>% tbl_summary() %>%
# convert to kableExtra
as_kable_extra(booktabs = TRUE, caption = "Summary Statistics of log tranformed Nhanes data") %>%
# reduce font size to make table fit.
kableExtra::kable_styling(full_width = T, font_size = 7)
#mshapiro.test(t(as.matrix(framingham_df[,continuous_vars])))
```
The reason for representing logged summary is that we make assumptions in the first data generation mechanism that all continuous variables follow a normal distribution after log transformation. This is a rather strong assumption and also the assumption *A1: Independence of the outcome Y and the population S, conditional on covariatesm from the paper* [@steingrimsson2022]. Therefore in the first data generation, we will generate each continuous variable based solely on the mean and sd from the above table and each binary variable as randomly generated number following binomial distribution with proportion of from the table as probability parameter. We run the simulation 5000 times, each simulation we generate 4000 cases of individual level data to match the actual nhanes data, and get the average result for both men and women subset.
```{r}
sim_dat_1 <- function(n){
#continuous_cova <- as.data.frame(mvrnorm(n=n,mu=mu_list,Sigma = cov_mat ))
sysbp <- exp(rnorm(n,4.83,0.14))
totalchl <- exp(rnorm(n,5.24,0.21))
age <- exp(rnorm(n,3.93,0.24))
hdlc <- exp(rnorm(n,3.93,0.28))
bmi <- exp(rnorm(n,3.39,0.23))
bpmeds_cova <- rbinom(n,size = 1,prob=0.326)
diab_cova <- rbinom(n,size = 1,prob=0.165)
sex <- rbinom(n,size = 1,prob=0.516)
smoke <- rbinom(n,size = 1,prob=0.202)
df <- as.data.frame(cbind(totalchl,age,sysbp,hdlc,bmi,bpmeds_cova,diab_cova,sex,smoke))
names(df) <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI","BPMEDS","DIABETES",'SEX','CURSMOKE')
df <- df %>%
# Eligibility Criteria
filter(AGE>30 , AGE<74) %>%
# new cols
mutate(SYSBP_UT = ifelse(BPMEDS == 0,
SYSBP, 0)) %>%
mutate(SYSBP_T = ifelse(BPMEDS == 1,
SYSBP, 0)) %>%
# source column
mutate(source=0) %>%
mutate_at(factor_col,as.factor)
return(df)
}
```
```{r}
sim_dat_pool <- list()
for (i in 1:1000) {
object_name <- paste("sim_", i, sep = "")
data <- sim_dat_1(n=4000)
sim_dat_pool[[object_name]] <- data
}
result_m_1 <- c()
result_w_1 <- c()
# paste(names(sim_dat_pool)[i],'_result',sep = "")
for (i in 1:length(sim_dat_pool)) {
set.seed(2550)
df <- sim_dat_pool[[i]]
df_men <- df[df$SEX==0,]
df_women <- df[df$SEX==1,]
test_index_m <- sample(c(TRUE, FALSE), size = nrow(df_men), replace = TRUE, prob = c(0.25, 0.75))
test_index_w <- sample(c(TRUE, FALSE), size = nrow(df_women), replace = TRUE, prob = c(0.25, 0.75))
sim_train_m <- bind_rows(framingham_men_train,df_men[!test_index_m,])
sim_test_m <- bind_rows(framingham_men_test, df_men[test_index_m,])
sim_train_w <- bind_rows(framingham_women_train, df_women[!test_index_w,])
sim_test_w <- bind_rows(framingham_women_test, df_women[test_index_w,])
sim_weight_train_m <- iv_weight(fit_dat = sim_train_m,pred_dat = sim_train_m[sim_train_m$source==1,])
sim_weight_train_w <- iv_weight(fit_dat = sim_train_w,pred_dat = sim_train_w[sim_train_w$source==1,])
# get tailored model
mod_m <- glm(pred_formula, weights = sim_weight_train_m,data = sim_train_m[sim_train_m$source==1,], family= "binomial")
mod_w <- glm(pred_formula, weights = sim_weight_train_w,data = sim_train_w[sim_train_w$source==1,], family= "binomial")
sim_weight_test_m <- iv_weight(fit_dat = sim_test_m,pred_dat = sim_test_m[sim_test_m$source==1,])
sim_weight_test_w <- iv_weight(fit_dat = sim_test_w,pred_dat = sim_test_w[sim_test_w$source==1,])
weighted_eval_m <-
(sim_test_m[sim_test_m$source==1,]$CVD -
ifelse(predict(mod_m,newdata = sim_test_m[sim_test_m$source==1,],type = 'response')>=0.5,1,0) )^2
weighted_eval_w <-
(sim_test_w[sim_test_w$source==1,]$CVD -
ifelse(predict(mod_w,newdata = sim_test_w[sim_test_w$source==1,],type = 'response')>=0.5,1,0) )^2
denominator_m <- sum(sim_weight_test_m * weighted_eval_m)
denominator_w <- sum(sim_weight_test_w * weighted_eval_w)
numerator_m <- sum(sim_test_m$source==1)
numerator_w <- sum(sim_test_w$source==0)
result_m_1[i] <- denominator_m/numerator_m
result_w_1[i] <- denominator_w/numerator_w
}
```
In the second data generation mechanism, we will use information from framingham data to inform simulation of nhanes data. By assuming multivariate normal distribution of all continuous variables, we will generate all the continuous variable of nhanes data by using means from log transformed nhanes original data and covariance matrix from log transformed Framingham data. For binary variable, since the proportion of each level for variable 'BPMEDS' and 'DIABETES' is very imbalanced in Framingham data, indicated form the above summary, we will generate them as usual of binomial distribution. For 'CURSMOKE' and 'SEX', we prefitted a logistic regression of each against the remaining variables using log transformed Framingham data. During the simulation stage, we will use these models to generate the variables for nanes data. This way, we hope to catch relationships and correlarions between variables within the Framingham data, and use such information to help simulation of nhanes data. We run the simulation 5000 times, each simulation we generate 4000 cases of individual level data to match the actual nhanes data, and get the average result for both men and women subset.
```{r}
continuous_vars <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI")
mu_nehanes <- colMeans(df_2017_log[,continuous_vars],na.rm = T)
cov_framingham <- cov(framingham_df_log[,continuous_vars])
# simulate over a range
mod_sex <- glm(SEX ~ TOTCHOL+AGE+SYSBP+HDLC+BMI+BPMEDS+DIABETES, family = 'binomial',data = framingham_df_log)
mod_smoke <- glm(CURSMOKE ~ TOTCHOL+AGE+SYSBP+HDLC+BMI+BPMEDS+DIABETES, family = 'binomial',data = framingham_df_log)
sim_dat_2 <- function(n){
continuous_cova <- as.data.frame(mvrnorm(n=n,mu=mu_nehanes,Sigma = cov_framingham ))
bpmeds_cova <- as.factor(rbinom(n,size = 1,prob=0.326))
diab_cova <- as.factor(rbinom(n,size = 1,prob=0.165))
df <- as.data.frame(cbind(continuous_cova,bpmeds_cova,diab_cova))
names(df) <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI","BPMEDS","DIABETES")
sex <- ifelse(predict(mod_sex,newdata = df,type = 'response')>0.5,1,0)
smoke <- ifelse(predict(mod_smoke,newdata = df,type = 'response')>0.5,1,0)
df <- as.data.frame(cbind(df,sex,smoke))
names(df) <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI","BPMEDS","DIABETES",'SEX','CURSMOKE')
df[,continuous_vars] <- exp(continuous_cova)
df <- df %>%
# Eligibility Criteria
filter(AGE>30 , AGE<74) %>%
# new cols
mutate(SYSBP_UT = ifelse(BPMEDS == 0,
SYSBP, 0)) %>%
mutate(SYSBP_T = ifelse(BPMEDS == 1,
SYSBP, 0)) %>%
# source column
mutate(source=0) %>%
mutate_at(factor_col,as.factor)
return(df)
}
```
```{r}
sim_dat_pool <- list()
for (i in 1:1000) {
object_name <- paste("sim_", i, sep = "")
data <- sim_dat_2(n=4000)
sim_dat_pool[[object_name]] <- data
}
result_m_2 <- c()
result_w_2 <- c()
# paste(names(sim_dat_pool)[i],'_result',sep = "")
for (i in 1:length(sim_dat_pool)) {
set.seed(2550)
df <- sim_dat_pool[[i]]
df_men <- df[df$SEX==0,]
df_women <- df[df$SEX==1,]
test_index_m <- sample(c(TRUE, FALSE), size = nrow(df_men), replace = TRUE, prob = c(0.25, 0.75))
test_index_w <- sample(c(TRUE, FALSE), size = nrow(df_women), replace = TRUE, prob = c(0.25, 0.75))
sim_train_m <- bind_rows(framingham_men_train,df_men[!test_index_m,])
sim_test_m <- bind_rows(framingham_men_test, df_men[test_index_m,])
sim_train_w <- bind_rows(framingham_women_train, df_women[!test_index_w,])
sim_test_w <- bind_rows(framingham_women_test, df_women[test_index_w,])
sim_weight_train_m <- iv_weight(fit_dat = sim_train_m,pred_dat = sim_train_m[sim_train_m$source==1,])
sim_weight_train_w <- iv_weight(fit_dat = sim_train_w,pred_dat = sim_train_w[sim_train_w$source==1,])
# get tailored model
mod_m <- glm(pred_formula, weights = sim_weight_train_m,data = sim_train_m[sim_train_m$source==1,], family= "binomial")
mod_w <- glm(pred_formula, weights = sim_weight_train_w,data = sim_train_w[sim_train_w$source==1,], family= "binomial")
sim_weight_test_m <- iv_weight(fit_dat = sim_test_m,pred_dat = sim_test_m[sim_test_m$source==1,])
sim_weight_test_w <- iv_weight(fit_dat = sim_test_w,pred_dat = sim_test_w[sim_test_w$source==1,])
weighted_eval_m <-
(sim_test_m[sim_test_m$source==1,]$CVD -
ifelse(predict(mod_m,newdata = sim_test_m[sim_test_m$source==1,],type = 'response')>=0.5,1,0) )^2
weighted_eval_w <-
(sim_test_w[sim_test_w$source==1,]$CVD -
ifelse(predict(mod_w,newdata = sim_test_w[sim_test_w$source==1,],type = 'response')>=0.5,1,0) )^2
denominator_m <- sum(sim_weight_test_m * weighted_eval_m)
denominator_w <- sum(sim_weight_test_w * weighted_eval_w)
numerator_m <- sum(sim_test_m$source==1)
numerator_w <- sum(sim_test_w$source==0)
result_m_2[i] <- denominator_m/numerator_m
result_w_2[i] <- denominator_w/numerator_w
}
```
For the third data generation, we will exploit the package 'fitdistrplus'[@fitdistrplus] to find the best distribution that fits for each continuous variable and get their parameter values, if we don't make assumption about all being normal distribution.
```{r}
library(fitdistrplus)
distribution_names <- c("norm", "exp", "gamma", "lnorm")
find_fits <- function(df){
fits <- list()
for (dist_name in distribution_names) {
fit <- fitdist(df, dist_name)
fits[[dist_name]] <- fit
}
aic_values <- sapply(fits, function(fit) fit$aic)
best_fit <- names(fits)[which.min(aic_values)]
return(
list(fits[[best_fit]]$distname,fits[[best_fit]]$estimate)
)
}
```
We provide four candidate distribution: normal, exponential, gamma and log normal. Then we would fit each continuous variables withe the four candidates and select the best fit by lowest AIC values. We can also draw the Cullen and Frey graph to see which distribution best fits the data. Example of graph is given below. The result is displayed as below. All variables except 'HDLC' is determined to be log normal distribution, and HDLC is best selected as gamma distribution. Following this, we will make the third data generation process follow their respective distribution and parameters. All binary variables are generated as the first generation. We run the simulation 5000 times, each simulation we generate 4000 cases of individual level data to match the actual nhanes data, and get the average result for both men and women subset.
```{r}
descdist(framingham_df$TOTCHOL, discrete=FALSE, boot=500)
chl_fit <- find_fits(framingham_df$TOTCHOL)
age_fit <- find_fits(framingham_df$AGE)
hdlc_fit <- find_fits(framingham_df$HDLC)
bmi_fit <- find_fits(framingham_df$BMI)
sysbp_fit <- find_fits(framingham_df$SYSBP)
fit_list <- list(chl_fit,age_fit,hdlc_fit,bmi_fit,sysbp_fit)
fit_names <- sapply(fit_list, function(list) list[[1]])
param_vals <- round(sapply(fit_list, function(list) list[[2]]),4)
fit_tab <- rbind(fit_names,param_vals)
colnames(fit_tab) <- c("TOTCHOL","AGE","HDLC","BMI","SYSBP")
row.names(fit_tab) <- c('Distribution','meanlog_shape','sdlog_rate')
fit_tab %>%
kable(booktabs = TRUE, caption = "Best Distribution fit for Continuous variables") %>%
kable_styling(full_width = TRUE, latex_options = "hold_position")
```
```{r}
sim_dat_3 <- function(n){
sysbp <- rlnorm(n, 4.91845791788465, 0.156296253388122)
totalchl <- rlnorm(n, 5.45326364860203, 0.185912494277935)
age <- rlnorm(n, 4.07113791586455, 0.124473761542617)
hdlc <- rgamma(n, 10.3745119984891, 0.21140498576664)
bmi <- rlnorm(n, 3.24271921386657, 0.14663705862075)
bpmeds_cova <- rbinom(n,size = 1,prob=0.326)
diab_cova <- rbinom(n,size = 1,prob=0.165)
sex <- rbinom(n,size = 1,prob=0.516)
smoke <- rbinom(n,size = 1,prob=0.202)
df <- as.data.frame(cbind(totalchl,age,sysbp,hdlc,bmi,bpmeds_cova,diab_cova,sex,smoke))
names(df) <- c("TOTCHOL","AGE","SYSBP","HDLC","BMI","BPMEDS","DIABETES",'SEX','CURSMOKE')
df <- df %>%
# Eligibility Criteria
filter(AGE>30 , AGE<74) %>%
# new cols
mutate(SYSBP_UT = ifelse(BPMEDS == 0,
SYSBP, 0)) %>%
mutate(SYSBP_T = ifelse(BPMEDS == 1,
SYSBP, 0)) %>%
# source column
mutate(source=0) %>%
mutate_at(factor_col,as.factor)
return(df)
}
```
\
```{r}
sim_dat_pool <- list()
for (i in 1:1000) {
object_name <- paste("sim_", i, sep = "")
data <- sim_dat_3(n=4000)
sim_dat_pool[[object_name]] <- data
}
result_m_3 <- c()
result_w_3 <- c()
# paste(names(sim_dat_pool)[i],'_result',sep = "")
for (i in 1:length(sim_dat_pool)) {
set.seed(2550)
df <- sim_dat_pool[[i]]
df_men <- df[df$SEX==0,]
df_women <- df[df$SEX==1,]
test_index_m <- sample(c(TRUE, FALSE), size = nrow(df_men), replace = TRUE, prob = c(0.25, 0.75))
test_index_w <- sample(c(TRUE, FALSE), size = nrow(df_women), replace = TRUE, prob = c(0.25, 0.75))
sim_train_m <- bind_rows(framingham_men_train,df_men[!test_index_m,])
sim_test_m <- bind_rows(framingham_men_test, df_men[test_index_m,])
sim_train_w <- bind_rows(framingham_women_train, df_women[!test_index_w,])
sim_test_w <- bind_rows(framingham_women_test, df_women[test_index_w,])
sim_weight_train_m <- iv_weight(fit_dat = sim_train_m,pred_dat = sim_train_m[sim_train_m$source==1,])
sim_weight_train_w <- iv_weight(fit_dat = sim_train_w,pred_dat = sim_train_w[sim_train_w$source==1,])
# get tailored model
mod_m <- glm(pred_formula, weights = sim_weight_train_m,data = sim_train_m[sim_train_m$source==1,], family= "binomial")
mod_w <- glm(pred_formula, weights = sim_weight_train_w,data = sim_train_w[sim_train_w$source==1,], family= "binomial")
sim_weight_test_m <- iv_weight(fit_dat = sim_test_m,pred_dat = sim_test_m[sim_test_m$source==1,])
sim_weight_test_w <- iv_weight(fit_dat = sim_test_w,pred_dat = sim_test_w[sim_test_w$source==1,])
weighted_eval_m <-
(sim_test_m[sim_test_m$source==1,]$CVD -
ifelse(predict(mod_m,newdata = sim_test_m[sim_test_m$source==1,],type = 'response')>=0.5,1,0) )^2
weighted_eval_w <-
(sim_test_w[sim_test_w$source==1,]$CVD -
ifelse(predict(mod_w,newdata = sim_test_w[sim_test_w$source==1,],type = 'response')>=0.5,1,0) )^2
denominator_m <- sum(sim_weight_test_m * weighted_eval_m)
denominator_w <- sum(sim_weight_test_w * weighted_eval_w)
numerator_m <- sum(sim_test_m$source==1)
numerator_w <- sum(sim_test_w$source==0)
result_m_3[i] <- denominator_m/numerator_m
result_w_3[i] <- denominator_w/numerator_w
}
```
Finally, we compile all the results together into this table. The true result is repersented as the result we get first hand from the individual level data of nhanes. Through comparison, we can see that data generation 1 has the closest model performance in terms of transportability analysis with the true result. The third generation mechanism, which includes testing of univariate distribution, has relative large value of brier scores. This means the model trained on Framingham data generate poorly on the data simulated under this mechanism.
```{r}
# comparison
final_dat <- rbind(c(brier_list_men[6],mean(result_m_1),mean(result_m_2),mean(result_m_3)),c(brier_list_women[6],mean(result_w_1),mean(result_w_2),mean(result_w_3)))
rownames(final_dat) <- c('Men','Women')
colnames(final_dat) <- c('True','Gen_1','Gen_2','Gen_3')
final_dat %>%
kable(booktabs = TRUE, caption = "Average Brier Score Comparison between True and differnt data generation") %>%
kable_styling(full_width = TRUE, latex_options = "hold_position")
```
# Conclusion and Discussion
The successful transportability of the CVD prediction model from the Framingham source population to the NHANES target population is a noteworthy application. The model's high predictive accuracy in both male and female subsets of the NHANES data indicates its potential applicability across diverse populations, offering promise for its real-world utility. The gender-specific analysis also highlights the importance of considering sex-based differences in CVD risk factors, providing valuable insights for tailoring prediction models to specific subgroups within target populations.
However, it's important to acknowledge certain limitations in this analysis. The assumption of source population representatives and the simplifications made in the simulation study might not capture all demographic and lifestyle variations between the Framingham and NHANES datasets. As is also mentioned in the refernce paper the two assumptions: Independence of the outcome Y and the population S, conditional on covariates; and positivity. This can also be partly illustrated in the simulation part that varying distribution assumption of the target data could impact greatly the performance of model that is derived on source population, even though the model gets tailored before applied to the target data.
Furthermore, the sole reliance on the Brier Score as an evaluation metric, without considering additional performance metrics or external validation measures, could provide a limited perspective on model transportability.The report could incorporate other predicitvce measures such as AUC and ROC curves for classification tasks. Only to note the estimator must be tailored and account for the fact that there's potentially no outcome data in the source variable.
In conclusion, while these findings are promising, they underscore the need for further research to refine and validate the model's performance in real-world scenarios with diverse target populations, considering a broader range of evaluation measures and accounting for potential temporal changes.
# Reference