simulation.Rmd

---
title: "Causal2 Project Part2: Simulation"
author: "Namita Trikannad"
date: "11/21/2019"
output: pdf_document
---

```{r setup, include=FALSE}
# set default knitr chunks
knitr::opts_chunk$set(
  tidy.opts=list(width.cutoff=60),tidy=TRUE,
  echo = FALSE,  # don't print the code chunk
  results = 'hide',
  warning = FALSE,  # don't print warnings
  message = FALSE,  # don't print messages
  fig.width = 6,  # set default width of figures
  fig.height = 4,  # set default height of figures
  fig.align = "center",  # always align figure in center
  fig.pos = "H",  # always plot figure at the exact location of the code chunk
  cache = T)  # don't cache results  #CHANGE THIS TO "FALSE" AFTER YOU'RE DONE !!

library(tidyverse)
library(data.table)
library(ltmle)
library(SuperLearner)
library(kableExtra)
```

```{r}
quiet_ltmle <- function(...){suppressMessages(ltmle::ltmle(...))}
```

```{r, loadData}
load("data/female_SCM_imputed.RData")
```

```{r, exploreData}
skimr::skim(female_SCM)
str(female_SCM)
summary(female_SCM)
```

Explanation of variable names: 
1. `"CODENUM"`: Identification number 
2. `"cohort"`: Indicates the cohort to which the individual belongs (90 or 91)
3. : Binary indicating whether individual was victimized (1) or not (0) at time t 
4. : Binary indicating suicide attempt or ideation was made (1) or not (0) at time t 
5. : Binary indicating whether in a relationship (1) or not (0) status at time t 
6. : Binary indicating individual is religious (1) or not (0) at time t 
9. : Binary indicating whether (1) or not (0) the victim knew other women who were sexually victimized 
10. : Binary indicating drug use (1) or not (0) at time t 
11. : Binary indicating alcohol (yes for 1 and no for 0) use at time t 
7. : Continuous scale showing psychological well being (higher number indicates well being)

12. 

```{r, cleaning}
# This steps converts logical variables to binary
cols <-  which(sapply(female_SCM, is.logical))
setDT(female_SCM)
for(j in cols){
set(female_SCM, i=NULL, j=j, value= as.numeric(female_SCM[[j]]))
}

# We have haven_labelled class for the 'psy' variables which don't work with ggplot
# So fix those below 

for (i in 1:4) {
  female_SCM[[paste0("psywell",i)]] <- as.numeric(female_SCM[[paste0("psywell",i)]])
}

female_SCM$consent <- as.numeric(female_SCM$consent)
female_SCM$race <- as.numeric(female_SCM$race)

# check
colSums(is.na(female_SCM))
```

Simulation: See also “For Your Project” questions from R lab 1. 
i.	Describe your simulation: We have the the following types of variables in our data and they were simulated as follows: 
*All baseline covariates:* `"race", "sxorient", "consent", "forced", "forced_att", "sex_bef14"`

*Binary variables in the data:* All of the below variables were generated as binomial random variables with `p` as the mean of this variable in the observed data 
A(t) : Victimization `"victim"`
L1 (t) : Suicidal tendency or ideation observed `"suicide"`
L2 (t) : Relationship status `"relation"`
L3 (t) : Religious status `"religious`
L4 (t) : Whether the victim knew other women who were victimized `"victfrnd"`
L5 (t) : Drug use `"drug"`
L6 (t) : Alcohol use `"alcohol"`
C(t) : Censored at the time t 

*Continuous variables in the data:* The following two variables show somewhat normal distribution and thefore were generated as normal distribution with mean and variance as that of the original data. 
L7 (t) : Psychological well being on a scale (higher value indicates well being) `"psywell"`

*Discrete variable:*
Y : Outcome, which is accrued number of times suicidal tendency or ideation was recorded at the end of the study (Sum of L1(t) at each t = 1,2,3,4)

# EDA (before simulation)
```{r someEDA}
summary(female_SCM)


#Example for psywell
psy_plots <- list()
for (i in 1:4){
  psy_plots[[i]] <- local({
        i <- i
  ggplot(female_SCM) + 
    geom_histogram(aes(x = get(paste0("psywell",i))), color = "black", alpha = "0.5") +
    labs(x = sprintf("psywell%s",i)) +
    theme(plot.title = element_text(size =14, face = "bold", hjust = 0.5)) 
  })
}

ggarrange(psy_plots[[1]], psy_plots[[2]], psy_plots[[3]], psy_plots[[4]]) %>%
  annotate_figure(top = "Psychological Well Being Score Distribution Across Year 1-4")

female_SCM_copy <- female_SCM
colnames(female_SCM_copy) <- c("W1", "W2", "W3", "W4", "W5", "W6", "L1_0", 
                               "L2_0", "L3_0", "L4_0", "L5_0", "L6_0", "L7_0", 
                               "C1", "A1", "L1_1", "L2_1", "L3_1", "L7_1", 
                               "L4_1", "L5_1", "L6_1", "C2", "A2", "L1_2", 
                               "L2_2", "L3_2", "L7_2", "L4_2", "L5_2", "L6_2", 
                               "C3", "A3", "L1_3", "L2_3", "L3_3", "L7_3", 
                               "L4_3", "L5_3", "L6_3", "C4", "A4", "Y")

# Fit models fo suicide L1 at each time point 
L1_1_lm <- glm(L1_1 ~ W1 + W2 + W3 + W4 + W5 + W6 + L1_0 + L2_0 + L3_0 + L4_0 + 
                 L5_0 + L6_0 + L7_0 + A1 + C1, data = female_SCM_copy)
L1_2_lm <- glm(L1_2 ~ W1 + W2 + W3 + W4 + W5 + W6 + L1_0 + L2_0 + L3_0 + L4_0 + 
                 L5_0 + L6_0 + L7_0 + A1 + C1 + L1_1 + L2_1 + L3_1 + L4_1 + 
                 L5_1 + L6_1 + L7_1 + A2 + C2, data = female_SCM_copy)
L1_3_lm <- glm(L1_3 ~ W1 + W2 + W3 + W4 + W5 + W6 + L1_0 + L2_0 + L3_0 + L4_0 + 
                 L5_0 + L6_0 + L7_0 + A1 + C1 + L1_1 + L2_1 + L3_1 + L4_1 + 
                 L5_1 + L6_1 + L7_1 + A2 + C2 + L1_2 + L2_2 + L3_2 + L4_2 + 
                 L5_2 + L6_2 + L7_2 + C3 + A3, data = female_SCM_copy)
L1_4_lm <- glm(L1_3 ~ W1 + W2 + W3 + W4 + W5 + W6 + L1_0 + L2_0 + L3_0 + L4_0 + 
                 L5_0 + L6_0 + L7_0 + A1 + C1 + L1_1 + L2_1 + L3_1 + L4_1 + 
                 L5_1 + L6_1 + L7_1 + A2 + C2 + L1_2 + L2_2 + L3_2 + L4_2 + 
                 L5_2 + L6_2 + L7_2 + C3 + A3 + C4 + A4, data = female_SCM_copy)
```

# Simulation: 
Generate data: 
```{r generate}
generate_data_intervene <- function(n) {
  
 df <- data.frame( # W:
                   W1 = rbinom(n, 1, prob = 0.75) + 1,
                   W2 = rbinom(n, 1, prob = 0.03),
                   W3 = rbinom(n, 1, prob = 0.306) + 1,
                   W4 = rbinom(n, 1, prob = 0.47),
                   W5 = rbinom(n, 1, prob = 0.207),
                   W6 = rbinom(n, 1, prob = 0.412),
   
   # Intervention node: prob based on the mean of victim in that year
                   A1 = rbinom(n, 1, prob = 0.6), 
                   A2 = rbinom(n, 1, prob = 0.4),
                   A3 = rbinom(n, 1, prob = 0.3),
                   A4 = rbinom(n, 1, prob = 0.3),
                   
                   # Suicidal tendency: prob based on the mean of suicide in that year
                   L1_0 = rbinom(n, 1, prob = 0.1),
                   #L1_2 = rbinom(n, 1, prob = 0.1),
                   #L1_3 = rbinom(n, 1, prob = 0.1),
                   #L1_4 = rbinom(n, 1, prob = 0.06),
                  
                   
                   # Relationship status: prob based on the mean of relation in that year
                   L2_0 = rbinom(n, 1, prob = 0.25),
                   L2_1 = rbinom(n, 1, prob = 0.5),
                   L2_2 = rbinom(n, 1, prob = 0.5),
                   L2_3 = rbinom(n, 1, prob = 0.5),
                   
                   # Religious status: prob based on the mean of religion in that year
                   L3_0 = rbinom(n, 1, prob = 0.7),
                   L3_1 = rbinom(n, 1, prob = 0.7),
                   L3_2 = rbinom(n, 1, prob = 0.7),
                   L3_3 = rbinom(n, 1, prob = 0.7),
                   
                   # Victfriend: prob based on the mean of victfriend in that year
                   L4_0 =  rbinom(n, 1, prob = 0.6),
                   L4_1 = rbinom(n, 1, prob = 0.5),
                   L4_2 = rbinom(n, 1, prob = 0.4),
                   L4_3 =  rbinom(n, 1, prob = 0.2),
                   
                   # Drug use: prob based on mean of drug in that year
                   L5_0 = rbinom(n, 1, prob = 0.2),
                   L5_1 = rbinom(n, 1, prob = 0.2),
                   L5_2 = rbinom(n, 1, prob = 0.2),
                   L5_3 = rbinom(n, 1, prob = 0.2),
                   
                   # Alcohol use: based on the mean of alcohol in that year 
                   L6_0 = rbinom(n, 1, prob = 0.75),
                   L6_1 = rbinom(n, 1, prob = 0.75),
                   L6_2 = rbinom(n, 1, prob = 0.75),
                   L6_3 = rbinom(n, 1, prob = 0.75),
                   
                   # Psychological well being: based on mean of psywell in that year 
                   L7_0 = rnorm(n, 3.2, 0.8),
                   L7_1 = rnorm(n, 3.2, 0.8),
                   L7_2 = rnorm(n, 3.5, 0.8),
                   L7_3 = rnorm(n, 3.6, 0.6),
                   
                   # Add in censoring nodes 
                   C1 = rbinom(n, 1, prob = 0.001),
                   C2 = rbinom(n, 1, prob = 0.3),
                   C3 = rbinom(n, 1, prob = 0.4), 
                   C4 = rbinom(n, 1, prob = 0.6)
 )
 
 return(df) 
 }

df <- generate_data_intervene(n = 2000)

df$L1_1 <- ifelse(predict(L1_1_lm, newdata = df) > 0.5, 1, 0)
df$L1_2 <- ifelse(predict(L1_2_lm, newdata = df) > 0.5, 1, 0)
df$L1_3 <- ifelse(predict(L1_3_lm, newdata = df) > 0.5, 1, 0)
df$L1_4 <- ifelse(predict(L1_4_lm, newdata = df) > 0.5, 1, 0)


df <- df %>%
  mutate(Y = L1_1 + L1_2 + L1_3 + L1_4)


# Intervene on cbar == 0 and abar == 1 
df_1111 <- df[,!colnames(df) %in% c("Y")]

df_1111$A1 <- 1
df_1111$A2 <- 1
df_1111$A3 <- 1
df_1111$A4 <- 1
df_1111$C1 <- 0
df_1111$C2 <- 0
df_1111$C3 <- 0
df_1111$C4 <- 0

df_1111$L1_1 <- ifelse(predict(L1_1_lm, newdata = df_1111) > 0.5, 1, 0)
df_1111$L1_2 <- ifelse(predict(L1_2_lm, newdata = df_1111) > 0.5, 1, 0)
df_1111$L1_3 <- ifelse(predict(L1_3_lm, newdata = df_1111) > 0.5, 1, 0)
df_1111$L1_4 <- ifelse(predict(L1_4_lm, newdata = df_1111) > 0.5, 1, 0)
 
df_1111 <- df_1111 %>%
  mutate(Y = L1_1 + L1_2 + L1_3 + L1_4)
 
table(df_1111$Y) 

# Intervene on cbar == 0 and cbar == 0 
df_0000 <- df[,!colnames(df) %in% c("Y")]

df_0000$A1 <- 0
df_0000$A2 <- 0
df_0000$A3 <- 0
df_0000$A4 <- 0
df_0000$C1 <- 0
df_0000$C2 <- 0
df_0000$C3 <- 0
df_0000$C4 <- 0
df_0000$L1_1 <- ifelse(predict(L1_1_lm, newdata = df_0000) > 0.5, 1, 0)
df_0000$L1_2 <- ifelse(predict(L1_2_lm, newdata = df_0000) > 0.5, 1, 0)
df_0000$L1_3 <- ifelse(predict(L1_3_lm, newdata = df_0000) > 0.5, 1, 0)
df_0000$L1_4 <- ifelse(predict(L1_4_lm, newdata = df_0000) > 0.5, 1, 0)

 
df_0000 <- df_0000 %>%
  mutate(Y = L1_1 + L1_2 + L1_3 + L1_4)

table(df_0000$Y) 

Psi.F <- mean(df_1111$Y) - mean(df_0000$Y)
Psi.F

```

Implement tmle: 
```{r ltmle}

# Binarytocensoring
  
  generate_data <- function(n) {
  
 df <- data.frame( # W:
                   W1 = rbinom(n, 1, prob = 0.75) + 1,
                   W2 = rbinom(n, 1, prob = 0.03),
                   W3 = rbinom(n, 1, prob = 0.306) + 1,
                   W4 = rbinom(n, 1, prob = 0.47),
                   W5 = rbinom(n, 1, prob = 0.207),
                   W6 = rbinom(n, 1, prob = 0.412),
   
   # Intervention node: prob based on the mean of victim in that year
                   A1 = rbinom(n, 1, prob = 0.6), 
                   A2 = rbinom(n, 1, prob = 0.4),
                   A3 = rbinom(n, 1, prob = 0.3),
                   A4 = rbinom(n, 1, prob = 0.3),
                   
                   # Suicidal tendency: prob based on the mean of suicide in that year
                   L1_0 = rbinom(n, 1, prob = 0.1),
                   L1_1 = rbinom(n, 1, prob = 0.1),
                   L1_2 = rbinom(n, 1, prob = 0.1),
                   L1_3 = rbinom(n, 1, prob = 0.06),
                   L1_4 = rbinom(n, 1, prob = 0.058),
                  
                   
                  # Relationship status: prob based on the mean of relation in that year
                   L2_0 = rbinom(n, 1, prob = 0.25),
                   L2_1 = rbinom(n, 1, prob = 0.5),
                   L2_2 = rbinom(n, 1, prob = 0.5),
                   L2_3 = rbinom(n, 1, prob = 0.5),
                   
                   # Religious status: prob based on the mean of religion in that year
                   L3_0 = rbinom(n, 1, prob = 0.7),
                   L3_1 = rbinom(n, 1, prob = 0.7),
                   L3_2 = rbinom(n, 1, prob = 0.7),
                   L3_3 = rbinom(n, 1, prob = 0.7),
                   
                   # Victfriend: prob based on the mean of victfriend in that year
                   L4_0 =  rbinom(n, 1, prob = 0.6),
                   L4_1 = rbinom(n, 1, prob = 0.5),
                   L4_2 = rbinom(n, 1, prob = 0.4),
                   L4_3 =  rbinom(n, 1, prob = 0.2),
                   
                   # Drug use: prob based on mean of drug in that year
                   L5_0 = rbinom(n, 1, prob = 0.2),
                   L5_1 = rbinom(n, 1, prob = 0.2),
                   L5_2 = rbinom(n, 1, prob = 0.2),
                   L5_3 = rbinom(n, 1, prob = 0.2),
                   
                   # Alcohol use: based on the mean of alcohol in that year 
                   L6_0 = rbinom(n, 1, prob = 0.75),
                   L6_1 = rbinom(n, 1, prob = 0.75),
                   L6_2 = rbinom(n, 1, prob = 0.75),
                   L6_3 = rbinom(n, 1, prob = 0.75),
                   
                   # Psychological well being: based on mean of psywell in that year 
                   L7_0 = rnorm(n, 3.2, 0.8),
                   L7_1 = rnorm(n, 3.2, 0.8),
                   L7_2 = rnorm(n, 3.5, 0.8),
                   L7_3 = rnorm(n, 3.6, 0.6),
                   
                   # Add in censoring nodes 
                   C1 = rbinom(n, 1, prob = 0.001),
                   C2 = rbinom(n, 1, prob = 0.3),
                   C3 = rbinom(n, 1, prob = 0.4), 
                   C4 = rbinom(n, 1, prob = 0.6)
 )
 
 return(df) 
 }

  df <- generate_data(n = 2000)
  
  df <- df %>%
    mutate(Y = L1_1 + L1_2 + L1_3 + L1_4,
    C1 = BinaryToCensoring(is.censored = C1),
    C2 = BinaryToCensoring(is.censored = C2),
    C3 = BinaryToCensoring(is.censored = C3),
    C4 = BinaryToCensoring(is.censored = C4))

# Order for ltmle
df <- df %>%
  select(W1, W2, W3, W4, W5, W6, L1_0, L2_0, L3_0, L4_0, L5_0, L6_0, L7_0,
         L1_1, L2_1, L3_1, L4_1, L5_1, L6_1, L7_1, C1, A1, 
         L1_2, L2_2, L3_2, L4_2, L5_2, L6_2, L7_2, C2, A2,    
         L1_3, L2_3, L3_3, L4_3, L5_3, L6_3, L7_3, C3, A3,
         C4, A4, Y)


sl_lib <- c("SL.glm", "SL.mean") # SL.step takes 60 minutes

Anodes <- c("A1","A2","A3","A4")

Lnodes <-  c("L1_1", "L2_1", "L3_1", "L4_1", "L5_1", "L6_1", "L7_1", "L1_2", 
            "L2_2", "L3_2", "L4_2", "L5_2", "L6_2", "L7_2", 
            "L1_3", "L2_3", "L3_3", "L4_3", "L5_3", "L6_3", "L7_3")

Cnodes <- c("C1", "C2", "C3", "C4")

# Estimate without G-comp 
est <- ltmle(data = df,
             Anodes = Anodes, 
             Cnodes = Cnodes,
             Lnodes = Lnodes, 
             Ynodes = "Y",
             abar = list(c(1, 1, 1, 1), c(0, 0, 0, 0)),
             gcomp = F,
             SL.library = sl_lib)

# Estimate with G-comp

est1 <- ltmle(data = df,
              Anodes = Anodes,
              Lnodes = Lnodes, 
              Cnodes = Cnodes,
              Ynodes = "Y",
              abar = list(c(1, 1, 1, 1), c(0, 0, 0, 0)),
              gcomp = T,
              SL.library = sl_lib)


```

```{r}
summary(est, "tmle")
summary(est, "iptw")
summary(est1, "gcomp")

data.frame("Estimator" = c("G-Comp", "IPTW", "TMLE"),
           "Estimate" = c("0.0057167", "-0.42626", "-0.30835"),
           "95% CI" = c("(-0.060642, 0.072075)", "(-0.61963, -0.23289) ", "(-0.36678, -0.24992)"))
```

Stability:
```{r}
Psi.F <- 0.054
performance_sim = function() {
  
  n <- 2000
  
  ObsData <-  generate_data(n)
  
   ObsData <- ObsData %>%
    mutate(Y = L1_1 + L1_2 + L1_3 + L1_4,
    C1 = BinaryToCensoring(is.censored = C1),
    C2 = BinaryToCensoring(is.censored = C2),
    C3 = BinaryToCensoring(is.censored = C3),
    C4 = BinaryToCensoring(is.censored = C4))
   
  ObsData <- ObsData %>%
  select(W1, W2, W3, W4, W5, W6, L1_0, L2_0, L3_0, L4_0, L5_0, L6_0, L7_0,
         L1_1, L2_1, L3_1, L4_1, L5_1, L6_1, L7_1, C1, A1, 
         L1_2, L2_2, L3_2, L4_2, L5_2, L6_2, L7_2, C2, A2,    
         L1_3, L2_3, L3_3, L4_3, L5_3, L6_3, L7_3, C3, A3,
         C4, A4, Y)

sl_lib <- c("SL.glm", "SL.mean") # SL.step takes 60 minutes

Anodes <- c("A1","A2","A3","A4")

Lnodes <-  c("L1_1", "L2_1", "L3_1", "L4_1", "L5_1", "L6_1", "L7_1", "L1_2", 
            "L2_2", "L3_2", "L4_2", "L5_2", "L6_2", "L7_2", 
            "L1_3", "L2_3", "L3_3", "L4_3", "L5_3", "L6_3", "L7_3")

Cnodes <- c("C1", "C2", "C3", "C4")
   
  results <- quiet_ltmle(data = ObsData,
             Anodes = Anodes, 
             Cnodes = Cnodes,
             Lnodes = Lnodes, 
             Ynodes = 'Y',
             abar = list(c(1, 1, 1, 1), c(0, 0, 0, 0)),
             gcomp = F,
             SL.library = sl_lib)
  
  results.gcomp <- quiet_ltmle(data = ObsData,
             Anodes = Anodes, 
             Cnodes = Cnodes,
             Lnodes = Lnodes, 
             Ynodes = 'Y',
             abar = list(c(1, 1, 1, 1), c(0, 0, 0, 0)),
             gcomp = T,
             SL.library = sl_lib)
  
  sum.results.tmle = summary(results, "tmle")
  sum.results.iptw = summary(results, "iptw")
  sum.results.gcomp = summary(results.gcomp, "gcomp")
  
  tmle <- sum.results.tmle$effect.measures$ATE$estimate
  
  tmle.cov = sum.results.tmle$effect.measures$ATE$CI[1] < Psi.F &sum.results.tmle$effect.measures$ATE$CI[2] > Psi.F
  
  iptw <- sum.results.iptw$effect.measures$ATE$estimate
  
  iptw.cov = sum.results.iptw$effect.measures$ATE$CI[1] < Psi.F &sum.results.iptw$effect.measures$ATE$CI[2] > Psi.F
  
  gcomp <- sum.results.gcomp$effect.measures$ATE$estimate
  
  gcomp.cov = sum.results.gcomp$effect.measures$ATE$CI[1] < Psi.F &sum.results.gcomp$effect.measures$ATE$CI[2] > Psi.F
  
  estimates <- c(gcomp = gcomp,
                 iptw = iptw,
                 iptw.cov = iptw.cov,
                 tmle = tmle,
                 tmle.cov = tmle.cov)
  
  return(estimates)}

estimates_sim <- t(replicate(100, performance_sim()))
estimates_sim <- data.frame(estimates_sim)

save(estimates_sim, file = "analysis/sim_performance.RData")

#estimates_mat = estimates_sim
#truthvector = Psi.F

#Bias
bias_gcomp <- round(mean(estimates_sim$gcomp) - Psi.F, 4)
bias_iptw <- round(mean(estimates_sim$iptw, na.rm=T) - Psi.F, 4)
bias_tmle <-round(mean(estimates_sim$tmle) - Psi.F, 2)

#variance
var_gcomp <- round(var(estimates_sim$gcomp), 4)
var_iptw <- round(var(estimates_sim$iptw, na.rm = T), 4)
var_tmle <- round(var(estimates_sim$tmle), 4)

#MSE
mse_gcomp <- round(bias_gcomp^2 + var_gcomp, 4)
mse_iptw <- round(bias_iptw^2 + var_iptw, 4)
mse_tmle <- round(bias_tmle^2 + var_tmle, 4)

#Confidence interval
CI_gcomp <- "-"
CI_iptw <- round(mean(estimates_sim$iptw.cov, na.rm=T), 4)
CI_tmle <- round(mean(estimates_sim$tmle.cov), 4)

perf_metrics <- data.frame("Bias" = c(bias_gcomp, bias_iptw, bias_tmle),
                           "Variance" = c(var_gcomp, var_iptw, var_tmle),
                           "MSE" = c(mse_gcomp, mse_iptw, mse_tmle),
                           "Coverage" = c(CI_gcomp, CI_iptw, CI_tmle))
rownames(perf_metrics) <- c("G-Comp", "IPTW", "TMLE")
  

write.csv(perf_metrics, file = ("analysis/sim_metrics.csv"))
write.csv(estimates_sim, file = ("analysis/sim_results.csv"))

```