Replace lattice graphs by ggmice alternatives #11

Open
wants to merge 6 commits into base: master
37 changes: 29 additions & 8 deletions Ad_hoc_and_mice/Ad_hoc_methods.Rmd
@@ -19,11 +19,12 @@ No previous experience with `R` is required.

---

**1. Open `R` and load the packages `mice` and `lattice`**
**1. Open `R` and load the packages `mice`, `ggmice`, and `ggplot2`**

```{r, message=FALSE, warning=FALSE}
require(mice)
require(lattice)
```{r setup, message=FALSE, warning=FALSE}
library(mice)
library(ggmice)
library(ggplot2)
set.seed(123)
```

@@ -65,7 +66,7 @@ summary(nhanes)

Check the missingness pattern for the `nhanes` dataset
```{r, cache = FALSE}
md.pattern(nhanes)
plot_pattern(nhanes)
```
The missingness pattern shows that there are 27 missing values in total: 10 for `chl`, 9 for `bmi` and 8 for `hyp`. Moreover, there are thirteen completely observed rows, four rows with one missing value, one row with two missing values and seven rows with three missing values. Looking at the missing data pattern is always useful (though it may be difficult for datasets with many variables). It can give you an indication of how much information is missing and how the missingness is distributed.
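The counts quoted above can be verified directly on the built-in `nhanes` data; a quick check:

```r
library(mice)

# Verify the missingness counts quoted above for the built-in nhanes data
colSums(is.na(nhanes))       # age: 0, bmi: 9, hyp: 8, chl: 10
sum(is.na(nhanes))           # 27 missing values in total
sum(complete.cases(nhanes))  # 13 completely observed rows
```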

@@ -110,9 +111,15 @@ fit <- with(imp, lm(age ~ bmi))
summary(fit)
```
It is clear that nothing changed, but then again this is not surprising, as the variable `bmi` is roughly normally distributed and we are just adding weight to the mean.

The imputed data can be inspected visually by comparing the distribution of the observed data to the distribution of the imputed data:
```{r}
densityplot(nhanes$bmi)
ggmice(nhanes, aes(bmi)) +
geom_histogram(fill = "white")
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
Every missing instance of `bmi` is filled in with the average observed `bmi` value.
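That claim is easy to check on the imputation object itself; a small sketch (re-running the mean imputation so the chunk is self-contained):

```r
library(mice)

# Mean imputation: every imputed bmi value equals the observed mean of bmi
imp <- mice(nhanes, method = "mean", m = 1, maxit = 1, print = FALSE)
unique(unlist(imp$imp$bmi))     # a single distinct value
mean(nhanes$bmi, na.rm = TRUE)  # the observed mean of bmi
```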

---

@@ -139,6 +146,13 @@ summary(fit)
```
It is clear that something has changed. In fact, we extrapolated (part of) the regression model for the observed data to the missing data in `bmi`. In other words, the relation (read: information) gets stronger and we've obtained more observations.

The imputed data can be inspected visually with:
```{r}
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
The imputed values fall within the range of observed `bmi` values.

---


@@ -163,12 +177,19 @@ fit <- with(imp, lm(age ~ bmi))
summary(fit)
```

The imputed data can be inspected visually with:
```{r}
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
This imputation method better reflects the uncertainty in the imputed data than previous methods.

---

**12. Re-run the stochastic imputation model with seed `123` and verify if your results are the same as the ones below**

```{r, echo=FALSE, warning=FALSE, message=FALSE}
imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123, print=F)
imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123, print = FALSE)
fit <- with(imp, lm(age ~ bmi))
summary(fit)
```
@@ -217,7 +238,7 @@ imp$imp
By default, `mice()` calculates five (*m* = 5) imputed data sets. In order to get the third imputed data set, use the `complete()` function
```{r, cache=TRUE}
c3 <- complete(imp, 3)
md.pattern(c3)
plot_pattern(c3)
```
The collection of the $m$ imputed data sets can be exported by function
`complete()` in long, broad and repeated formats.
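The three export formats mentioned above can be sketched as follows (assuming `imp` holds the default *m* = 5 imputations of `nhanes`):

```r
library(mice)

imp <- mice(nhanes, m = 5, print = FALSE, seed = 123)

long     <- complete(imp, "long")      # m * n rows, stacked, with .imp and .id columns
broad    <- complete(imp, "broad")     # n rows, variables repeated side by side per imputation
repeated <- complete(imp, "repeated")  # n rows, columns grouped per variable

nrow(long)  # 125 rows: 5 imputations x 25 rows
```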
77 changes: 42 additions & 35 deletions Combining_inferences/Combining inferences.Rmd
@@ -51,11 +51,12 @@ All the best,
---

**1. Open `R` and load the following packages and fix the random seed.**
```{r message=FALSE, warning=FALSE}
library(mice) # Data imputation
library(dplyr) # Data manipulation
library(magrittr) # Flexible piping in R
library(purrr) # Flexible functional programming
```{r setup, message=FALSE, warning=FALSE}
library(mice) # Data imputation
library(ggmice) # Data visualization
library(ggplot2) # Data visualization
library(dplyr) # Data manipulation
library(purrr) # Flexible functional programming
set.seed(123)
```
We choose seed value `123`. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your `R` instance to `123`, you will get the exact same results and plots as we present in this document if you follow the order and code of the exercises precisely.
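The reproducibility claim is easy to verify in base `R`:

```r
# Fixing the seed makes random draws exactly reproducible
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)  # TRUE
```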
@@ -82,9 +83,15 @@ pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0
pred
```
and we run the `mice` algorithm again with the new predictor matrix (we still 'borrow' the imputation methods object `meth` from before)

The predictor matrix can be visualized with
```{r}
plot_pred(pred, method = meth)
```

We run the `mice` algorithm again with the new predictor matrix (we still 'borrow' the imputation methods object `meth` from before)
```{r}
imp <-mice(boys,
imp <- mice(boys,
meth = meth,
pred = pred,
print = FALSE,
@@ -110,18 +117,18 @@ Quite often people are suggesting that using the average imputed dataset - so ta

To demonstrate this, let's create the averaged data set and exclude the non-numerical columns:
```{r warning=FALSE}
ave <- imp %>%
mice::complete("long") %>%
group_by(.id) %>%
summarise_all(.funs = mean) %>%
ave <- imp |>
mice::complete("long") |>
group_by(.id) |>
summarise_all(.funs = mean) |>
select(-.id, -.imp, -phb, -gen, -reg)

head(ave)
```
If we now calculate Pearson's correlation, rounded to two digits:
```{r}
cor.wrong <- ave %>%
cor() %>%
cor.wrong <- ave |>
cor() |>
round(digits = 2)
```
we obtain:
@@ -140,10 +147,10 @@ fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)
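Only `fisher.backtrans` is visible in the diff context above; assuming `fisher.trans` is the standard Fisher z-transform, the two functions form a round trip, which can be sketched and checked as:

```r
# Assumed definition of fisher.trans (the standard Fisher z-transform);
# fisher.backtrans is its inverse (equivalent to tanh)
fisher.trans <- function(x) 0.5 * log((1 + x) / (1 - x))
fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)

r <- 0.6
all.equal(fisher.backtrans(fisher.trans(r)), r)  # TRUE
```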

Now, to calculate the correlation on the imputed data
```{r}
cor <- imp %>%
mice::complete("all") %>%
map(select, -phb, -gen, -reg) %>%
map(stats::cor) %>%
cor <- imp |>
mice::complete("all") |>
map(select, -phb, -gen, -reg) |>
map(stats::cor) |>
map(fisher.trans)
cor
```
@@ -199,7 +206,7 @@ So, please stay away from averaging the imputed data sets. Instead, use the corr
- `lm(age ~ wgt + hgt)`

```{r}
fit1.lm <- imp %>%
fit1.lm <- imp |>
with(lm(age ~ wgt + hgt))

est1.lm <- pool(fit1.lm)
Expand All @@ -213,7 +220,7 @@ summary(est1.lm)

- `lm(age ~ wgt + hgt + I(hgt^2))`
```{r}
fit2.lm <- imp %>%
fit2.lm <- imp |>
with(lm(age ~ wgt + hgt + I(hgt^2)))

est2.lm <- pool(fit2.lm)
@@ -255,7 +262,7 @@ For large sample size, $D_3$ is equivalent to $D_1$, however, $D_3$ does not req

---

**7. Fit a stepwise linear model to predict `hgt` seperately to each of the imputed data sets.**
**7. Fit a stepwise linear model to predict `hgt` separately to each of the imputed data sets.**

We can use the `step()` function in `R` to select formula-based models. We start by specifying the scope of the evaluations:
```{r}
Expand Down Expand Up @@ -295,10 +302,10 @@ D1(fit.gen, fit.nogen)
```
With a p-value of `.059` we could conclude that `gen` is not strictly needed in this model. We might also investigate the BIC on the separate imputed sets and compare those across (not within) models. We draw the same conclusion from this evaluation - the BIC is lower for the model without `gen` - although not by much. But then again, the p-value indicated the same trend.
```{r}
BIC.gen <- fit.gen$analyses %>%
BIC.gen <- fit.gen$analyses |>
sapply(BIC)

BIC.nogen <- fit.nogen$analyses %>%
BIC.nogen <- fit.nogen$analyses |>
sapply(BIC)
```

@@ -320,10 +327,10 @@ Please note that we can compare the BIC only over the models, not over the impute

To study the means for `bmi` conditionally on `reg` to get a picture of what the differences are:
```{r}
imp %>%
mice::complete("long") %>%
select(reg, bmi) %>%
group_by(reg) %>%
imp |>
mice::complete("long") |>
select(reg, bmi) |>
group_by(reg) |>
summarise_all(.funs = mean)
```

@@ -335,10 +342,10 @@ se <- function(x){
```
and then calculate the summary again
```{r}
imp %>%
mice::complete("long") %>%
select(reg, bmi) %>%
group_by(reg) %>%
imp |>
mice::complete("long") |>
select(reg, bmi) |>
group_by(reg) |>
summarise_all(list(~ mean(.), ~se(.)))
```
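The helper `se()` is only partially visible in the diff above; a plausible completion (assuming the usual standard error of the mean) is:

```r
# Assumed completion of the truncated se() helper: standard error of the mean
se <- function(x) {
  sd(x) / sqrt(length(x))
}

se(c(2, 4, 4, 4, 5, 5, 7, 9))  # about 0.756
```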

@@ -352,14 +359,14 @@ imp %>%

This is best done with an ANOVA. To do this correctly, we can apply the following workflow. First, we fit an intercept only model
```{r}
fit.empty <- imp %>%
mice::complete("all") %>%
fit.empty <- imp |>
mice::complete("all") |>
map(lm, formula = bmi ~ 1)
```
and then we fit the model with `reg` as a predictor
```{r}
fit.reg <- imp %>%
mice::complete("all") %>%
fit.reg <- imp |>
mice::complete("all") |>
map(lm, formula = bmi ~ 1 + reg)
```
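The two nested model lists can then be compared with a pooled Wald test; one self-contained sketch, assuming default `mice()` settings on `boys` (unlike the tailored `meth`/`pred` setup earlier) and `mice::as.mira()` to wrap the fitted lists:

```r
library(mice)
library(purrr)

# Self-contained sketch: default imputation settings assumed for boys
imp <- mice(boys, m = 5, print = FALSE, seed = 123)

fit.empty <- imp |> mice::complete("all") |> map(lm, formula = bmi ~ 1)
fit.reg   <- imp |> mice::complete("all") |> map(lm, formula = bmi ~ 1 + reg)

# Wrap the per-imputation fits as mira objects and compare with a pooled Wald test
D1(as.mira(fit.reg), as.mira(fit.empty))
```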

@@ -393,6 +400,6 @@ And find that indeed the overall (i.e. pooled) effect over the imputations is al
---


**- End of practical**
**- End of vignette**

---
Loading