Replace lattice graphs by ggmice alternatives #11

Open
wants to merge 6 commits into base: master
37 changes: 29 additions & 8 deletions Ad_hoc_and_mice/Ad_hoc_methods.Rmd
@@ -19,11 +19,12 @@ No previous experience with `R` is required.

---

**1. Open `R` and load the packages `mice` and `lattice`**
**1. Open `R` and load the packages `mice`, `ggmice`, and `ggplot2`**

```{r, message=FALSE, warning=FALSE}
require(mice)
require(lattice)
```{r setup, message=FALSE, warning=FALSE}
library(mice)
library(ggmice)
library(ggplot2)
set.seed(123)
```

@@ -65,7 +66,7 @@ summary(nhanes)

Check the missingness pattern for the `nhanes` dataset
```{r, cache = FALSE}
md.pattern(nhanes)
plot_pattern(nhanes)
```
The missingness pattern shows that there are 27 missing values in total: 10 for `chl`, 9 for `bmi` and 8 for `hyp`. Moreover, there are thirteen completely observed rows, four rows with one missing value, one row with two missing values and seven rows with three missing values. Looking at the missing data pattern is always useful (though it may be difficult for datasets with many variables). It can give you an indication of how much information is missing and how the missingness is distributed.
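The counts quoted above can be verified directly on the built-in `nhanes` data; a quick check:

```r
library(mice)

# Verify the missingness counts quoted above for the built-in nhanes data
colSums(is.na(nhanes))       # age: 0, bmi: 9, hyp: 8, chl: 10
sum(is.na(nhanes))           # 27 missing values in total
sum(complete.cases(nhanes))  # 13 completely observed rows
```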

@@ -110,9 +111,15 @@ fit <- with(imp, lm(age ~ bmi))
summary(fit)
```
It is clear that nothing changed, but then again this is not surprising, as the variable `bmi` is roughly normally distributed and we are just adding weight to the mean.

The imputed data can be inspected visually by comparing the distribution of the observed data to the distribution of the imputed data:
```{r}
densityplot(nhanes$bmi)
ggmice(nhanes, aes(bmi)) +
geom_histogram(fill = "white")
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
Every missing instance of `bmi` is filled in with the average observed `bmi` value.
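That claim is easy to check on the imputation object itself; a small sketch (re-running the mean imputation so the chunk is self-contained):

```r
library(mice)

# Mean imputation: every imputed bmi value equals the observed mean of bmi
imp <- mice(nhanes, method = "mean", m = 1, maxit = 1, print = FALSE)
unique(unlist(imp$imp$bmi))     # a single distinct value
mean(nhanes$bmi, na.rm = TRUE)  # the observed mean of bmi
```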

---

@@ -139,6 +146,13 @@ summary(fit)
```
It is clear that something has changed. In fact, we extrapolated (part of) the regression model for the observed data to the missing data in `bmi`. In other words, the relation (read: information) gets stronger and we've obtained more observations.

The imputed data can be inspected visually with:
```{r}
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
The imputed values fall within the range of observed `bmi` values.

---


@@ -163,12 +177,19 @@ fit <- with(imp, lm(age ~ bmi))
summary(fit)
```

The imputed data can be inspected visually with:
```{r}
ggmice(imp, aes(bmi)) +
geom_histogram(fill = "white")
```
This imputation method better reflects the uncertainty in the imputed data than previous methods.

---

**12. Re-run the stochastic imputation model with seed `123` and verify if your results are the same as the ones below**

```{r, echo=FALSE, warning=FALSE, message=FALSE}
imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123, print=F)
imp <- mice(nhanes, method = "norm.nob", m = 1, maxit = 1, seed = 123, print = FALSE)
fit <- with(imp, lm(age ~ bmi))
summary(fit)
```
@@ -217,7 +238,7 @@ imp$imp
By default, `mice()` calculates five (*m* = 5) imputed data sets. In order to get the third imputed data set, use the `complete()` function
```{r, cache=TRUE}
c3 <- complete(imp, 3)
md.pattern(c3)
plot_pattern(c3)
```
The collection of the $m$ imputed data sets can be exported by function
`complete()` in long, broad and repeated formats.
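The three export formats mentioned above can be sketched as follows (assuming `imp` holds the default *m* = 5 imputations of `nhanes`):

```r
library(mice)

imp <- mice(nhanes, m = 5, print = FALSE, seed = 123)

long     <- complete(imp, "long")      # m * n rows, stacked, with .imp and .id columns
broad    <- complete(imp, "broad")     # n rows, variables repeated side by side per imputation
repeated <- complete(imp, "repeated")  # n rows, columns grouped per variable

nrow(long)  # 125 rows: 5 imputations x 25 rows
```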
77 changes: 42 additions & 35 deletions Combining_inferences/Combining inferences.Rmd
@@ -51,11 +51,12 @@ All the best,
---

**1. Open `R` and load the following packages and fix the random seed.**
```{r message=FALSE, warning=FALSE}
library(mice) # Data imputation
library(dplyr) # Data manipulation
library(magrittr) # Flexible piping in R
library(purrr) # Flexible functional programming
```{r setup, message=FALSE, warning=FALSE}
library(mice) # Data imputation
library(ggmice) # Data visualization
library(ggplot2) # Data visualization
library(dplyr) # Data manipulation
library(purrr) # Flexible functional programming
set.seed(123)
```
We choose seed value `123`. This is an arbitrary value; any value would be an equally good seed value. Fixing the random seed enables you (and others) to exactly replicate anything that involves random number generators. If you set the seed in your `R` instance to `123`, you will get the exact same results and plots as we present in this document if you follow the order and code of the exercises precisely.
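The reproducibility claim is easy to verify in base `R`:

```r
# Fixing the seed makes random draws exactly reproducible
set.seed(123)
a <- rnorm(3)
set.seed(123)
b <- rnorm(3)
identical(a, b)  # TRUE
```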
@@ -82,9 +83,15 @@ pred <- make.predictorMatrix(boys)
pred[c("hgt", "wgt"), "bmi"] <- 0
pred
```
and we run the `mice` algorithm again with the new predictor matrix (we still 'borrow' the imputation methods object `meth` from before)

The predictor matrix can be visualized with
```{r}
plot_pred(pred, method = meth)
```

We run the `mice` algorithm again with the new predictor matrix (we still 'borrow' the imputation methods object `meth` from before)
```{r}
imp <-mice(boys,
imp <- mice(boys,
meth = meth,
pred = pred,
print = FALSE,
@@ -110,18 +117,18 @@ Quite often people are suggesting that using the average imputed dataset - so ta

To demonstrate this, let's create the averaged data set and exclude the non-numerical columns:
```{r warning=FALSE}
ave <- imp %>%
mice::complete("long") %>%
group_by(.id) %>%
summarise_all(.funs = mean) %>%
ave <- imp |>
mice::complete("long") |>
group_by(.id) |>
summarise_all(.funs = mean) |>
select(-.id, -.imp, -phb, -gen, -reg)

head(ave)
```
If we now calculate Pearson's correlation, rounded to two digits:
```{r}
cor.wrong <- ave %>%
cor() %>%
cor.wrong <- ave |>
cor() |>
round(digits = 2)
```
we obtain:
@@ -140,10 +147,10 @@ fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)
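Only `fisher.backtrans` is visible in the diff context above; assuming `fisher.trans` is the standard Fisher z-transform, the two functions form a round trip, which can be sketched and checked as:

```r
# Assumed definition of fisher.trans (the standard Fisher z-transform);
# fisher.backtrans is its inverse (equivalent to tanh)
fisher.trans <- function(x) 0.5 * log((1 + x) / (1 - x))
fisher.backtrans <- function(x) (exp(2 * x) - 1) / (exp(2 * x) + 1)

r <- 0.6
all.equal(fisher.backtrans(fisher.trans(r)), r)  # TRUE
```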

Now, to calculate the correlation on the imputed data
```{r}
cor <- imp %>%
mice::complete("all") %>%
map(select, -phb, -gen, -reg) %>%
map(stats::cor) %>%
cor <- imp |>
mice::complete("all") |>
map(select, -phb, -gen, -reg) |>
map(stats::cor) |>
map(fisher.trans)
cor
```
@@ -199,7 +206,7 @@ So, please stay away from averaging the imputed data sets. Instead, use the corr
- `lm(age ~ wgt + hgt)`

```{r}
fit1.lm <- imp %>%
fit1.lm <- imp |>
with(lm(age ~ wgt + hgt))

est1.lm <- pool(fit1.lm)
Expand All @@ -213,7 +220,7 @@ summary(est1.lm)

- `lm(age ~ wgt + hgt + I(hgt^2))`
```{r}
fit2.lm <- imp %>%
fit2.lm <- imp |>
with(lm(age ~ wgt + hgt + I(hgt^2)))

est2.lm <- pool(fit2.lm)
@@ -255,7 +262,7 @@ For large sample size, $D_3$ is equivalent to $D_1$, however, $D_3$ does not req

---

**7. Fit a stepwise linear model to predict `hgt` seperately to each of the imputed data sets.**
**7. Fit a stepwise linear model to predict `hgt` separately to each of the imputed data sets.**

We can use the `step()` function in `R` to select formula-based models. We start by specifying the scope of the evaluations:
```{r}
Expand Down Expand Up @@ -295,10 +302,10 @@ D1(fit.gen, fit.nogen)
```
With a p-value of `.059` we could conclude that `gen` is not strictly needed in this model. We might also investigate the BIC on the separate imputed sets and compare those across (not within) models. We draw the same conclusion from this evaluation - the BIC is lower for the model without `gen` - although not by much. But then again, the p-value indicated the same trend.
```{r}
BIC.gen <- fit.gen$analyses %>%
BIC.gen <- fit.gen$analyses |>
sapply(BIC)

BIC.nogen <- fit.nogen$analyses %>%
BIC.nogen <- fit.nogen$analyses |>
sapply(BIC)
```

@@ -320,10 +327,10 @@ Please note that we can compare the BIC only over the models, not over the impute

To study the means for `bmi` conditionally on `reg` to get a picture of what the differences are:
```{r}
imp %>%
mice::complete("long") %>%
select(reg, bmi) %>%
group_by(reg) %>%
imp |>
mice::complete("long") |>
select(reg, bmi) |>
group_by(reg) |>
summarise_all(.funs = mean)
```

@@ -335,10 +342,10 @@ se <- function(x){
```
and then calculate the summary again
```{r}
imp %>%
mice::complete("long") %>%
select(reg, bmi) %>%
group_by(reg) %>%
imp |>
mice::complete("long") |>
select(reg, bmi) |>
group_by(reg) |>
summarise_all(list(~ mean(.), ~se(.)))
```
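The helper `se()` is only partially visible in the diff above; a plausible completion (assuming the usual standard error of the mean) is:

```r
# Assumed completion of the truncated se() helper: standard error of the mean
se <- function(x) {
  sd(x) / sqrt(length(x))
}

se(c(2, 4, 4, 4, 5, 5, 7, 9))  # about 0.756
```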

@@ -352,14 +359,14 @@ imp %>%

This is best done with an ANOVA. To do this correctly, we can apply the following workflow. First, we fit an intercept only model
```{r}
fit.empty <- imp %>%
mice::complete("all") %>%
fit.empty <- imp |>
mice::complete("all") |>
map(lm, formula = bmi ~ 1)
```
and then we fit the model with `reg` as a predictor
```{r}
fit.reg <- imp %>%
mice::complete("all") %>%
fit.reg <- imp |>
mice::complete("all") |>
map(lm, formula = bmi ~ 1 + reg)
```
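The two nested model lists can then be compared with a pooled Wald test; one self-contained sketch, assuming default `mice()` settings on `boys` (unlike the tailored `meth`/`pred` setup earlier) and `mice::as.mira()` to wrap the fitted lists:

```r
library(mice)
library(purrr)

# Self-contained sketch: default imputation settings assumed for boys
imp <- mice(boys, m = 5, print = FALSE, seed = 123)

fit.empty <- imp |> mice::complete("all") |> map(lm, formula = bmi ~ 1)
fit.reg   <- imp |> mice::complete("all") |> map(lm, formula = bmi ~ 1 + reg)

# Wrap the per-imputation fits as mira objects and compare with a pooled Wald test
D1(as.mira(fit.reg), as.mira(fit.empty))
```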

@@ -393,6 +400,6 @@ And find that indeed the overall (i.e. pooled) effect over the imputations is al
---


**- End of practical**
**- End of vignette**

---
Loading