-
Notifications
You must be signed in to change notification settings - Fork 3
/
Copy pathSEM.qmd
603 lines (372 loc) · 28.3 KB
/
SEM.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
---
title: "Ch. 5: Structural Equation Models"
format: html
author: Moritz Fischer
execute:
cache: true
project-type: website
---
*Reading/working time: ~50 min.*
```{r install packages, message=FALSE, warning=FALSE}
#install.packages(c("lavaan", "ggplot2", "Rfast", "MBESS", "future.apply"), dependencies = TRUE)
library("lavaan")
library("Rfast")
library("ggplot2")
library("dplyr")
library("MBESS")
library(future.apply)
# this initializes parallel processing, for faster code execution
# By default, it uses all available cores
plan(multisession)
```
In this chapter, we will focus on some rather simple Structural Equation Models (SEM). The goal is to illustrate how simulations can be used to estimate statistical power to detect a given effect in a SEM. In the context of SEMs, the focal effect may be for instance a fit index (e.g., Chi-Square, Root Mean Square Error of Approximation, etc.) or a model coefficient (e.g., a regression coefficient for the association between two latent factors). In this chapter, we only focus on the latter, that is power analyses for regression coefficients in the context of SEM. Please see the bonus chapter titled "[Power analysis for fit indices in SEM](SEM_fit_index.qmd)" if you're interested in power analyses for fit indices.
::: {.callout-note}
In this chapter, we'll use the parallelization techniques described in the Chapter ["Bonus: Optimizing R code for speed"](optimizing_code.qmd), otherwise the simulations are too slow. If you wonder about the unknown commands in the code, please read the bonus chapter to learn how to considerably speed up your code! But nonetheless, running some of the chunks in this chapter will require quite some time. We therefore decided to use 500 iterations per simulation only in this chapter; please keep in mind that you should use more iterations in order to obtain more precise estimates!
:::
# A simulation-based power analysis for a single regression coefficient between latent variables
Let's consider the following example: We are planning to conduct a study investigating whether people's openness to experience (as conceptualized in the big five personality traits) relates negatively to generalized prejudice, that is a latent variable comprising prejudice towards different social groups (e.g., women, foreigners, homosexuals, disabled people). We could plan to scrutinize this prediction with a SEM in which both openness to experience and generalized prejudice are modeled as latent variables.
::: {.callout-note}
Please note that the predictions used as examples in this chapter are not necessarily theoretically meaningful hypotheses, we merely use them to illustrate the mechanics of simulation-based power analyses in the context of SEM.
:::
## Let's get some real data as starting point
Just like for any other simulation-based power analysis, we first need to come up with plausible estimates of the distribution of the (manifest) variables. For the sake of simplicity, let's assume that there is a published study that measured manifestations of our two latent variables and that the corresponding data set is publicly available. For the purpose of this tutorial, we will draw on a publication by Bergh et al. ([2016](https://www.researchgate.net/publication/306939541_Is_group_membership_necessary_for_understanding_generalized_prejudice_A_re-evaluation_of_why_prejudices_are_interrelated)) and the corresponding data set which has been made accessible as part of the `MPsychoR` package. Let's take a look at this data set.
```{r load data, message=FALSE}
#install.packages("MPsychoR")
library(MPsychoR)
data("Bergh")
attach(Bergh)
#let's take a look
head(Bergh)
tail(Bergh)
```
This data set comprises `r ncol(Bergh)` variables measured in `r nrow(Bergh)` participants. For now, we will focus on the following measured variables:
- `EP` is a continuous variable measuring ethnic prejudice.
- `SP` is a continuous variable measuring sexism.
- `HP` is a continuous variable measuring sexual prejudice towards gays and lesbians.
- `DP` is a continuous variable measuring prejudice toward people with disabilities.
- `O1`, `O2`, and `O3` are three items measuring openness to experience.
To get an impression of this data, we look at the correlations of the variables we're interested in.
```{r correlations SEM}
cor(cbind(EP, SP, HP, DP, O1, O2, O3)) |> round(2)
```
As we have discussed in the previous chapters, the starting point of every simulation-based power analysis is to specify the population parameters of the variables of interest. In our example, we can estimate the population parameters from the study by Bergh et al. (2016). We start by calculating the means of the variables, rounding them generously, and storing them in a vector called `means_vector`.
```{r mean vector SEM}
#store means
means_vector <- c(mean(EP), mean(SP), mean(HP), mean(DP), mean(O1), mean(O2), mean(O3)) |> round(2)
#Let's take a look
means_vector
```
We also need the variance-covariance matrix of our variables in order to simulate data. Luckily, we can estimate this from the Bergh et al. data as well. There are two ways to do this. First, we can use the `cov` function to obtain the variance-covariance matrix. This matrix incorporates the variance of each variable on the diagonal and the covariances in the remaining cells.
```{r cov matrix SEM}
#store covariances
cov_mat <- cov(cbind(EP, SP, HP, DP, O1, O2, O3)) |> round(2)
#Let's take a look
cov_mat
```
This works well as long as we have a data set (e.g., from a pilot study or published work) to estimate the variances and covariances. In other cases, however, we might not have access to such a data set. In this case, we might only have a correlation table that was provided in a published paper. But that's no problem either, as we can transform the correlations and standard deviations of the variables of interest into a variance-covariance matrix. The following chunk shows how this works by using the `cor2cov` function from the `MBESS` package.
```{r sd vector and correlations SEM}
#store correlation matrix
cor_mat <- cor(cbind(EP, SP, HP, DP, O1, O2, O3))
#store standard deviations
sd_vector <- c(sd(EP), sd(SP), sd(HP), sd(DP), sd(O1), sd(O2), sd(O3))
#transform correlations and standard deviations into variance-covariance matrix
cov_mat2 <- MBESS::cor2cov(cor.mat = cor_mat, sd = sd_vector) |> as.data.frame() |> round(2)
#Let's take a look
cov_mat2
```
Let's do a plausibility check: Did the two ways to estimate the variance-covariance matrix lead to the same results?
```{r plausibility check SEM}
cov_mat == cov_mat2
```
Indeed, this worked! Both procedures lead to the exact same variance-covariance matrix. Now that we have an approximation of the variance-covariance matrix, we use the `rmvnorm` function from the `Rfast` package to simulate data from a multivariate normal distribution. The following code simulates `n = 50` observations from the specified population.
```{r simulate data SEM}
#Set seed to make results reproducible
set.seed(21364)
#simulate data
my_first_simulated_data <- Rfast::rmvnorm(n = 50, mu=means_vector, sigma = cov_mat) |> as.data.frame()
#Let's take a look
head(my_first_simulated_data)
```
We could now fit a SEM to this simulated data set and check whether the regression coefficient modelling the association between openness to experience and generalized prejudice is significant at an $\alpha$-level of .005. We will work with the `lavaan` package to fit SEMs.
```{r analyze data SEM}
#specify SEM
model_sem <- "generalized_prejudice =~ EP + DP + SP + HP
openness =~ O1 + O2 + O3
generalized_prejudice ~ openness"
#fit the SEM to the simulated data set
fit_sem <- sem(model_sem, data = my_first_simulated_data)
#display the results
summary(fit_sem)
```
The results show that in this case, the regression coefficient is `r lavaan::parameterestimates(fit_sem)[8,]$est |> round(2)` which is significant with p = `r lavaan::parameterestimates(fit_sem)[8,]$pvalue |> round(3)`. But, actually, it is not our primary interest to see whether this particular simulated data set results in a significant regression coefficient. Rather, we want to know how many of a theoretically infinite number of simulations yield a significant p-value of the focal regression coefficient. Thus, as in the previous chapters, we now repeatedly simulate data sets of a certain size (say, 50 observations) from the specified population and store the results of the focal test (here: the p-value of the regression coefficient) in a vector called `p_values`.
```{r multiple iterations SEM, warning=FALSE}
#Set seed to make results reproducible
set.seed(21364)
#let's do 500 iterations
iterations <- 500
#prepare an empty NA vector with 500 slots
p_values <- rep(NA, iterations)
#sample size per iteration
n <- 50
#simulate data
for(i in 1:iterations){
simulated_data <- Rfast::rmvnorm(n = n, mu = means_vector, sigma = cov_mat) |> as.data.frame()
fit_sem_simulated <- sem(model_sem, data = simulated_data)
p_values[i] <- parameterestimates(fit_sem_simulated)[8,]$pvalue
}
```
How many of our 500 virtual samples would have found a significant p-value (i.e., p < .005)?
```{r results SEM}
#frequency table
table(p_values < .005)
#percentage of significant results
sum(p_values < .005)/iterations*100
```
Only `r round((sum(p_values < .005)*100/iterations),2)`% of samples with the same size of $n=50$ result in a significant p-value. We conclude that $n=50$ observations seems to be insufficient, as the power with these parameters is lower than 80%.
## Sample size planning: Find the necessary sample size
But how many observations do we need to find the presumed effect with a power of 80%? Like before, we can now systematically vary certain parameters (e.g., sample size) of our simulation and see how that affects power. We could, for example, vary the sample size in a range from 30 to 200. Running these simulations typically requires quite some computing time.
```{r power analysis SEM, warning=FALSE}
#Set seed to make results reproducible
set.seed(21364)
#test ns between 30 and 200
ns_sem <- seq(30, 200, by=10)
#prepare empty vector to store results
result_sem <- data.frame()
#set number of iterations
iterations_sem <- 500
#write function
sim_sem <- function(n, model, mu, sigma) {
simulated_data <- Rfast::rmvnorm(n = n, mu = mu, sigma = sigma) |> as.data.frame()
fit_sem_simulated <- sem(model, data = simulated_data)
p_value_sem <- parameterestimates(fit_sem_simulated)[8,]$pvalue
return(p_value_sem)
}
#replicate function with varying ns
for (n in ns_sem) {
p_values_sem <- future_replicate(iterations_sem, sim_sem(n = n, model = model_sem, mu = means_vector, sigma = cov_mat), future.seed=TRUE)
result_sem <- rbind(result_sem, data.frame(
n = n,
power = sum(p_values_sem < .005)/iterations_sem)
)
#The following line of code can be used to track the progress of the simulations
#This can be helpful for simulations with a high number of iterations and/or a large parameter space which require a lot of time
#I have deactivated this here; to enable it, just remove the "#" sign at the beginning of the next line
#message(paste("Progress info: Simulations completed for n =", n))
}
```
Let's plot the results:
```{r plot power curve SEM, warning=FALSE}
ggplot(result_sem, aes(x=n, y=power)) + geom_point() + geom_line() + scale_x_continuous(n.breaks = 18, limits = c(30,200)) + scale_y_continuous(n.breaks = 10, limits = c(0,1)) + geom_hline(yintercept= 0.8, color = "red")
```
This graph suggests that we need a sample size of approximately 50 participants to reach a power of 80% with the given population estimates. That's all it takes to run a power analysis for a SEM!
In the following two sub-paragraphs, we would like to present two alternatives and/or extensions to this first way of doing a power analysis for a SEM. First, we would like to present an alternative way to simulate data for SEMs, i.e., by using a built-in function in the `lavaan` package. Second, we would like to show how a "safeguard" approach to power analysis can be used within the context of SEM.
## Using the lavaan syntax to simulate data
In our opening example in this chapter, we used the `rmvnorm` function from the `Rfast`package to simulate data based on the means of the manifest variables as well as their variance-covariance matrix. An alternative to this procedure is to use a built-in function in the `lavaan` package, that is, the `simulateData()` function. The main idea here is to provide this function with a lavaan model that specifies all relevant population parameters and then to use this function to directly simulate data.
More specifically, we need to incorporate the factor loadings, regression coefficients and (residual) variances of all latent and manifest variables in the lavaan syntax. As these parameters are hardly ever known without a previous study, we will again draw on the results from the data by Bergh et al. (2016). Let's again take a look at the results of our SEM when applying it to this data set.
```{r fit Bergh}
#fit the SEM to the pilot data set
fit_bergh <- sem(model_sem, data = Bergh)
#display the results
summary(fit_bergh)
```
In order to use the `simulateData()` function, we can take the estimates from this previous study and plug them into the lavaan syntax in the following chunk.
```{r specify lavaan syntax}
#Set seed to make results reproducible
set.seed(21364)
#specify SEM
model_fully_specified <-
"generalized_prejudice =~ 1*EP + 0.71*DP + 0.91*SP + 1*HP
openness =~ 1*O1 + 0.93*O2 + 1.14*O3
generalized_prejudice ~ -0.77*openness
generalized_prejudice ~~ 0.19*generalized_prejudice
openness ~~ 0.16*openness
EP ~~ 0.21*EP
DP ~~ 0.14*DP
SP ~~ 0.23*SP
HP ~~ 2.12*HP
O1 ~~ 0.07*O1
O2 ~~ 0.08*O2
O3 ~~ 0.05*O3
"
#lets try this
sim_lavaan <- simulateData(model_fully_specified, sample.nobs=100)
head(sim_lavaan)
```
The next step is to integrate this code into our first simulation-based power analysis in this chapter, that is, to replace the data simulation process using the `rmvnorm` function from the `Rfast` package with this new method using `simulateData()` from the `lavaan` package. To this end, I am adapting the `power analysis SEM` chunk from above accordingly.
```{r simulateData, warning=FALSE}
#Set seed to make results reproducible
set.seed(21364)
#test ns between 30 and 200
ns_sem <- seq(30, 200, by=10)
#prepare empty vector to store results
result_sem <- data.frame()
#set number of iterations
iterations_sem <- 500
#write function
sim_sem_lavaan <- function(n, model) {
sim_lavaan <- simulateData(model = model, sample.nobs=n)
fit_sem_simulated <- sem(model_sem, data = sim_lavaan)
p_value_sem <- parameterestimates(fit_sem_simulated)[8,]$pvalue
return(p_value_sem)
}
#replicate function with varying ns
for (n in ns_sem) {
p_values_sem <- future_replicate(iterations_sem, sim_sem_lavaan(n = n, model = model_fully_specified), future.seed=TRUE)
result_sem <- rbind(result_sem, data.frame(
n = n,
power = sum(p_values_sem < .005, na.rm = TRUE)/iterations_sem)
)
#The following line of code can be used to track the progress of the simulations
#This can be helpful for simulations with a high number of iterations and/or a large parameter space which require a lot of time
#I have deactivated this here; to enable it, just remove the "#" sign at the beginning of the next line
#message(paste("Progress info: Simulations completed for n =", n))
}
```
Let's plot this again.
```{r plot power curve SEM lavaan, warning=FALSE}
ggplot(result_sem, aes(x=n, y=power)) + geom_point() + geom_line() + scale_x_continuous(n.breaks = 18, limits = c(30,200)) + scale_y_continuous(n.breaks = 10, limits = c(0,1)) + geom_hline(yintercept= 0.8, color = "red")
```
Here, we conclude that approx. 55 participants would be needed to achieve 80% power. The slight difference compared to the previous power analysis should be explained by the fact that we rounded the numbers that define our statistical populations and that we only used 500 Monte Carlo iterations -- these differences should decrease with an increasing number of iterations.
::: {.callout-tip}
Wang and Rhemtulla ([2021](https://doi.org/10.1177/2515245920918253)) developed a shiny app that can do power analyses for SEMs in a similar fashion, but that additionally provides a point-and-click interface. You can use it here, for instance, to replicate the results of this simulation-based power analysis: [https://yilinandrewang.shinyapps.io/pwrSEM/](https://yilinandrewang.shinyapps.io/pwrSEM/).
:::
## A safeguard power approach for SEMs
Of note, the two power analyses for our SEM we have conducted so far used the observed effect size from the Bergh et al. (2016) data set as an estimate of the "true" effect size. But, as this point estimate may be imprecise (e.g., because of publication bias), it seems reasonable to use a more conservative estimate of the true effect size. One more conservative approach in this context is the safeguard power approach (Perugini et al., [2014](https://doi.org/10.1177/1745691614528519)), which we have already applied in Chapter 1 ([Linear Model 1: A single dichotomous predictor](LM1.qmd)).
Basically, all need to do in order to account for variability of observed effect sizes is to calculate a 60%-confidence interval around the point estimate of the observed effect size from our pilot data and to use the more conservative bound of this confidence interval (here: the upper bound) as our new effect size estimate. This can be easily done with the `parameterestimates` function from the `lavaan` package which takes the level of the confidence interval as an input parameter. Let's use this function on our object `fit_bergh` which stores the results of our SEM in the `Bergh` data set.
```{r safeguard estimation}
parameterestimates(fit_bergh, level = .60)
```
This output shows that the upper bound of the 60% confidence interval around the focal regression coefficient is `r lavaan::parameterestimates(fit_bergh, level = .60)[8,]$ci.upper |> round(2)`. We can now use this as our new and more conservative effect size estimate. We can for example insert this value into our simulation using the lavaan syntax. For this purpose, we copy the chunk from above and simply replace the previous effect size estimate (`r lavaan::parameterestimates(fit_bergh)[8,]$est |> round(2)`) with the new estimate (`r lavaan::parameterestimates(fit_bergh, level = .60)[8,]$ci.upper |> round(2)`), while keeping all other parameters that define this data set.
```{r lavaan model safeguard power}
#Set seed to make results reproducible
set.seed(21364)
#specify SEM
model_fully_specified_safeguard <-
"generalized_prejudice =~ 1*EP + 0.71*DP + 0.91*SP + 1*HP
openness =~ 1*O1 + 0.93*O2 + 1.14*O3
generalized_prejudice ~ -0.72*openness
generalized_prejudice ~~ 0.19*generalized_prejudice
openness ~~ 0.16*openness
EP ~~ 0.21*EP
DP ~~ 0.14*DP
SP ~~ 0.23*SP
HP ~~ 2.12*HP
O1 ~~ 0.07*O1
O2 ~~ 0.08*O2
O3 ~~ 0.05*O3
"
#lets try this
sim_lavaan_safeguard <- simulateData(model_fully_specified_safeguard, sample.nobs=100)
head(sim_lavaan_safeguard)
```
Now, everything is ready for the actual safeguard power analysis. We can re-use the `sim_sem_lavaan` function we have defined above. Let's see what we get here!
```{r safeguard power analysis, warning=FALSE}
#Set seed to make results reproducible
set.seed(21364)
#prepare empty vector to store results
result_sem_safeguard <- data.frame()
#replicate function with varying ns
for (n in ns_sem) {
p_values_sem_safeguard <- future_replicate(iterations_sem, sim_sem_lavaan(n = n, model = model_fully_specified_safeguard), future.seed=TRUE)
result_sem_safeguard <- rbind(result_sem_safeguard, data.frame(
n = n,
power = sum(p_values_sem_safeguard < .005, na.rm = TRUE)/iterations_sem)
)
#The following line of code can be used to track the progress of the simulations
#This can be helpful for simulations with a high number of iterations and/or a large parameter space which require a lot of time
#I have deactivated this here; to enable it, just remove the "#" sign at the beginning of the next line
#message(paste("Progress info: Simulations completed for n =", n))
}
ggplot(result_sem_safeguard, aes(x=n, y=power)) + geom_point() + geom_line() + scale_x_continuous(n.breaks = 18, limits = c(30,200)) + scale_y_continuous(n.breaks = 10, limits = c(0,1)) + geom_hline(yintercept= 0.8, color = "red")
```
This safeguard power analysis yields a required sample size of ca. 63 participants.
::: {.callout-note}
In addition to this safeguard power approach, we would also have liked to derive a smallest effect size of interest (SESOI). However, no prior studies on SESOI in the context of personality/stereotypes were available and the measures/response scales used by Bergh et al. (2016) were only vaguely reported in their manuscript, thereby making it difficult to derive a meaningful SESOI. We therefore only report a safeguard approach but no SESOI approach here.
:::
# A simulation-based power analysis for a mediation model with latent variables
Sometimes, researchers not only wish to investigate whether and how two (latent) variables related to each other, but also whether the association between two (manifest or latent) variables is mediated by a third variable. We will run a power analysis for such a latent mediation model, investigating whether gender affects generalized prejudice (fully or partially) through openness to experience, while using the Bergh et al. (2016) data set as a pilot study. We can repeat the same steps as in our previous power analysis, while incorporating gender into our analysis. Specifically, we will follow these steps:
1. find plausible estimates of the population parameters
2. specify the statistical model
3. simulate data from this population
4. compute the index of interest (e.g., the p-value) and store the results
5. repeat steps 2) and 3) multiple times
6. count how many samples would have detected the specified effect (i.e., compute the statistical power)
7. vary your simulation parameters until the desired level of power (e.g., 80%) is achieved
We first draw on the Bergh et al. (2016) data set to estimate the means and the variance-covariance matrix. In this data set, gender is a factor with two levels, male and female. We first need to transform this categorical variable into an integer variable, for instance coding "male" with 0 and "female" with 1.
```{r means and cov mediation}
Bergh_int <- Bergh
Bergh_int$gender <- ifelse(Bergh_int$gender == "male", 0, ifelse(Bergh_int$gender == "female", 1, NA))
attach(Bergh_int)
#store means
means_mediation <- c(mean(gender), mean(EP), mean(SP), mean(HP), mean(DP), mean(O1), mean(O2), mean(O3)) |> round(2)
#store covarainces
cov_mediation <- cov(cbind(gender, EP, SP, HP, DP, O1, O2, O3)) |> round(2)
```
::: {.callout-tip}
If we were interested in a mediation model with only manifest variables, we could also use a useful shiny app developed by Schoemann et al. (2017), which can be accessed via [https://schoemanna.shinyapps.io/mc_power_med/](https://schoemanna.shinyapps.io/mc_power_med/). This app currently enables power analyses for some manifest mediation models: mediation with (i) one mediator, (ii) two parallel mediators, (iii) two serial mediators, and (iv) three parallel mediators. But as we want to incorporate `generalized prejudice` and `openness to experience` as latent variables, we need to write a simulation-based power analyses ourselves.
:::
Now, we specify the statistical mediation model in `lavaan`.
```{r specify mediation model}
#specify mediation model
model_mediation <- '
# measurement model
generalized_prejudice =~ EP + DP + SP + HP
openness =~ O1 + O2 + O3
# direct effect
generalized_prejudice ~ c*gender
# mediator
openness ~ a*gender
generalized_prejudice ~ b*openness
# indirect effect (a*b)
ab := a*b
# total effect
total := c + (a*b)
'
```
To verify that this model syntax works properly, we can fit this model to the data set provided by Bergh et al. (2016).
```{r test mediation model with the pilot data}
#fit the SEM to the simulated data set
fit_mediation <- sem(model_mediation, data = Bergh_int)
#display the results
summary(fit_mediation)
```
Now, everything is set up to run the actual power analysis. In the following chunk, we repeatedly simulate data from the specified population and store the p-value of the indirect effect while varying the sample size in a range from 500 to 1500.
```{r simulate data mediation, cache=TRUE, warning=FALSE}
#Set seed to make results reproducible
set.seed(21364)
#test ns between 100 and 1500
ns_mediation <- seq(500, 1500, by=50)
#prepare empty vector to store results
result_mediation <- data.frame()
#iterations
iterations_mediation <- 500
#write function
sim_mediation <- function(n, model, mu, sigma) {
simulated_data_mediation <- Rfast::rmvnorm(n = n, mu = mu, sigma = sigma) |> as.data.frame()
fit_mediation_simulated <- sem(model, data = simulated_data_mediation)
p_value_mediation <- parameterestimates(fit_mediation_simulated)[21,]$pvalue
return(p_value_mediation)
}
#replicate function with varying ns
for (n in ns_mediation) {
p_values_mediation <- future_replicate(iterations_mediation, sim_mediation(n = n, model = model_mediation, mu = means_mediation, sigma = cov_mediation), future.seed=TRUE)
result_mediation <- rbind(result_mediation, data.frame(
n = n,
power = sum(p_values_mediation < .005)/iterations_mediation)
)
#The following line of code can be used to track the progress of the simulations
#This can be helpful for simulations with a high number of iterations and/or a large parameter space which require a lot of time
#I have deactivated this here; to enable it, just remove the "#" sign at the beginning of the next line
#message(paste("Progress info: Simulations completed for n =", n))
}
```
Let's plot the results:
```{r plot power curve mediation}
ggplot(result_mediation, aes(x=n, y=power)) + geom_point() + geom_line() + scale_y_continuous(n.breaks = 10) + scale_x_continuous(n.breaks = 20) + geom_hline(yintercept= 0.8, color = "red")
```
This shows that roughly 1,300 to 1,400 participants will be needed to obtain sufficient power under the assumptions we specified. To achieve a more precise estimate, just increase the number of iterations (and get a cup of coffee while you wait for the results 😅).
# References
Bergh, R., Akrami, N., Sidanius, J., & Sibley, C. G. (2016). Is group membership necessary for understanding generalized prejudice? A re-evaluation of why prejudices are interrelated. Journal of Personality and Social Psychology, 111(3), 367–395. [https://doi.org/10.1037/pspi0000064](https://www.researchgate.net/publication/306939541_Is_group_membership_necessary_for_understanding_generalized_prejudice_A_re-evaluation_of_why_prejudices_are_interrelated)
Perugini, M., Gallucci, M., & Costantini, G. (2014). Safeguard power as a protection against imprecise power estimates. Perspectives on Psychological Science, 9(3), 319--332. <https://doi.org/10.1177/1745691614528519>
Schoemann, A. M., Boulton, A. J., & Short, S. D. (2017). Determining power and sample size for simple and complex mediation models. Social Psychological and Personality Science, 8(4), 379–386. [https://doi.org/10.1177/1948550617715068](https://www.researchgate.net/publication/317631067_Determining_Power_and_Sample_Size_for_Simple_and_Complex_Mediation_Models)
Wang, Y. A., & Rhemtulla, M. (2021). Power analysis for parameter estimation in structural equation modeling: A discussion and tutorial. Advances in Methods and Practices in Psychological Science, 4(1), 1–17. [https://doi.org/10.1177/2515245920918253](https://doi.org/10.1177/2515245920918253)