-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathHierarchical_LME_models.qmd
1016 lines (688 loc) · 35.7 KB
/
Hierarchical_LME_models.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
---
title: "Hierarchical models (LMEs)"
format: html
editor: visual
---
# Introduction
Experimental data often contain 'clusters'; these are natural groups (e.g. testing sites, subjects, countries) where observations within each cluster tend to be more similar to one another than those in other clusters. Ignoring this grouped structure of the data violates the assumption of independent observations of simple linear regression, which can lead to biased estimates and unreliable statistical inferences.
This week we will explore hierarchical linear models using data from 10 different schools. The goal is to understand how **study hours** affect **test scores**, accounting for both school-specific differences and overall trends. We will fit models using three approaches: **complete pooling, no pooling**, and **hierarchical modeling** with **random intercepts** and **slopes**.
## Simulating Hierarchical School Data
We'll simulate test scores from 10 different schools, where each school has its own intercept (baseline test score) and slope (effect of study hours on test scores). Each student's test score is influenced by how many hours they studied, with some noise added to the data.
```{r}
# Load necessary libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(lme4)
library(tibble)
library(truncnorm)
library(patchwork)
# Set seed for reproducibility
set.seed(123)
# Parameters
n_schools <- 10 # Number of schools
n_students <- 100 # Students per school
mu_study_hours <- seq(8, 2, length.out = n_schools) # Schools with higher average study hours
mu_test_scores <- seq(50, 120, length.out = n_schools) # Schools with higher average test scores
sigma_beta_0 <- 10 # Std dev of intercepts within schools
mu_beta_1 <- 3 # Average slope for study hours
sigma_beta_1 <- 1 # Std dev of slopes (to ensure positive slopes within schools)
sigma_y <- 10 # Std dev of test scores
# Simulate data for each school
data <- data.frame()
for (i in 1:n_schools) {
# Simulate study hours for each student in the school
study_hours <- rtruncnorm(n_students, a = 0, b = Inf, mean = mu_study_hours[i], sd = 1)
# Simulate school-specific intercept (baseline test score)
beta_0 <- rnorm(1, mean = mu_test_scores[i], sd = sigma_beta_0)
# Simulate school-specific slope (effect of study hours)
beta_1 <- rnorm(1, mean = mu_beta_1, sd = sigma_beta_1)
# Generate test scores
test_scores <- beta_0 + beta_1 * study_hours + rnorm(n_students, sd = sigma_y)
# Combine school data into the main data frame
school_data <- data.frame(school = factor(i), study_hours = study_hours, test_scores = test_scores)
data <- rbind(data, school_data)
}
```
### Explanation
In this dataset:
- We have 10 schools, each with 100 students.
- The intercept (`beta_0`) for each school represents the baseline test score (with 0 study hours), varying by school.
- The slope (`beta_1`) represents the effect of study hours on test scores, which may differ across schools.
- Each student's test score is influenced by their study hours and school-specific factors, with some added noise to represent unobserved variability.
## Complete Pooling Model
In the complete pooling model, we assume that all schools have the same intercept and slope, effectively ignoring the hierarchical structure.
$$test\_scores_{i}=β_0+β_1study\_hours_{i}+ϵ_{i} $$
where,
- $β_0$ is the common or **fixed intercept** for all schools.
- $β_1$ is the common or **fixed slope** for all schools.
- $ϵ_i∼N(0,σ^2)$ represents the residual error.
**Assumptions of the complete pooling model:**
1. The observations (test_scores) are normally distributed.
2. Observations are independent.
3. There is a linear relationship between the outcome and predictor variables.
We can already see that this model is inappropriate as assumption 2 is being violated: the test scores of students in the same school are not independent.
![](complete_pooling_model.png)
```{r}
complete_pooling_model_lm <- lm(test_scores ~ study_hours,
data = data)
summary(complete_pooling_model_lm)
# Extract the fixed effects (coefficients)
fixed_effects <- coef(complete_pooling_model_lm)
# Extract the effect of 'study_hours' on 'test_scores'
study_hours_effect <- fixed_effects["study_hours"]
# Visualize complete pooling model
ggplot(data, aes(x = study_hours, y = test_scores)) +
geom_point(alpha = 0.5, color = "blue") +
# Plot the regression line using the lm coefficients
geom_abline(intercept = fixed_effects["(Intercept)"],
slope = fixed_effects["study_hours"],
color = "red") +
labs(title = "Complete Pooling Model", x = "Study Hours", y = "Test Scores") +
theme_minimal() +
# Annotate the correlation using the slope of the regression line (study_hours_effect)
annotate("text", x = Inf, y = -Inf,
label = sprintf("Effect of Study Hours: %.2f", study_hours_effect),
hjust = 1.1, vjust = -0.5, color = "black", size = 5, fontface = "italic")
# Vizualize the relationship per school
ggplot(data, aes(x = study_hours, y = test_scores, color = school)) +
geom_point(alpha = 0.5) +
ggtitle("Relationship Between Study Hours and Test Scores")
```
### Explanation
The complete pooling model assumes all schools have the same relationship between study hours and test scores. The red line represents the single fitted linear relationship across all students, ignoring school-specific differences.
It is quite unexpected the there would be a negative relationship between study_hours and test_scores.
From the second plot we can begin to understand what is going on. The data is clustered in groups (i.e. schools), and therefore observations from different schools should not be pooled together. Let's see what happens when we fit a separate linear regression model for each school.
## No Pooling Model
The no pooling model fits a separate linear model for each school, treating the schools independently.
$$test\_scores_{ij}=β_{0j}+β_{1j}study\_hours_{ij}+ϵ_{ij}$$
where:
- $β_{0j}$ and $β_{1j}$ are the intercept and slope for school j.
- $ϵ_{ij}∼N(0,σ^2)$ represents the residual error.
**Assumptions of the no pooling model:**
1. The observations (test_scores) are normally distributed.
2. Every school's model is fit independently of the other schools; there are no common parameters between schools (except from the standard deviation which is the same for all schools).
3. There is a linear relationship between the outcome and predictor variables. ![](no_pooling_model.png)
```{r}
# Initialize a list to store the models
school_lm_models <- list()
# Get unique schools
unique_schools <- unique(data$school)
# Loop over each school
for (school in unique_schools) {
# Subset the data for the current school
school_data <- data[data$school == school, ]
# Run the linear regression for this school's data
model <- lm(test_scores ~ study_hours, data = school_data)
# Store the model in the list, using the school name as the key
school_lm_models[[school]] <- model
}
# Access the models for each school
school_lm_models[['School A']] # Example of accessing a specific school's model
# To view summaries for each model
for (school in unique_schools) {
cat("\n\nSummary for", school, ":\n")
print(summary(school_lm_models[[school]]))
}
# Plot the results
ggplot(data, aes(x = study_hours, y = test_scores, color = factor(school))) +
geom_point(alpha = 0.5) +
facet_wrap(~school) +
geom_smooth(method = "lm", se = FALSE) + # Adds a regression line without confidence interval shading
labs(title = "No Pooling LME Model", x = "Study Hours", y = "Test Scores") +
theme_minimal()
```
### Explanation
The no pooling model fits separate linear regressions for each school, capturing individual school trends. The plot shows distinct linear fits for each school, but without sharing any information between them.
Here we see that when we fit separate linear models for each school, there is a positive relationship between study_hours and test_scores, as expected. Compare this with the negative relationship we got from the overall model. This is called the **Simpson's paradox:** the phenomenon whereby a relationship seen in the overall model disappears or reverses when the data is divided into groups.
One problem with the no pooling model is that it cannot be generalized to a group outside of our sample. It assumes, in other words, that the relationship between study hours and test scores in one school can tell us nothing about this relationship in another school which is 'under-utilizing' the data.
### Note
Before we fit any other model, lets see what happens to our overall model when we add school as a co-variate.
```{r}
model_school_covariate <- lm(test_scores ~ study_hours + factor(school), data = data)
summary(model_school_covariate)
```
**Interpretation:**
**Intercept:** the average test scores of students in school 1 when study hours = 0.
**study_hours:** the effect of study_hours on test_scores when controlling for differences in mean test_scores across schools.
**schoolX**: the difference in average test_scores between school 1 and schoolX, when controlling for study hours.
Lets see what happens when we add school as an interaction term.
```{r}
model_school_interaction<- lm(test_scores ~ study_hours*factor(school), data = data)
summary(model_school_interaction)
```
**Interpretation:**
**Intercept:** the average test scores of students in school 1 when study hours = 0.
**study_hours:** the effect of study_hours on test_scores in school 1.
**schoolX**: the difference in baseline test_scores when study_hours = 0 between school 1 and school X.
**study_hours:schoolX** :the difference in the slope of the relationship between study_hours and test_scores between school 1 and school X.
##
## Hierarchical Model with Random Intercepts and Slopes (Partial Pooling model)
Now we'll fit a hierarchical LME model that allows each school to have its own intercept and slope, but the group-level parameters are drawn from a common distribution. This model strikes a balance between the complete pooling and no pooling approaches.
$$test\_scores_{ij}=β_{0j}+β_{1j}study\_hours_{ij}+ϵ_{ij}$$
where:
- $β_{0j}∼N(μ_{β_0},σ_{β_0})$
- $β_{1j}∼N(μ_{β_1},σ_{β_1})$
- $ϵ_{ij}∼N(0,σ^2)$
#### **Or another way to think about it,**
$$test\_scores_{ij}=(β_0 + b_{0j})+(β_1+ b_{ij})study\_hours_{ij}+ϵ_{ij} $$
- $β_0$ is the common or **fixed intercept** for all schools.
- $b_{0j}$ is the **random intercept** for school *j*. It represents the deviation of the *j*-th school's intercept from $β_0$.
- $β_1$ is the common or **fixed slope** for all schools.
- $b_{1j}$ is the **random slope** for school *j*. It represents the deviation of the *j*-th school's slope from $β_1$.
- $ϵij∼N(0,σ^2)$ represents the residual error.
**Assumptions of the hierarchical model:**
1. The observations (test_scores) are normally distributed.
2. Every school's model deviates to some extent from the grand mean and from the mean effect of study_hours. This implies that there is some between-school variability in the individual-level intercept and slope adjustments by school.
3. There is a linear relationship between the outcome and predictor variables.
![](partial_pooling_model.png)
```{r}
# Fit hierarchical model with random intercepts and slopes using `lme4`
hierarchical_model <- lmer(
test_scores ~ study_hours + (1 + study_hours | school),
data = data
)
# Summary of the model
summary(hierarchical_model)
# Extract fixed effects
fixed_effects <- fixef(hierarchical_model)
fixed_slope <- fixed_effects[2]
fixed_intercept <- fixed_effects[1]
# Extract fitted values from the hierarchical model
data$partial_pooling_predicted <- predict(hierarchical_model, newdata = data)
# Example of creating fixed effect predictions
fixed_effects <- data.frame(
study_hours = seq(min(data$study_hours), max(data$study_hours), length.out = 100),
test_scores = fixed_intercept + fixed_slope * seq(min(data$study_hours), max(data$study_hours), length.out = 100)
)
# Visualization of hierarchical model with overall line
ggplot(data, aes(x = study_hours, y = test_scores, color = school)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = partial_pooling_predicted), size = 1) + # Shrinkage lines from random effects model
geom_abline(intercept=fixed_intercept, slope=fixed_slope, color="black", linetype="dashed", size=1) +
labs(title = "Hierarchical LME Model with Random Intercepts and Slopes",
x = "Study Hours",
y = "Test Scores") +
theme_minimal()+ annotate("text", x = Inf, y = Inf, label = paste("Fixed Effect Coefficient for Study Hours: ", round(fixed_slope, 2)),
hjust = 1.1, vjust = 1.5, size = 5, color = "black")
```
```{r}
# Separate linear models for each school (No Pooling)
data <- data %>%
group_by(school) %>%
mutate(no_pooling_predicted = predict(lm(test_scores ~ study_hours, data = cur_data())))
# No Pooling Plot: Separate regression lines for each school (no influence from other schools)
ggplot(data, aes(x = study_hours, y = test_scores, color = factor(school))) +
geom_point(alpha = 0.5) +
geom_line(aes(y = no_pooling_predicted), size = 1) +
# Independent school lines
labs(title = "No Pooling: Separate Regressions per School",
x = "Study Hours",
y = "Test Scores") +
facet_wrap(~school) +
theme_minimal()
# Extract fitted values from the hierarchical model
data$partial_pooling_predicted <- predict(hierarchical_model, newdata = data)
# Example of creating fixed effect predictions
fixed_effects <- data.frame(
study_hours = seq(min(data$study_hours), max(data$study_hours), length.out = 100),
test_scores = fixed_intercept + fixed_slope * seq(min(data$study_hours), max(data$study_hours), length.out = 100)
)
# Plot with fixed effects line
ggplot(data, aes(x = study_hours, y = test_scores, color = factor(school))) +
geom_point(alpha = 0.5) + # Observed data points
geom_line(aes(y = partial_pooling_predicted), size = 1) + # Shrinkage lines from random effects model
#geom_line(data = fixed_effects, aes(y = test_scores), linetype = "dashed", size = 1, color = "black") + # Fixed effect line
labs(title = "Partial Pooling: Random Effects Model",
x = "Study Hours",
y = "Test Scores") +
facet_wrap(~school) + # Separate plot per school
theme_minimal()
```
### Shrinkage
When we compare the individual plots for the no pooling and partial pooling models we can see that the individual slopes are slightly different. This is due to '**Shrinkage':** a phenomenon whereby individual-level estimates are shrunk towards the mean estimates over the entire data. The more extreme the data from a given subject relative to the mean estimates from the data set, or the less data we have from a given subject, the greater the shrinkage.
# Week 13: LME on time-series data
This week we will look at how to fit an LME on time-series data (aka repeated measures or longitudinal data).
Let's start by loading the Cherry blossom dataset from the bayesrules() package. This dataset contains the running times (in minutes) for 36 runners in the annual 10-mile Cherry Blossom race in Wadhington, D.C. The runners are in their 50s or 60s and have competed multiple times.
```{r}
#install.packages("bayesrules")
library(bayesrules)
# Load data
data(cherry_blossom_sample)
running <- cherry_blossom_sample %>%
select(runner, age, net)%>%
na.omit()
head(running)
ggplot(running, aes(x = runner, y = net)) +
geom_boxplot()
```
### Complete pooling model
In the complete pooling model, we assume that all runners have the same intercept and slope, effectively ignoring the hierarchical structure.
$$running\_times_{i}=β_0+β_1age_{i}+ϵ_{i} $$
where,
- $β_0$ is the common or **fixed intercept** for all runners.
- $β_1$ is the common or **fixed slope** for all runners.
- $ϵ_i∼N(0,σ^2)$ represents the residual error.
**Assumptions of the complete pooling model:**
1. The observations (running_times) are normally distributed.
2. Observations are independent.
3. There is a linear relationship between the outcome and predictor variables.
```{r}
# Complete pooling model: treating all runners as a single group
complete_pooling_model <- lm(net ~ age, data = running)
# Summary of the complete pooling model
summary(complete_pooling_model)
ggplot(running, aes(y = net, x = age)) +
geom_point()+
geom_smooth(method = "lm", se = FALSE, color= 'red')
```
**Interpretation**:
- **Intercept:** The average running time at age 0 (clearly not a useful estimate).
- **age:** The average difference in running times each year.
This estimate of the age coefficient β1 suggests that running times tend to increase by a mere 0.27 minutes for each year in age. Further, this is not a statistically significant increase. This is unlikely to be true, as our intuition tells us that as people age, they get slower at running.
### No Pooling Model
The no pooling model fits a separate linear model for each runner, treating the runners independently.
$$running\_times_{ij}=β_{0j}+β_{1j}age_{ij}+ϵ_{ij}$$
where:
- $β_{0j}$ and $β_{1j}$ are the intercept and slope for runner j.
- $ϵ_{ij}∼N(0,σ^2)$ represents the residual error.
**Assumptions of the no pooling model:**
1. The observations (running_times) are normally distributed.
2. Every runner's model is fit independently of the other runners; there are no common parameters between runners (except from the standard deviation which is the same for all runners).
3. There is a linear relationship between the outcome and predictor variables.
```{r}
# Initialize a list to store the modelsr
runner_lm_models <- list()
# Get unique runners
unique_runners <- unique(running$runner)
# Loop over each runner
for (runner in unique_runners) {
# Subset the data for the current runner
running_data <- running[running$runner == runner, ]
# Run the linear regression for this runner's data
model <- lm(net ~ age, data = running_data)
# Store the model in the list, using the runner name as the key
runner_lm_models[[runner]] <- model
}
# Access the models for each runner
runner_lm_models[['Runner 1']] # Example of accessing a specific runner's model
# To view summaries for each model
for (runner in unique_runners) {
cat("\n\nSummary for", runner, ":\n")
print(summary(runner_lm_models[[runner]]))
}
# Plot the results
ggplot(running, aes(x = age, y = net, color = factor(runner))) +
geom_point(alpha = 0.5) +
facet_wrap(~runner) +
geom_smooth(method = "lm", se = FALSE) + # Adds a regression line without confidence interval shading
labs(title = "No Pooling LME Model", x = "Age", y = "Running Time") +
theme_minimal()
```
**Interpretation**:
- **Intercept:** The predicted running time for each runner at age 0.
- **age:** The difference in running times each year for each runner.
**Problems with the no pooling method:**
1. We cannot reliably generalize to groups runners those in our sample.
2. It assumes that data from one runner cannot tell us anything useful about another runner and thus ignores potentially valuable information. This is especially problematic when we have a small number of observations per runner.
## Hierarchical Model with Random Intercepts and Slopes (Partial Pooling model)
$$running\_times_{ij}=β_{0j}+β_{1j}age_{ij}+ϵ_{ij}$$
where:
- $β_{0j}∼N(μ_{β_0},σ_{β_0})$
- $β_{1j}∼N(μ_{β_1},σ_{β_1})$
- $ϵ_{ij}∼N(0,σ^2)$
#### **Or another way to think about it,**
$$running\_times_{ij}=(β_0 + b_{0j})+(β_1+ b_{ij})age_{ij}+ϵ_{ij} $$
- $β_0$ is the common or **fixed intercept** for all runners.
- $b_{0j}$ is the **random intercept** for runner *j*. It represents the deviation of the *j*-th runner's intercept from $β_0$.
- $β_1$ is the common or **fixed slope** for all runners.
- $b_{1j}$ is the **random slope** for runner *j*. It represents the deviation of the *j*-th runner's slope from $β_1$.
- $ϵij∼N(0,σ^2)$ represents the residual error.
**Assumptions of the hierarchical model:**
1. The observations (running_times) are normally distributed.
2. Every runner's model deviates to some extent from the grand mean and from the mean effect of age. This implies that there is some between-runner variability in the individual-level intercept and slope adjustments by runner.
3. There is a linear relationship between the outcome and predictor variables.
```{r}
library(lmerTest)
# Mixed effects model: allowing individual random effects for intercepts and slopes
mixed_effects_model <- lmer(net ~ age + (1+ age| runner), data = running)
# Summary of the mixed effects model
summary(mixed_effects_model)
# Extract fitted values from the hierarchical model
running$partial_pooling_predicted <- predict(mixed_effects_model, newdata = running)
# Extract fixed effects
fixed_effects <- fixef(mixed_effects_model)
fixed_slope <- fixed_effects[2]
fixed_intercept <- fixed_effects[1]
# Visualization of hierarchical model with overall line
ggplot(running, aes(x = age, y = net, color = runner)) +
geom_point(alpha = 0.5) +
geom_line(aes(y = partial_pooling_predicted), size = 1) + # Shrinkage lines from random effects model
geom_abline(intercept=fixed_intercept, slope=fixed_slope, color="black", linetype="dashed", size=1) +
labs(title = "Hierarchical LME Model with Random Intercepts and Slopes",
x = "Age",
y = "Running times") +
theme_minimal()+ annotate("text", x = Inf, y = Inf, label = paste("Fixed Effect Coefficient for age: ", round(fixed_slope, 2)),
hjust = 1.1, vjust = 1.5, size = 5, color = "black")
```
**Interpretation:**
**Fixed effects**
- **Intercept:** The predicted average running time when age is 0.
- **age:** The average difference in running times each year.
**Random effects**
- **Groups (runner)**: The random effects are grouped by runner.
- **Intercept**: This indicates the degree to which each runner's baseline performance (when age is 0) varies from the overall average intercept.
- **age**: This tells us how much the effect of age on running times differs for each runner.
- **Corr**: This shows the correlation between the random intercepts and slopes for runners. A positive correlation suggests that runners with higher baseline running times (higher intercepts) tend to show less decline in performance with age (smaller negative slope), while a negative correlation would suggest the opposite.
**Residual Variance**
- **Residual**: This is the variance in running times that remains unexplained by the model. The standard deviation tells us the typical size of the residuals, indicating how much the observed running times deviate from the predicted running times on average.
#### Visualize the shirnkage
```{r}
# Plot the no pooled graphs
ggplot(running, aes(x = age, y = net, color = factor(runner))) +
geom_point(alpha = 0.5) +
facet_wrap(~runner) +
geom_smooth(method = "lm", se = FALSE) + # Adds a regression line without confidence interval shading
labs(title = "No Pooling Model", x = "Age", y = "Running Times") +
facet_wrap(~runner)+
theme_minimal()
# Plot the partially polled graphs
ggplot(running, aes(x = age, y = net, color = factor(runner))) +
geom_point(alpha = 0.5) + # Observed data points
geom_line(aes(y = partial_pooling_predicted), size = 1) + # Shrinkage lines from random effects model
labs(title = "Partial Pooling: Random Effects Model",
x = "Age",
y = "Running Times") +
facet_wrap(~runner) + # Separate plot per school
theme_minimal()
```
## Week 14: Bayesian LME
### Load surprise pilot 21 data and get a random sample of 10 people
```{r}
# Load necessary libraries
library(tidyverse)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lme4)
library(tibble)
library(truncnorm)
library(brms)
library(rstan)
set.seed(123)
pilot_21_data <-read.csv("pilot_21_variables.csv")
#Identify unique participants
unique_participants <- unique(pilot_21_data$Random_ID)
#Randomly sample 10 unique participants
set.seed(123) # Setting seed for reproducibility
sampled_participant_ids <- sample(unique_participants, 10)
#Extract trials corresponding to the sampled participants
sampled_pilot_21_data <- pilot_21_data %>%
filter(Random_ID %in% sampled_participant_ids)
# Print sampled data
print(sampled_pilot_21_data)
```
### Complete Pooling Model
In the complete pooling model, we assume that all participants have the same intercept and slope, ignoring the hierarchical structure.
$$
Y_{i}|β_{0},β_1,σ ∼N(μ_i,σ^2)\ with\ μ_i=β_0 + β_1X_{i}
$$
$$Anxiety_{i}=β_0+β_1PE_{i}+ϵ_{i} $$
where,
- $Y_i$ and $X_i$ represent participant $i$'s outcome (anxiety) and predictor variable (PE), respectively.
- $β_0$ is the common or **fixed intercept** for all participants.
- $β_1$ is the common or **fixed slope** for all participants.
- $ϵ_i∼N(0,σ^2)$ represents the residual error.
**Priors**:
- $β_0∼N(50,10)$
- $β_1∼N(0,5)$
- $σ∼Cauchy(0,2.5)$
**Assumptions of the complete pooling model:**
1. The observations (anxiety ratings) are normally distributed.
2. Observations are independent.
3. There is a linear relationship between the outcome and predictor variables.
![](complete_pooling_model.png)
#### brms
```{r}
# Fit complete pooling model using `brms`
complete_pooling_model <- brm(
Response_Ax ~ Response_SubjPE,
data = sampled_pilot_21_data,
family = gaussian(),
prior = c(
prior(normal(50, 10), class = Intercept),
prior(normal(0, 5.5), class = b),
prior(cauchy(0, 2.5), class = sigma)
),
iter = 2000, warmup = 1000, chains = 4, cores = 2
)
# Summary of the complete pooling model
summary(complete_pooling_model)
fixed_effects <- fixef(complete_pooling_model)
PE_effect <- fixed_effects["Response_SubjPE", "Estimate"]
# Visualize complete pooling model
ggplot(sampled_pilot_21_data, aes(x = Response_SubjPE, y = Response_Ax)) +
geom_point(alpha = 0.5, color = "blue") +
geom_abline(intercept = fixef(complete_pooling_model)["Intercept", "Estimate"],
slope = fixef(complete_pooling_model)["Response_SubjPE", "Estimate"],
color = "red") +
labs(title = "Complete Pooling Bayesian Model", x = "Prediction Error", y = "Anxiety") +
theme_minimal()+
annotate("text", x = Inf, y = -Inf, label = sprintf("Correlation: %.2f", PE_effect),
hjust = 1.1, vjust = -0.5, color = "black", size = 5, fontface = "italic")
```
This positive correlation between PE and Anxiety suggests that the more positive the PE (i.e. the more positively surprising the outcome) the more anxious people get. This is highly counter-intuitive, I think. Let's see what happens when we model the hierarchical structure of the data.
### Partial Pooling/Hierarchical Bayesian Model with Random Intercepts
Now we'll fit a hierarchical Bayesian model that allows each participant to have their own intercept but the group-level parameters are drawn from a common distribution.
$$Anxiety_{ij}=β_{0j}+β_{1}PE_{ij}+ϵ_{ij}$$
where:
- $β_{0j}∼N(μ_{β_0},σ_{β_0})$
- $β_1$ is the common or fixed slope for all participants.
- $ϵ_{ij}∼N(0,σ^2)$
**Priors**:
$μ_{β_0}∼N(50,10)$
$σ_{β0}∼Cauchy(0,2.5)$
$β_1∼N(0,5)$
$σ∼Cauchy(0,2.5)$
### **Or another way to think about it,**
$$
Y_{ij}|β_{0j},β_{1},σ_y ∼N(μ_{ij},σ^2_y)\ with\ μ_{ij}=(β_0 + b_{0j})+β_1X_{ij}
$$
$$Anxiety_{ij}=(β_0 + b_{0j})+β_1study\_hours_{ij}+ϵ_{ij} $$
- $Y_i$ and $X_i$ represent the outcome (anxiety) and predictor variable (prediction error), respectively, for trial $i$ from participant $j$.
- $β_0$ is the common or **fixed intercept** for all participants.
- $b_{0j}$ is the **random intercept** for participant *j*. It represents the deviation of the *j*-th participant's intercept from $β_0$.
- $β_1$ is the common or **fixed slope** for all participants.
- $ϵ_{ij}∼N(0,σ^2)$ represents the residual error.
**Priors**:
- $β_0∼N(50,10)$
- $b_{0j}∼N(0,σ_{0j})$
- $β_1∼N(0,5)$
- $σ_{0j} ∼Cauchy(0,2.5)$
- $σ_y∼Cauchy(0,2.5)$
**Assumptions of the hierarchical model:**
1. The observations (anxiety) are normally distributed.
2. Every participant's model deviates to some extent from the grand mean. This implies that there is some between-subject variability in the individual-level intercept adjustments.
3. There is a linear relationship between the outcome and predictor variables.
![](partial_pooling_model.png)
#### brms
```{r}
#brms
library(brms)
fit_brms_lme <- brm(
Response_Ax ~ Response_SubjPE + (1 | Random_ID),
data = sampled_pilot_21_data,
family = gaussian(),
prior = c(
prior(normal(0, 1), class = "b", coef = "Response_SubjPE"), # Using normal priors throughout here
prior(cauchy(0, 2.5), class = "sigma"),# Prior for the error term, the Cauchy distribution has some great properties with heavy tails and is quite robust
prior(cauchy(0, 2.5), class = "sd" , group= "Random_ID")
),
iter = 2000, # This is for the MCMC
chains = 4,
seed = 123
)
fit_brms_lme
# These are now posterior distributions
plot(fit_brms_lme)
# Extracting the random effects
random_effects <- ranef(fit_brms_lme)
print(random_effects)
# Optional: Extracting the random effects for Random_ID specifically
random_effects_random_id <- random_effects$Random_ID
print(random_effects_random_id)
# Create new data for predictions based only on fixed effects
new_data_fixed <- sampled_pilot_21_data %>%
distinct(Response_SubjPE) %>%
mutate(Random_ID = NA) # For fixed effects, we set Random_ID to NA to ignore random effects
# Get predictions for fixed effects only
pred_fixed <- fitted(fit_brms_lme, newdata = new_data_fixed, re_formula = NA) # re_formula = NA for fixed effects
new_data_fixed$Response_Ax <- pred_fixed[, "Estimate"]
# Get predictions including random effects
pred_random <- fitted(fit_brms_lme, newdata = sampled_pilot_21_data)
# Add these predictions back to the original data
sampled_pilot_21_data$predicted_random <- pred_random[, "Estimate"]
# Plot with ggplot
ggplot() +
# Random effects: one line per Random_ID, colored by Random_ID
geom_line(data = sampled_pilot_21_data, aes(x = Response_SubjPE, y = predicted_random, group = Random_ID, color = as.factor(Random_ID)),
alpha = 0.7) +
# Scatter plot of the raw data, points colored by Random_ID
geom_point(data = sampled_pilot_21_data, aes(x = Response_SubjPE, y = Response_Ax, color = as.factor(Random_ID)),
alpha = 0.8) +
# Fixed effects: a single line (not colored by Random_ID)
geom_line(data = new_data_fixed, aes(x = Response_SubjPE, y = Response_Ax),
color = "black", size = 1.2, linetype = "dashed") +
labs(
title = "Fixed and Random Effects Plot Colored by Random_ID",
x = "Response_SubjPE",
y = "Response_Ax",
color = "Random_ID"
) +
theme_minimal() +
theme(legend.position = "right")
```
Plot the individual regressions to compare with the LME
```{r}
# Fit individual linear models (lm) for each Random_ID
lm_fits <- sampled_pilot_21_data %>%
group_by(Random_ID) %>%
do(model = lm(Response_Ax ~ Response_SubjPE, data = .))
# Extract the predicted values for each Random_ID
predictions <- lm_fits %>%
rowwise() %>%
do({
data.frame(
Response_SubjPE = .$model$model$Response_SubjPE,
Random_ID = .$Random_ID,
predicted = predict(.$model),
Response_Ax = .$model$model$Response_Ax
)
}) %>%
ungroup()
# Now, plot the individual regression lines and points for each Random_ID
ggplot(predictions, aes(x = Response_SubjPE, y = Response_Ax, color = as.factor(Random_ID))) +
# Add individual regression lines for each Random_ID
geom_line(aes(y = predicted, group = Random_ID), alpha = 0.7) +
# Add the actual data points for each Random_ID
geom_point(aes(y = Response_Ax), alpha = 0.8) +
labs(
title = "Individual Linear Models by Random_ID",
x = "Response_SubjPE",
y = "Response_Ax",
color = "Random_ID"
) +
theme_minimal() +
theme(legend.position = "right")
```
#### stan
```{r}
#stan
library(dplyr)
library(rstan)
# Prepare data for Stan
stan_data <- list(
N = nrow(sampled_pilot_21_data),
K = 1, # Number of predictors
J = length(unique(sampled_pilot_21_data$Random_ID)),
y = sampled_pilot_21_data$Response_Ax,
x = sampled_pilot_21_data$Response_SubjPE,
participant = as.numeric(factor(sampled_pilot_21_data$Random_ID))
)
stan_model_code <- "
data {
int<lower=0> N; // number of observations
int<lower=0> K; // number of predictors
int<lower=1> J; // number of participants
vector[N] y; // response variable
vector[N] x; // predictor variable
int<lower=1,upper=J> participant[N]; // participant ID for each observation
}
parameters {
real alpha; // intercept
real beta; // slope
real<lower=0> sigma; // error scale
vector[J] u; // random effects
real<lower=0> tau; // random effect standard deviation
}
model {
// Priors
alpha ~ normal(50, 10);
beta ~ normal(0, 1);
sigma ~ cauchy(0, 2.5);
tau ~ cauchy(0, 2.5);
// Likelihood
for (n in 1:N) {
y[n] ~ normal(alpha + beta * x[n] + u[participant[n]], sigma);
}
// Random effects
u ~ normal(0, tau);
}
"
# Compile the model
stan_model <- stan_model(model_code = stan_model_code)
# Fit the model
fit <- sampling(stan_model, data = stan_data, iter = 2000, chains = 4, seed = 123)
# Print the fit summary
print(fit)
# Extract posterior samples
posterior_samples <- rstan::extract(fit)
# Calculate mean and 95% credible intervals for random effects (deviations)
u_means <- colMeans(posterior_samples$u)
u_lower <- apply(posterior_samples$u, 2, quantile, probs = 0.025)
u_upper <- apply(posterior_samples$u, 2, quantile, probs = 0.975)
# Calculate the intercept mean
alpha_mean <- mean(posterior_samples$alpha)
# Create a data frame for plotting
deviations_df <- data.frame(
participant = unique(sampled_pilot_21_data$Random_ID),
mean_deviation = u_means + alpha_mean, # Adjust deviations
lower_bound = u_lower + alpha_mean, # Adjust lower bound
upper_bound = u_upper + alpha_mean # Adjust upper bound
)
# Plot the deviations with intercept line
ggplot(deviations_df, aes(x = participant, y = mean_deviation)) +
geom_point() +
geom_errorbar(aes(ymin = lower_bound, ymax = upper_bound), width = 0.2) +
geom_hline(yintercept = alpha_mean, linetype = "dashed", color = "red") + # Intercept line
labs(title = "Individual Deviations from Intercept",
x = "Participant ID",
y = "Deviation from Intercept") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
annotate("text", x = 1, y = alpha_mean + 1, label = paste("Intercept =", round(alpha_mean, 2)), color = "red", hjust = 0)
```
**Interpretation**
- **alpha:** The fixed intercept; the average anxiety across participants when PE=0
- **beta:** The fixed slope; the average drop in anxiety which every point increase in PE.
- **sigma:** The standard deviation of anxiety ratings across participants.
- **u\[X\]:** Participant X's deviation from the fixed intercept.