---
title: "MB1 Exclusions"
author: "The ManyBabies Analysis Team"
date: '`r format(Sys.time(), "%a %b %d %X %Y")`'
output:
  html_document:
    toc: true
    toc_float: true
    number_sections: yes
---

# Intro

This script implements and documents exclusion criteria.
It outputs two datafiles:

1. `_trial`, which contains all trials for LMEMs, and
2. `_diff`, which contains trial pairs for meta-analytic effect sizes.

These datafiles are output in three different "preps":

1. `_main`, which contains all babies (including second-session),
2. `_no2ndsess`, which contains no second session babies, and
3. `_secondary`, which contains the superset of participants who will be analyzed for the 'lab factors' project.

Further exploratory analyses rely on exclusions based on more stringent data contribution standards and are implemented in `05_exploratory_analysis.Rmd`.

```{r setup, echo=FALSE, message=FALSE}
source(here::here("helper/common.R"))
source(here("helper/preprocessing_helper.R"))
```
# Exclusions
All exclusions are written up in paper format in `paper/exclusions.Rmd` and included here as a child document, so that the exclusion text in the paper matches the exclusion code exactly.
```{r child="paper/exclusions.Rmd"}
```
# Trial pairing (blinding, differences)
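Training trials are removed before pairing. A number of trials have missing stimulus numbers or trial types; this is problematic and needs to be checked. The table below lists every lab, participant, and stimulus combination that has more than two non-training trials.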
```{r}
d %>%
  filter(trial_type != "TRAIN") %>%
  group_by(lab, subid, stimulus_num) %>%
  count %>%
  filter(n > 2) %>%
  datatable
```
Next, check that no trial pair is duplicated once the training trials and the trials with a missing trial type have been removed.
The condition is `trial_pairs$n <= 2` because each stimulus should appear at most twice per participant (once per trial type); more than two trials indicates a coding error or a join error.
```{r}
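# Drop training trials and trials with no trial type
d <- filter(d, trial_type != "TRAIN",
            !is.na(trial_type))

# Count trials per lab, participant, and stimulus
trial_pairs <- d %>%
  group_by(lab, subid, stimulus_num) %>%
  count

see_if(all(trial_pairs$n <= 2),
       msg = "DUPLICATED TRIAL PAIRS")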
```
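
To make the check concrete, here is a minimal sketch on made-up data. The `toy_*` objects and their values are hypothetical and not part of the MB1 dataset; only the column names follow the chunk above. It shows what a duplicated trial pair looks like and how the `n <= 2` condition flags it.

```r
library(dplyr)

# Hypothetical toy data: participant "s2" has stimulus 1 recorded three times
toy_trials <- tibble(
  lab          = "labA",
  subid        = c("s1", "s1", "s2", "s2", "s2"),
  stimulus_num = c(1, 1, 1, 1, 1),
  trial_type   = c("IDS", "ADS", "IDS", "ADS", "IDS")
)

# One row per lab/participant/stimulus with the number of trials in `n`
toy_counts <- toy_trials %>%
  count(lab, subid, stimulus_num)

# Combinations that violate the "at most two trials per stimulus" rule
toy_offenders <- filter(toy_counts, n > 2)
toy_offenders

# Same condition as the check above; FALSE here because of "s2"
toy_ok <- all(toy_counts$n <= 2)
toy_ok
```

Note that `assertthat::see_if()` only returns `FALSE` (with the message attached as an attribute) when the condition fails. If the script should stop on duplicated pairs, `assertthat::assert_that()` with the same arguments raises an error instead.
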
## Blinding
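Current data output is **UNBLINDED.** The blinding code below, which shuffles the trial-type labels within each trial pair, is commented out.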
```{r}
# d <- d %>%
#   group_by(lab, subid, stimulus_num) %>%
#   mutate(trial_type = base::sample(trial_type))
```
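
As an illustration of what the commented-out blinding step would do if enabled, the sketch below applies the same `group_by()` and `sample()` pattern to made-up data. The `toy_*` objects are hypothetical, and `set.seed()` is there only to make the example repeatable. Within each pair the IDS and ADS labels are randomly permuted, while the looking times stay where they are.

```r
library(dplyr)

set.seed(1)  # only so that the toy example is repeatable

# Hypothetical toy data: one participant, three IDS/ADS trial pairs
toy_unblinded <- tibble(
  lab          = "labA",
  subid        = "s1",
  stimulus_num = c(1, 1, 2, 2, 3, 3),
  trial_type   = rep(c("IDS", "ADS"), 3),
  looking_time = c(8.1, 6.4, 7.0, 7.5, 5.2, 4.9)
)

# Shuffle the labels within each pair; looking times are left untouched
toy_blinded <- toy_unblinded %>%
  group_by(lab, subid, stimulus_num) %>%
  mutate(trial_type = base::sample(trial_type)) %>%
  ungroup()

toy_blinded
```

Each pair still contains exactly one IDS and one ADS label after shuffling, so downstream code runs unchanged, but the sign of any IDS minus ADS difference no longer carries information.
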
## Compute condition differences
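Computing condition differences reveals many issues in the dataset. In particular, the `spread()` call below will not work unless there are no duplicated trial pairs (see the check above). The `trial_num` recoding maps consecutive trials (1 and 2, 3 and 4, and so on) onto a shared pair index, so that the IDS and ADS looking times of each pair end up in one row.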
```{r}
diffs <- d %>%
  mutate(trial_num = floor((trial_num + 1) / 2)) %>%  # trials 1,2 -> 1; 3,4 -> 2; ...
  spread(trial_type, looking_time) %>%                # one row per pair, IDS and ADS columns
  mutate(diff = IDS - ADS)
```
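
The reshaping can be followed on a small worked example. This is a sketch on made-up data (the `toy_*` objects and looking times are hypothetical), with only the columns needed to show the mechanics; the real `d` carries more columns.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one participant, four test trials (two pairs)
toy_long <- tibble(
  lab          = "labA",
  subid        = "s1",
  trial_num    = 1:4,
  trial_type   = c("IDS", "ADS", "ADS", "IDS"),
  looking_time = c(9.0, 6.5, 4.0, 7.5)
)

toy_diffs <- toy_long %>%
  # Trials 1 and 2 become pair 1, trials 3 and 4 become pair 2
  mutate(trial_num = floor((trial_num + 1) / 2)) %>%
  # Long to wide: one row per pair, with an ADS and an IDS column
  spread(trial_type, looking_time) %>%
  mutate(diff = IDS - ADS)

toy_diffs
# Pair 1: 9.0 - 6.5 = 2.5; pair 2: 7.5 - 4.0 = 3.5
```

If a pair contained two trials of the same type, two rows would share the same key after recoding and `spread()` would fail, which is why the duplicate check has to pass first.
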
# Construct missing variable
> To test for effects of moderators on the presence of missing data, we constructed a categorical variable (missing), which was true if a trial had no included looking time (e.g., no looking recorded, a look under 2 s, or no looking because the infant had already terminated the experiment).

There is probably a better place to construct this variable, but it depends on assumptions about the structure of the dataset, so it seemed safest to put it after all the validation and checking is done.

Critically, this step fills in the dataset with blank rows, one for every trial number a participant did not contribute. All moderators for the missingness analysis must be filled in on those rows. That means `method` and the age variables need to be complete for each participant, plus other possible moderators (`nae` in particular). This is done in the final `mutate()` call below, which copies each participant's first non-missing value to all of their rows.

```{r}
d <- d %>%
  ungroup %>%
  # Add a row for every trial number each participant did not contribute
  complete(trial_num, nesting(lab, subid)) %>%
  mutate(missing = is.na(looking_time) | looking_time < 2) %>%
  group_by(lab, subid) %>%
  # Copy participant-level moderators onto the newly added rows
  mutate(age_days = age_days[!is.na(age_days)][1],
         age_mo = age_mo[!is.na(age_mo)][1],
         age_group = age_group[!is.na(age_group)][1],
         method = method[!is.na(method)][1],
         nae = nae[!is.na(nae)][1])
```
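
The following sketch walks through the same three steps on made-up data: completing the trial grid, flagging missing trials, and filling a participant-level moderator. The `toy_*` objects and the `method` labels are hypothetical, and only one moderator is filled to keep the example short.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: "s1" has three trials (one short look, one NA),
# "s2" stopped after the first trial
toy_obs <- tibble(
  lab          = "labA",
  subid        = c("s1", "s1", "s1", "s2"),
  trial_num    = c(1, 2, 3, 1),
  looking_time = c(5.0, 1.2, NA, 6.3),
  method       = c("hpp", "hpp", "hpp", "eyetracking")
)

toy_filled <- toy_obs %>%
  # Adds trials 2 and 3 for "s2", with NA in all other columns
  complete(trial_num, nesting(lab, subid)) %>%
  # Missing = no looking time at all, or a look under 2 s
  mutate(missing = is.na(looking_time) | looking_time < 2) %>%
  group_by(lab, subid) %>%
  # Fill the moderator on the added rows from the first non-missing value
  mutate(method = method[!is.na(method)][1]) %>%
  ungroup() %>%
  arrange(subid, trial_num)

toy_filled
```

`nesting(lab, subid)` restricts the expansion to lab and participant combinations that actually occur, so `complete()` does not invent participants in labs where they were never tested.
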
# Remove unvalidated variables
This step drops variables that were only needed for the exclusions, or that carry other unvalidated information, leaving a smaller, cleaner dataset.
```{r}
d <- d %>%
  select(lab, subid, subid_unique, trial_order, trial_num,
         trial_type, stimulus_num, method, age_days, age_mo, age_group,
         nae, gender, second_session, looking_time, missing)
```

# Output

The script outputs two datafiles:

1. `_trial`, which contains all trials for LMEMs, and
2. `_diff`, which contains trial pairs for meta-analytic effect sizes.

These datafiles are output in three different "preps":

1. `_main`, which contains all babies (including second-session),
2. `_no2ndsess`, which contains no second session babies, and
3. `_secondary`, which *also* retains babies who had a session-level error.

## Main dataset with second-session babies
Note that the total number of second-session babies is very small. We planned this analysis, but we are unlikely to learn much from it. The chunk below counts, per lab, the participants whose trials all come from a second session.
```{r}
d %>%
  group_by(lab, subid) %>%
  summarise(second_session = all(second_session)) %>%
  filter(second_session) %>%
  count
```
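
Because `summarise()` drops only the last grouping variable by default, the result above is still grouped by `lab`, so `count` reports one number per lab, not a single total. The sketch below, on made-up data with hypothetical `toy_*` objects, shows both the per-lab count and the overall total.

```r
library(dplyr)

# Hypothetical toy data: "s1" and "s3" are second-session babies, "s2" is not
toy_sessions <- tibble(
  lab            = c("labA", "labA", "labA", "labA", "labB", "labB"),
  subid          = c("s1", "s1", "s2", "s2", "s3", "s3"),
  second_session = c(TRUE, TRUE, FALSE, FALSE, TRUE, TRUE)
)

# One row per participant, keeping only second-session participants
toy_flags <- toy_sessions %>%
  group_by(lab, subid) %>%
  summarise(second_session = all(second_session), .groups = "drop") %>%
  filter(second_session)

toy_per_lab <- count(toy_flags, lab)  # one row per lab
toy_total   <- nrow(toy_flags)        # overall number of participants

toy_per_lab
toy_total
```
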
Output.
```{r}
write_csv(d, "processed_data/03_data_trial_main.csv")
write_csv(diffs, "processed_data/03_data_diff_main.csv")
#write_csv(d, "processed_data/03_data_trial_secondary_UNBLINDED.csv")
#write_csv(diffs, "processed_data/03_data_diff_secondary_UNBLINDED.csv")
```
## Dataset without second session babies
```{r}
d_no2s <- exclude_by(d, quo(second_session))
diffs_no2s <- exclude_by(diffs, quo(second_session))
```
and write:
```{r}
write_csv(d_no2s, "processed_data/03_data_trial_no2ndsess.csv")
write_csv(diffs_no2s, "processed_data/03_data_diff_no2ndsess.csv")
```
## Dataset with *only* second session babies
```{r}
d_2s <- exclude_by(d, quo(!second_session), action = "exclude")
diffs_2s <- exclude_by(diffs, quo(!second_session), action = "exclude")
```
and write:
```{r}
write_csv(d_2s, "processed_data/03_data_trial_2ndsess.csv")
write_csv(diffs_2s, "processed_data/03_data_diff_2ndsess.csv")
```