-
Notifications
You must be signed in to change notification settings - Fork 0
/
vignette-tvem-pdf.Rmd
353 lines (258 loc) · 25.3 KB
/
vignette-tvem-pdf.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
---
title: "Fitting time-varying effects models"
author: "John J. Dziak"
date: "`r Sys.Date()`"
output: rmarkdown::pdf_document
vignette: >
%\VignetteIndexEntry{Fitting time-varying effects models}
%\VignetteEngine{knitr::pdf_document}
%\VignetteEncoding{UTF-8}
---
```{r setup, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>"
)
```
In this example we simulate a longitudinal dataset and fit a simple time-varying coefficient model to it using the `tvem` package. We show that this can be done for either a continuous or a binary outcome variable. Time-varying coefficient models are discussed further by Tan, Shiyko, Li, Li and Dierker (2012). They are an application of the varying-coefficient models approach of Hastie and Tibshirani (1993) to intensive longitudinal data.
Before running the examples, first install and load the `tvem` package.
A `.zip` or `.tar.gz` file containing the package is available at [https://github.com/dziakj1/TvemPackage_development](https://github.com/dziakj1/TvemPackage_development), and it can then be used with the `install.packages()` function in R code,
`Packages > Install Package(s)` from Local Files in the R graphical user interface,
or `Tools > Install Packages` in the RStudio application, to install the package.
The package is also available at [https://github.com/dziakj1/tvem](https://github.com/dziakj1/tvem) for use with the `install_github` function as follows:
```{r,eval=FALSE}
install.packages(’devtools’)
library(devtools)
install_github(’dziakj1/tvem’)
```
It is also available from the CRAN archive using the command
```{r,eval=FALSE}
install.packages("tvem")
````
Of course, if you are viewing this guide from within R using the `vignette()` function, then the package is already installed.
After you install the package on your system, you can then load it as usual with the `library()` function.
```{r}
library(tvem)
```
# Example with a continuous outcome variable
The `tvem` package has a function for simulating a dataset. It is good to start by specifying a random seed.
```{r}
set.seed(123)
the_data <- simulate_tvem_example()
```
## Exploring the dataset
When analyzing any dataset, it is important to examine it descriptively first.
```{r}
print(head(the_data))
print(summary(the_data))
```
The dataset is in long form (one row per observation, with multiple observation times for each participant). There are 300 participants. The observation time ranges from 0 (thought of as the beginning of an intensive study) to 7 (imagined as the end of the study seven days later). There is a single response variable $y$, and two predictor variables (covariates), $x_1$ and $x_2$. In the context of an intensive longitudinal study with human participants, these variables might be ratings of different feelings, symptoms, or behaviors, reported a few times per day at random times, during seven days following an event (such as the beginning of an intervention, treatment, lifestyle change, etc.). The values of the covariates and the response vary over time within subject. The analyst wishes to find out whether their means change systematically over time, whether they are interrelated, and whether this relationship, if it exists, changes over time.
## Plotting average change over time
One easy thing to do is to investigate whether and how the response changes over time on average. This is simply curve fitting, similar to polynomial regression, but can be fit using the TVEM function, in an approach sometimes called `intercept-only TVEM.' This approach uses a spline function to approximate the average change in $y$ over time.
By default, the tvem function will fit a penalized B-spline (de Boor, 1972), which is called a "P-spline" in the terminology of Eilers and Marx (1996). This approach uses an automatic tuning penalty to choose the level of smoothness versus flexibility of the fitted function. It is similar, though not identical, to the P-splines used in the Methodology Center's `%TVEM` macro for the SAS programming language. The P-splines used by the `%TVEM` macro are penalized truncated power splines (Li et al., 2017; see Ruppert, Wand & Carroll, 2003).
```{r}
model1 <- tvem(data=the_data,
formula=y~1,
id=subject_id,
time=time)
```
You also have the option to turn off the penalty and control the smoothness yourself, by specifying the number of interior knots, here 2.
```{r}
model1_2knots_unpenalized <- tvem(data=the_data,
formula=y~1,
id=subject_id,
num_knots=2,
penalize=FALSE,
time=time)
```
The implied mean model is $E(y|t) =\beta_0(t)$ where $t$ is time in days.
After fitting a model, you can print a summary and plot the estimated coefficient.
```{r,fig.width=4,fig.height=4}
print(model1_2knots_unpenalized)
plot(model1_2knots_unpenalized)
```
The plot shows the estimated coefficient function, and approximate estimates for 95% pointwise confidence intervals (not corrected for potential multiple comparisons) for the value of the function at each time. The expected value of $Y$, represented in this model by the single time-varying function $\beta(t)$, is seen increasing from about 2 at the beginning of the interval of study to about 7 at the end.
We do not estimate random effects in the current version of this package. Instead, we use a (possibly penalized) form of generalized estimating equations with working independence, and adjust the standard errors for within-subject correlation using a sandwich formula.
## Using the `select_tvem` function
It is a very good idea to examine how the covariate means change over time. That is, we should fit intercept-only TVEMs for $x_1$ and $x_2$, not just $y$.
While doing this, we can take the opportunity to additionally explore yet another way to fit a model using the `tvem` function, by using the `select_tvem` function to choose the number of interior knots by a pseudolikelihood equivalent to an AIC or BIC criterion. "Pseudolikelihood" here means that the information criterion does not take within-subject correlation into account because we are fittin a marginal model agnostic to the exact correlation structure. The code below will fit the model with 0 to 5 interior knots, record the fit criterion for each potential choice, and select the one which gives the lowest (best) value of the criterion.
```{r,fig.width=4,fig.height=4}
model2 <- select_tvem(data=the_data,
formula = x1~1,
id=subject_id,
time=time,
max_knots=5)
print(model2)
plot(model2)
```
In the plot, the expected value of $X_1$ decreases from about 6 at the beginning of the interval of study to about 4 at the end. The highest number of interior knots in the range is selected by `select_tvem`. It might be reasonable to run the function again with a higher value of `max_knots`, to see whether there might be an even better fit that had not been found yet. Note that the individual knots and their locations do not have a special interpretation (as they do in some ``changepoint'' models); they are just a tool to make the fitted function more flexible than a simple polynomial would be. Let us try again with a maximum of ten interior knots.
```{r,fig.width=4,fig.height=4}
model2_selected_knots <- select_tvem(data=the_data,
formula = x1~1,
id=subject_id,
time=time,
max_knots=10)
print(model2_selected_knots)
plot(model2_selected_knots)
```
This is visually almost identical to the previous plot. The fit continues improving with higher and higher numbers of knots, but the differences in fit are so small no further insight would be gained by trying more. The `use_bic` option can be set to `TRUE` in order to use a more parsimonious criterion that might be minimized at a smaller number of knots.
```{r,fig.width=4,fig.height=4}
model2_selected_knots_bic <- select_tvem(data=the_data,
formula = x1~1,
id=subject_id,
use_bic=TRUE,
time=time,
max_knots=10)
print(model2_selected_knots_bic)
plot(model2_selected_knots_bic)
```
Now five knots have been selected, although again the plot does not look much different.
It would be good to repeat this analysis to plot the mean of $x_2$ over time in the same way.
## Time-varying effects of covariates
Next, we fit a TVEM model with covariates. We allow both $x_1$ and $x_2$ to potentially have ``time-varying effects'' (regression relationships with the response that change over time, that is, a potential interaction between time and the covariate, specified without the assumption of linearity). The implied mean model is $E(y|t, x_1(t),x_2(t)) =\beta_0(t)+\beta_1(t)x_1(t)+\beta_2(t)x_2(t)$ where $t$ is time in days.
```{r,fig.width=4,fig.height=4}
model3 <- tvem(data=the_data,
formula=y~x1+x2,
id=subject_id,
time=time)
print(model3)
plot(model3)
```
The output includes the formula which is sent to the back-end calculation package, the `mgcv` package by Simon Wood, which is described in depth in Wood (2017). It is not necessary to interpret the pieces of the formula in order to understand the output; it is mainly supplied for advanced users and potential debugging use. In summary, it says that $y$ depends on $x_1$ and $x_2$ as parametric terms, on a set of spline terms based on time, on a set of spline terms based on time multiplied by $x_1$, and on a set of spline terms based on time multiplied by $x_2$. Further details on interpreting the formula can be found in the `mgcv` documentation.
Holding $x_1$ and $x_2$ constant, the mean of $y$ seems to decline over time. The penalty function causes the the relationship to be estimated as linear because there is no clear evidence of nonlinearity. From the results, $x_1$ appears to have an increasingly positive relationship with $y$ over time. That is, $x_1$ and $y$ are more strongly positively related as time goes on.
$x_2$ also seems to predict $y$ (because the curve is not near zero), but the strength of the relationship does not change over time. That is, at any given time the predictive relationship between $x_2$ and $y$ seems to be about equally strong. Thus, we could reasonably fit a similar but simpler model, but with $x_2$ having a non-time-varying effect even though it has time-varying values.
## Time-invariant effects of covariates
The following code fits a model where $x_2$ is assumed to have a time-invariant effect but $x_1$ is allowed to have a time-varying effect.
```{r,fig.width=4,fig.height=4}
model4 <- tvem(data=the_data,
formula=y~x1,
invar_effect=~x2,
id=subject_id,
time=time)
print(model4)
plot(model4)
```
The implied mean model is $E(y|t) =\beta_0(t)+\beta_1(t)x_1(t) + \beta_2 x_2(t).$ Notice that only covariates with time-varying effects are listed in the formula argument. The invar_effect argument is used for covariates with time-invariant effects. Also notice that a tilde (`~`) sign is needed before the list of covariates with time-invariant effects, because it is treated as the right-hand side of a formula. If there were multiple covariates with time-invariant effects, they would be listed as follows: `~x2+x3+x4` which again follows R style for the right-hand side of a formula.
In the output, there is no longer a plot for the coefficient of $x_2$ as a function of time, because $\beta_2$ is considered constant over time even if $x_2$ varies. Instead, the estimate and estimated standard error of $\beta_2$ are given as text in the output from `print`. The plots for the intercept and `x1` coefficients do not look much different from before. The estimated intercept function now appears slightly nonlinear over time, but it is not clear whether this departure from a line is significant; in fact, it is not clear that the function is appreciably changing over time at all, due to the wide confidence intervals.
The seeming oscillations in confidence interval width in the plot do not have any particular interpretation; they are just an artifact of the locations of the knots. Also, the confidence intervals shown are approximate pointwise confidence intervals, much like those in the Methodology Center's `%TVEM` SAS macro. They are not simultaneous confidence bands, so they do not directly correct for multiple comparisons.
The main takeaway message from the analysis is the increasing $\beta_1(t)$ over time, suggesting an increasing association between $x_1$ and $y$, that is, some kind of interaction between $t$ and $x_1$ in predicting $y$.
# Example with a binary outcome variable
This example will be similar to the previous one, but with a binary response variable. The `tvem` library's simulation function will simulate binary $y$ if requested by an optional argument.
```{r}
set.seed(123)
the_data <- simulate_tvem_example(simulate_binary=TRUE)
```
The simulated dataset is similar to the previous example, but with binary $y$ (0=no, 1=yes) generated from a logistic model.
## Plotting the average log odds over time
We can plot the expected log odds over time, using a time-varying-intercept-only logistic model. The model assumes $\mathrm{logit}(E(Y|t))=\beta_0(t)$. The binary outcome is specified using the family argument as in R's `glm` function.
```{r}
model_binary1 <- tvem(data=the_data,
formula=y~1,
family=binomial(),
id=subject_id,
time=time)
```
As before, you can use the default option of an automatic penalty function, or you could specify the number of knots, or you could use automatic selection for the number of knots without a penalty.
## Including covariates
A model with covariates is similar to the previous example.
```{r}
model_binary2 <- tvem(data=the_data,
formula=y~x1,
invar_effect=~x2,
id=subject_id,
family=binomial(),
time=time)
print(model_binary2)
plot(model_binary2)
```
The implied mean model is $\mathrm{logit}(E(y|t)) =\beta_0(t)+\beta_1(t)x_1(t) + \beta_2 x_2(t).$ In this simulated example, the relationship of $x_1$ with $y$ appears to change over time, becoming more strong and positive as in the previous example.
With logistic TVEM, we can additionally plot exponentiated coefficients to get odds and odds ratios. In the plots below, the exponentiated intercept function shows the estimated odds of $Y=1$ at a given time, assuming $x_1=x_2=0$. The exponentiated beta function for $x_1$ shows the estimated odds ratio for $Y=1$ given a 1-unit increase in $x_1$ at a given time.
```{r}
plot(model_binary2, exponentiate=TRUE)
```
Although we do not present an example, Poisson count outcomes with a log link can be modeled similarly by specifying `family=poisson()` instead of `family=binomial()`. However, for overdispersed or zero-inflated counts, the Poisson distribution might not be appropriate. Currently, the only distributions supported by the `tvem` package are normal with identity link function for continuous numerical responses, binomial with logistic link for binary responses, and Poisson with log link for count responses.
# Analysis with Sampling Weights
In some studies, it is helpful to weight participants within the sample, so that some participants have more influence on the final estimates than others. This is often done to counteract some characteristic of the study design or implementation which could cause bias if not properly adjusted for. For example, if people from a particular group were more likely to be included in the sample than other people, certain kinds of statistics for the overall population can still be estimated in an unbiased way, but they might require that these oversampled people be weighted downwards.
The `tvem` package can account for sampling weights, by multiplying each participant's contribution to the pseudolikelihood by that participant's sample weight. Ordinarily, the `tvem` function will automatically normalize the weights, which means adjusting them so that they have an average of 1. Without this adjustment, the standard errors might be calculated incorrectly because the number of subjects could be miscounted. For example, a participant with a weight of 100 should not really count as 100 separate participants for purposes of determining sample size. If you are in an unusual situation where you need to leave the weights as specified, use the option `normalize_weights=FALSE`. However, if you are in any doubt, leave `normalize_weights` at its default of `TRUE`.
As a simplistic example, suppose a researcher records the height in centimeters of children on their tenth birthday, and twice a year afterwards until their sixteenth birthday. Suppose that in the population of interest, 50% of the children were male and 50% were female. However, suppose girls were less likely to agree to be in the study. Therefore, in the final sample, about 2/3 of the participants were boys, thus twice as many boys as girls. The investigator consults with a statistician and decides to give each girl twice the sampling weight of a boy, so that the overall weighted results give equal attention to boys and girls in calculating overall estimates. The following code will simulate a fairly realistic dataset from such a study.
```{r}
set.seed(12345)
n <- 100
simulated_dataset <- NULL
for (this_id in 1:n) {
age <- seq(10,16,by=.5)
female <- rbinom(1,1,1/3)
sample_weight <- ifelse(female==1,yes=2,no=1)
if (female==1) {
logistic_curve <- round(rnorm(1,0,1) + 135 +
runif(1,.9,1.1)*(160-135)/(1+exp(-.75*(age-13))) +
rnorm(length(age),0,.5),1)
} else {
logistic_curve <- round(rnorm(1,0,1) + 140 +
runif(1,.9,1.1)*(175-140)/(1+exp(-(age-13))) +
rnorm(length(age),0,.5),1)
}
simulated_dataset <- rbind(simulated_dataset,
data.frame(id=this_id,
sample_weight=sample_weight,
female=female,
age=age,
height=logistic_curve))
}
summary(simulated_dataset)
```
The simplest TVEM that can be done with this data is an estimate of the average growth curve over time, averaging over individual differences and ignoring sex. In this case, the intercept is the only time-varying coefficient and becomes the estimate of the fitted mean conditional on time. The following code fits this model with and without weights. A lower number of knots than usual is used, because there are relatively few distinct values of time (age) in the dataset.
```{r}
tvem1_unweighted <- tvem(data=simulated_dataset,
formula=height~1,
id=id,
num_knots=5,
time=age)
tvem1_weighted <- tvem(data=simulated_dataset,
formula=height~1,
id=id,
num_knots=5,
time=age,
weights=sample_weight)
plot(tvem1_weighted)
```
The plots of the two models look superficially similar so only one is shown above. It shows an increase in average height over time, following a slightly nonlinear trajectory that seems to accelerate around age 13.
However, a close examination of the fitted results shows that the weighted dataset leads to an estimated average height of about 0-3 centimeters shorter than the unweighted dataset, depending on age.
```{r}
print(summary((tvem1_unweighted$grid_fitted_coefficients$`(Intercept)`$estimate)))
print(summary((tvem1_weighted$grid_fitted_coefficients$`(Intercept)`$estimate)))
```
This is because the weighted analysis attempts to generalize to a population with equal numbers of girls and boys, while the unweighted analysis attempts to generalize to a population of mostly boys. Boys are taller than girls on average, during their teenage years.
This example is not particularly compelling as an illustration of the need for weights, because if weights depend only on sex then conditioning on sex in the model will the weights unnecessary. Once sex (dummy-coded as 1 for female and 0 for male) is represented as part of the estimated mean model, the weight becomes unnecessary. Note that sex has a time-varying effect on height even if sex itself remains constant, because the difference in height between males and females depends on age.
```{r}
tvem2_unweighted <- tvem(data=simulated_dataset,
formula=height~female,
id=id,
num_knots=5,
time=age)
tvem2_weighted <- tvem(data=simulated_dataset,
formula=height~female,
id=id,
num_knots=5,
time=age,
weights=sample_weight)
plot(tvem2_weighted)
print(summary((tvem2_unweighted$grid_fitted_coefficients$`(Intercept)`$estimate)))
print(summary((tvem2_weighted$grid_fitted_coefficients$`(Intercept)`$estimate)))
```
The plot at left shows the estimated expected height by age for participants with ``female==0`` (males), and the plot at right shows the signed difference between groups (very roughly speaking, the effect on height of being female). Of course, the plot at right does not mean that girls become shorter over time, because girls are also affected by the increasing intercept function. It only means that the average height of girls becomes smaller in comparison to boys of the same age. The fact that it is sometimes difficult to distinguish between within-person changes and between-person differences is one reason why TVEM results need to be interpreted thoughtfully.
In this example, including sex as a covariate renders it unnecessary to weight by sex. However, in a more complicated and realistic example, the probabilities of entering the sample might depend on factors such as city of residence, racial or ethnic background, economic class, etc., which may not be of interest as meaningful covariates in our model. In some public use datasets, the full set of variables which would be needed to construct the weights are not provided, for purposes of confidentiality of the original participants. If you are in doubt, don't try to construct sampling weights by yourself. However, if you are analyzing an existing dataset that has a weights variable (as in many large-scale social or economic surveys), the `weights` argument allows you to use them in `tvem`. As a caveat, more complex kinds of social surveys, involving clusters and subsampling, may not be supported correctly in this package.
# References
* de Boor, C. (1972). On calculating with B-splines. Journal of Approximation Theory, 6: 50–62.
* Eilers, P. H. C., & Marx, B. D. (1996). Flexible smoothing with B-splines and penalties. Statistical Science, 11: 89-121.
* Hastie, T, & Tibshirani, R. (1993). Varying-coefficient models. Journal of the Royal Statistical Socety, B, 55:757-796.
* Li, R., Dziak, J. J., Tan, X., Huang, L., Wagner, A. T., & Yang, J. (2017). TVEM (time-varying effect model) SAS macro users' guide (Version 3.1.1). University Park: The Methodology Center, Penn State. Available online at [https://www.methodology.psu.edu/downloads/tvem/])(https://www.methodology.psu.edu/downloads/tvem/) and archived at [https://github.com/dziakj1/MethodologyCenterTVEMmacros](https://github.com/dziakj1/MethodologyCenterTVEMmacros) and [https://scholarsphere.psu.edu/collections/v41687m23q](https://scholarsphere.psu.edu/collections/v41687m23q).
* Ruppert, D., Wand, M. P., & Carroll, R. J. (2003). Semiparametric regression.
Cambridge: Cambridge University Press.
* Tan, X., Shiyko, M. P., Li, R., Li, Y., & Dierker, L. (2012). A time-varying effect model for intensive longitudinal data. Psychological Methods, 17: 61-77.
* Wood, S. (2017). Generalized Additive Models: An Introduction with R, 2nd edition. Chapman and Hall/CRC.
# Acknowledgements
This software was developed by John Dziak at the Methodology Center at Pennsylvania State University, under the supervision of Donna Coffman of Temple University.
We thank Patricia Berglund, Danielle Cabel, and Yajnaseni Chakraborti for their very valuable help with testing the package and documentation.
The development of this software was part of a research project supported by National Institutes of Health grants P50 DA039838 from the National Institute of Drug Abuse and 1R01 CA229542-01 from the National Cancer Institute and the NIH Office of Behavioral and Social Science Research. Content is solely the responsibility of the authors and does not necessarily represent the official views of the funding institutions mentioned above.
This software is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.