# Regression as Mean- and Covariance-Structure Analysis {#ch1}
This chapter prepares the reader to learn about structural equation modeling (SEM) by reviewing the fundamentals of ordinary least-squares (OLS) regression in the general(ized) linear modeling (GLM) framework. Different types of data are discussed, and some review of matrix algebra will be incorporated into the regression review.
## What Are Mean and Covariance *Structures*?
Another (older) name for an SEM is *covariance structure analysis*. What is meant by this?
Covariance is the linear relationship we observe between 2 variables (more on this below). There might be many reasons why 2 variables covary, such as
- one variable influences the other
- common causes influence them both
- both variables appear in a chain or web of causal effects
A covariance can be partitioned into components attributable to these different reasons. The decomposition of covariances will be explained after introducing path analysis, but first recall that many statistical procedures you may already have learned about partition variance into components:
- Regression: explained ($R^2$) vs. unexplained (residual) variance
- ANOVA: variance between vs. within groups
- Multilevel modeling: Level-1 vs. Level-2 variance
- and (un)explained at each level
- Between vs. within multiple imputations of missing data
Regression models in the GLM framework primarily analyze how the mean and variance of the outcome are structured.
- A grand mean ($\bar{y}$) is an unconditional expected value, whereas a predicted value from a regression model ($\hat{y}$) is an expected value under a particular condition (e.g., $\hat{y}=\beta_0$ when $X=0$). We calculate these conditional expectations as the intercept ($\beta_0$) plus each predictor times its slope: $\sum \beta_j x_j$.
- Variance is partitioned into (un)explained variance (e.g., $R^2$), and its structure can be further explored via unique contributions of each predictor (e.g., partial $\eta^2$ in AN(C)OVA).
SEM is a multivariate method capable of partitioning means and variances of each variable, as well as partitioning the covariance between any pair of variables.
### Types of Data Used in SEM {#ch1-1-1}
SEM can be seen as a generalization of the GLM for multivariate data. But rather than merely allowing multiple predictors and multiple outcomes, any variable can operate **both** as a predictor and an outcome in the same model. This allows us to model complex relationships among variables, including chains of causation (i.e., mediation).
Estimation of a GLM involves minimizing discrepancies (differences) between observed ($y$) and expected ($\hat{y}$) *casewise* values (i.e., an individual subject's score on an outcome variable $y$). These discrepancies are called "residuals" ($y - \hat{y}$). In contrast, estimation of an SEM involves minimizing discrepancies between observed and expected *summary statistics* (i.e., the modeled variables' means and covariance matrix; see the table below). This is why another (older) name for an SEM is *covariance structure analysis* (when only trying to explain why variables are related to each other), or mean and covariance structure (MACS) analysis when means are also modeled.
| | **Casewise observations** (GLM) | **Mean vector** (SEM) | **Covariance matrix** (SEM) |
|:-------------|:------------------:|:-------------------------:|:---------------------------:|
| Observations | $y_i$ | $\bar{y}$ | $\mathbf{S}$ |
| Expectations | $\hat{y}_i$ | $\widehat{\mu}$ | $\widehat{\Sigma}$ |
| Residuals | $y_i - \hat{y}_i$ | $\bar{y} - \widehat{\mu}$ | $\mathbf{S} - \widehat{\Sigma}$ |
Summary statistics can be calculated from observed raw data, so raw data can be provided directly to SEM software for analysis. But if data are complete and come from a multivariate normal distribution, summary statistics can also be passed directly to the SEM software, which has advantages. Unlike raw data, summary statistics do not compromise the privacy of the research participants, so summary statistics can be reported in published empirical research. That allows research-consumers (like you) to have access to the same summary statistics to replicate the results or to fit a different model (e.g., that represents a competing theory). Thus, this chapter will show how to import both raw and summary data, as well as how to calculate summary statistics from raw data so that they can be reported along with results. First, we provide some information about interpreting the summary statistics most relevant for SEM.
### Interpreting Covariance
Whether in a GLM or as part of a larger SEM, a regression model includes both mean-structure parameters (intercepts) and covariance-structure parameters (slopes, residual variance). Generally speaking, covariance-structure parameters are informative about how variables are related to each other, and even the variances of (and covariances among) predictors are pivotal to estimating regression slopes. In GLM, this is not immediately apparent from regression output, so later in this chapter we will demonstrate how to manually calculate regression slopes from raw data, revealing the role of summary statistics in estimating slopes. But first, **what is covariance?**
Recall that a variable's variance ($\sigma^2_y$) is essentially the average squared deviation from $\bar{y}$, which quantifies how much individual scores vary:
$$\text{Var}(y) = \frac{\sum_{i=1}^N (y_i - \bar{y})^2}{N-1} = \frac{\sum_{i=1}^N (y_i - \bar{y})(y_i - \bar{y})}{N-1}$$
Covariance ($\sigma_{xy}$) quantifies how $x$ and $y$ are linearly related (*co*vary means "vary together").
$$\text{Cov}(x,y) = \frac{\sum_{i=1}^N (x_i - \bar{x})(y_i - \bar{y})}{N-1}$$
Thus, a variance can be seen as a special case of covariance, when $x=y$.
A covariance's absolute value is difficult to interpret because it is in a squared metric. For a variance, we usually overcome this limitation by interpreting its square root: the standard deviation ($\sigma$, or sample estimate *SD*), expressed in the original units of measurement. In contrast, the square root of a *co*variance does not have a meaningful or practical interpretation.
However, the covariance between 2 variables cannot exceed the product of their *SD*s ($\pm \sigma_x \sigma_y$), so we can standardize a covariance via division of $\sigma_{xy}$ by $\sigma_x \sigma_y$. This limits the range to $\pm 1$, and we call the standardized covariance a *correlation coefficient*:
$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$
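This identity is easy to check numerically. Below is a minimal sketch using simulated (hypothetical) data rather than the data set analyzed later in this chapter:
```{r cor demo}
## a minimal sketch with simulated (hypothetical) data
set.seed(1)
x <- rnorm(100)
y <- 0.5 * x + rnorm(100)
cov(x, y) / (sd(x) * sd(y)) # covariance divided by the product of SDs
cor(x, y)                   # matches the correlation coefficient
```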
For $>2$ variables in a data set $\mathbf{Y}$, we can simultaneously calculate all (co)variances using matrix algebra:
1. Mean-center the data-matrix
- $\mathbf{Y} - \bar{\mathbf{Y}}$
2. "Square" it by (pre)multiplying it by its transpose
3. Divide the result by the sample size minus 1 (i.e., $N-1$)
$$\Sigma = \frac{1}{N-1} (\mathbf{Y} - \bar{\mathbf{Y}})^{'} (\mathbf{Y} - \bar{\mathbf{Y}})$$
Each off-diagonal cell of $\Sigma$ contains the covariance between the row-variable and the column-variable. A covariance matrix is symmetric above/below the diagonal because the covariance (or correlation) between $x$ and $y$ is the same as between $y$ and $x$. On the diagonal of $\Sigma$, the row and column variables are the same---how does a variable covary with itself? The diagonal contains each variable's variance. Some authors refer to $\Sigma$ as a "variance--covariance matrix", but that is redundant because variances appear on the diagonal by definition (see Cov($x,y$) formula above). So the term "covariance matrix" is sufficiently descriptive.
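The matrix formula can be verified directly in R. The following minimal sketch uses a small simulated (hypothetical) data matrix; the same manual calculation is repeated with this chapter's example data in the *Regression Using Matrix Algebra* section below:
```{r Sigma demo}
## a minimal sketch with a simulated (hypothetical) 3-variable data matrix
set.seed(2)
Y <- cbind(y1 = rnorm(50), y2 = rnorm(50), y3 = rnorm(50))
Yc <- scale(Y, center = TRUE, scale = FALSE) # Step 1: mean-center the data matrix
t(Yc) %*% Yc / (nrow(Y) - 1)                 # Steps 2-3: "square" it, divide by N - 1
cov(Y)                                       # matches the built-in function
```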
Standardizing a covariance (i.e., scaling by both *SD*s) is only one way to facilitate its interpretation. If it makes sense to think of one variable $x$ affecting another variable $y$, then another way to scale the covariance would be to divide it by the variance of $x$:
$$\beta_{yx} = \frac{\sigma_{xy}}{\sigma^2_x} = \frac{\sigma_{xy}}{\sigma_x \sigma_x}$$
This provides a familiar interpretation from simple regression of $y$ on $x$: the slope ($\beta_{yx}$) is interpreted as the change in $y$ per unit $x$.
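This rescaling is also easy to verify numerically; here is a minimal sketch using simulated (hypothetical) data:
```{r slope demo}
## a minimal sketch: a simple-regression slope is a rescaled covariance
set.seed(3)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
cov(x, y) / var(x)   # covariance divided by the predictor's variance
coef(lm(y ~ x))["x"] # matches the simple-regression slope
```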
Recall that a standardized simple regression slope ($\beta^*_{yx}$) is the correlation $\rho_{xy}$. The transformation below reveals why:
```{=tex}
\begin{align*}
\beta^*_{yx} &= \beta_{yx} \times \frac{\sigma_x}{\sigma_y} \\
&= \frac{\sigma_{xy}}{\sigma_x \sigma_x} \times \frac{\sigma_x}{\sigma_y} \\
&= \frac{\sigma_{xy}}{\sigma_x \sigma_y} (= \rho_{xy}),
\end{align*}
```
where one pair of $\sigma_x$ terms cancels out. This transformation is how slopes can be standardized even when there are multiple predictors. Partial regression slopes can also be calculated from summary statistics, but the formulas are more complicated because they take into account how the predictors are related to each other.
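A similar sketch with simulated (hypothetical) data confirms the identity numerically:
```{r std slope demo}
## a minimal sketch: standardizing a simple-regression slope yields the correlation
set.seed(4)
x <- rnorm(100)
y <- 2 * x + rnorm(100)
b <- coef(lm(y ~ x))["x"] # unstandardized slope
b * sd(x) / sd(y)         # rescaled by the ratio of SDs
cor(x, y)                 # matches the correlation
```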
The take-home message is that (un)standardized slopes and (zero-order or partial) correlations are just different ways of scaling a covariance to make it more interpretable. Like covariances, correlations are agnostic about whether one variable affects another, but slopes imply a causal direction (although other conditions must be met to draw a valid causal inference).
## Importing Data for SEM
This section covers importing both raw data and summary statistics, either of which can be analyzed by SEM software. We will utilize some special features in the open-source R package for SEM called [`lavaan`](https://lavaan.ugent.be/) [(Rosseel, 2012)](http://dx.doi.org/10.18637/jss.v048.i02), which is also the primary software we use for conducting SEM analyses in later chapters.
### Installing `lavaan`
To install the `lavaan` package you can type the following command in the R console:
```{r install, eval=FALSE}
install.packages("lavaan", dependencies = TRUE)
```
Depending on your [R](https://www.r-project.org/)([Studio](https://www.rstudio.com/)) settings, this might open a window where you have to select a [CRAN](https://cran.r-project.org/) mirror (select "[your country], [closest city]") prior to installing the package, including all additional packages that it depends on. You only need to install the package once (on a specific computer), but you must also "load" the library to access its functionality (similar to first installing an app, then opening the app to use it):
```{r load, message=FALSE}
library(lavaan)
```
Every time you start R, you need to run this command to load the `lavaan` package into the current R session. Therefore, it is advisable to start every script with this command.
Sometimes, it will be necessary to install the *development version* of `lavaan` before the next version is made available on CRAN. For example, if the developer has fixed a bug or introduced a new feature in the source code (hosted on [GitHub](https://github.com/yrosseel/lavaan/)), the development version can be installed using the `remotes` package:
```{r install dev, eval=FALSE}
install.packages("remotes") # if necessary
remotes::install_github("yrosseel/lavaan")
```
### Importing Raw Data
Typically, data are already stored in an external file with a predefined format, such as SPSS (\*.sav) or Excel (\*.xlsx), and there are R functions that can read data from such formats, although they require installing specific packages. As an alternative, you can store the data in a plain-text file (with a .txt or .csv extension), where each line represents the observed responses of one individual. Spaces, commas, semicolons, or tabs can be used to separate the individual responses. Some programs (like SPSS) can also export data to these file formats.
```{r eval=FALSE}
read.table("data.txt", header = TRUE)
foreign::read.spss("data.sav", to.data.frame = TRUE) # requires the foreign package
gdata::read.xls("data.xls")                          # requires the gdata package
as.data.frame(readxl::read_excel("data.xlsx"))       # requires the readxl package
```
This chapter uses data from a tutorial published on the [Social Change Lab's web site](http://www.socialchangelab.net/uploads/9/8/7/7/98771854/moderated_mediation_example_write-up.pdf).
```{r import data-01, message=FALSE}
dat <- foreign::read.spss("demoData/MODMED.sav", to.data.frame = TRUE)[c(2, 3, 5, 7)]
head(dat)
```
`MOOD` indicates experimental conditions
- $+1$ = positive mood induction
- $-1$ = neutral mood (control condition)
`NFC` = need for cognition, `POS` = positive thoughts, and `ATT` = attitude
### Calculate Summary Statistics from Raw Data
To report summary statistics of your modeled variables, calculate them using the `colMeans()` function for means, the `cov()` function for a covariance matrix, and the `nrow()` function for the sample size.
- *Note*: Counting the number of rows **assumes you have complete data**.
```{r summaries}
(M <- colMeans(dat)) # mean vector
(S <- cov(dat)) # covariance matrix
(N <- nrow(dat)) # complete-data sample size
```
Covariances are explained in more detail in a later section, but it is sufficient here to understand that the covariance between variables $x$ and $y$ (i.e., $\sigma_{xy}$) is their unstandardized correlation coefficient ($r_{xy}$). That is, a correlation is a covariance between $z$ scores. Because correlations are easier to interpret than covariances, it is common to report the means, *SD*s, and correlations. Using the *SD*s, readers can rescale correlations to covariances in the original units of measurement (i.e., $\sigma_{xy} = \sigma_x \times \sigma_y \times r_{xy}$). The `cov2cor()` function can standardize a covariance matrix, or you can use the `cor()` function directly.
```{r SD+cor}
(SD <- sapply(dat, sd))
sqrt(diag(S)) # SDs == square-roots of variances
cov2cor(S) # standardize the covariance matrix, or use cor()
(R <- cor(dat))
```
Because covariance and correlation matrices are symmetric (and diagonal elements of a correlation matrix are 1 by definition), both could be reported in a single table. For example, the upper triangle could be correlations, and the lower triangle (including the diagonal) could be (co)variances.
```{r cov+cor}
printS <- S
printS[upper.tri(R)] <- R[upper.tri(R)]
printS # not symmetric, correlations in upper triangle
```
#### Calculate Summary Statistics for Multiple Groups
In SEM, it is common to estimate model parameters simultaneously in multiple groups (i.e., multigroup SEM or MG-SEM). In our example data, the `MOOD` variable is a 2-group variable, so we could create a `factor` from the numeric codes:
```{r factor}
dat$mood.f <- factor(dat$MOOD, levels = c(-1, 1),
labels = c("neutral","positive"))
```
Then we can calculate the summary statistics separately within each group.
```{r group summaries}
modVars <- c("ATT","NFC","POS") # modeled variables
neuRows <- dat$mood.f == "neutral" # rows for neutral group
posRows <- dat$mood.f == "positive" # rows for positive group
gM <- list(neutral = colMeans(dat[neuRows, modVars] ),
positive = colMeans(dat[posRows, modVars] ))
gS <- list(neutral = cov( dat[neuRows, modVars] ),
positive = cov( dat[posRows, modVars] ))
gN <- table(dat$mood.f)
```
To fit a MG-SEM in `lavaan`, the summary statistics must be a list with one (mean vector or covariance matrix) per group, as well as a vector of group sample sizes.
```{r print summaries}
gM
gS
gN
```
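To preview how these objects are eventually used, the minimal sketch below shows the general form of a `lavaan` call with multigroup summary statistics. The two-equation model is only a hypothetical placeholder; model syntax is introduced in the next chapter.
```{r MG summary sketch, eval=FALSE}
## hypothetical placeholder model (model syntax is covered in the next chapter)
mod.sketch <- ' POS ~ NFC
                ATT ~ POS '
fit <- sem(mod.sketch,
           sample.cov  = gS,              # list: one covariance matrix per group
           sample.mean = gM,              # list: one mean vector per group
           sample.nobs = as.numeric(gN),  # vector of group sample sizes
           meanstructure = TRUE)          # model means as well as covariances
```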
### Importing Summary Data
#### Full Covariance Matrix
If we are importing summary data (e.g., from an article that reported them), we can type the values directly into our R script. Suppose the values above from `S` were rounded to 3 decimal places:
```{r type S values}
covVals <- c( 1.010, -0.032, 4.355, 6.841,
-0.032, 1.973, 0.795, 2.599,
4.355, 0.795, 69.263, 87.914,
6.841, 2.599, 87.914, 282.060)
covVals
```
If these values were saved as plain text, we could also read the data from a file using the `scan()` function, which returns a vector that contains the stored values in the file `"values.txt"`.
```{r read S values, eval=FALSE}
covVals <- scan("values.txt")
```
Either way, we can place these values from the full covariance matrix into a matrix using the `matrix()` function, specifying how many rows and columns there are:
```{r store S values}
(newS <- matrix(data = covVals, nrow = 4, ncol = 4))
```
We can give names to the variables in the covariance matrix by using the `dimnames=` argument of the `matrix()` function, or by using the `dimnames()` function after creating the matrix. Let's save the variable names in an object `obsnames`, then use them to assign names.
```{r dimnames}
obsnames <- c("MOOD", "NFC", "POS", "ATT")
## providing names when creating matrix
newS <- matrix(covVals, nrow = 4, ncol = 4,
dimnames = list(obsnames, obsnames))
## or assign afterward, all at once
dimnames(newS) <- list(obsnames, obsnames) # (rows, columns)
## or one dimension at a time (useful for asymmetric matrix)
rownames(newS) <- obsnames
colnames(newS) <- obsnames
```
##### Rescaling a Correlation Matrix
If the imported matrix is instead a full correlation matrix (i.e., assuming all variables have $SD=1$), then it can be transformed back into a covariance matrix using the $SD$s, so variables will be in their original units. This is important because an SEM fitted to a correlation matrix will have biased $SE$s and inflated Type I error rates unless complex constraints are imposed.
```{r type R values}
## store values from correlation matrix
corVals <- c( 1, -0.023, 0.521, 0.405,
-0.023, 1, 0.068, 0.110,
0.521, 0.068, 1, 0.629,
0.405, 0.110, 0.629, 1)
newR <- matrix(data = corVals, nrow = 4, ncol = 4,
dimnames = list(obsnames, obsnames))
newR
## store SDs as a diagonal matrix
(SDs <- diag(SD))
## transform correlations to covariances
scaled.S <- SDs %*% newR %*% SDs
round(scaled.S, 3)
## matches covariance matrix (within rounding error)
newS
```
Note that rescaling the correlation matrix to be a covariance matrix required pre- and postmultiplication by the (diagonal matrix of) $SD$s. The operator for matrix multiplication is `%*%` (the normal scalar multiplication operator `*` would simply multiply each cell of one matrix with the corresponding cell of another matrix). The table below gives an overview of commonly used matrix-algebra operations in R.
| **Symbol** | **Function** | **Example in R** |
|------------|-------------------------------------|------------------|
| $A+B$ | Addition | `A + B` |
| $A-B$ | Subtraction | `A - B` |
| (none) | Elementwise Multiplication (in R) | `A * B` |
| (none) | Elementwise Division (in R) | `A / B` |
| $AB$ | Matrix Multiplication | `A %*% B` |
| $A^{-1}$ | Inverse (enables "division" analog) | `solve(A)` |
| $A^{'}$ | Transpose (rows become columns) | `t(A)` |
| $\vert A \vert$ | Determinant (generalized variance) | `det(A)` |
To review basic matrix algebra, consult a linear algebra text or an appendix of some SEM textbooks (e.g., Bollen, 1989; Schumacker & Lomax, 2016).
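As a quick demonstration, the minimal sketch below applies these operators to a small (hypothetical) $2 \times 2$ matrix:
```{r matrix ops demo}
## a minimal sketch with a small (hypothetical) matrix
A <- matrix(c(2, 1,
              1, 3), nrow = 2, byrow = TRUE)
t(A)           # transpose
det(A)         # determinant
solve(A)       # inverse
A %*% solve(A) # multiplying by the inverse yields the identity matrix
```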
#### Upper/Lower Triangle of a Covariance Matrix
As a covariance matrix is symmetric, we do not need to provide the complete matrix. Instead, we can import only the nonredundant information from the lower (or upper) triangle of the covariance matrix, then use the `lavaan` function `getCov()` to create the full matrix. By default, `getCov()` expects the values from the lower triangle, and it will read values row-by-row (i.e., left-to-right, top-to-bottom), including the diagonal elements (`diagonal = TRUE` is the default argument).
$$
\begin{bmatrix}
1.010 & & & \\
-0.032 & 1.973 & & \\
4.355 & 0.795 & 69.263 & \\
6.841 & 2.599 & 87.914 & 282.060
\end{bmatrix}
$$
This is equivalent to reading the upper triangle column-by-column (i.e., top-to-bottom, left-to-right). One can also add a vector with the `names=` of the observed variables. It returns the complete covariance matrix, including row and column names. The following code can be used to create a full matrix that is equal to `S` and `newS` above, using only the values from the lower triangle of the matrix.
```{r lower S}
lowerS <- c( 1.010,
-0.032, 1.973,
4.355, 0.795, 69.263,
6.841, 2.599, 87.914, 282.060)
getCov(x = lowerS, names = obsnames)
```
Sometimes you will find the upper triangle reported in a published paper instead of the lower triangle. Because `getCov()` **only reads values row-by-row** (i.e., left-to-right, top-to-bottom), reading an upper triangle requires changing the argument to `lower = FALSE`.
$$
\begin{bmatrix}
1.010 & -0.032 & 4.355 & 6.841 \\
& 1.973 & 0.795 & 2.599 \\
& & 69.263 & 87.914 \\
& & & 282.060
\end{bmatrix}
$$
```{r upper S}
upperS <- c(1.010, -0.032, 4.355, 6.841,
1.973, 0.795, 2.599,
69.263, 87.914,
282.060)
getCov(x = upperS, names = obsnames, lower = FALSE)
```
##### Upper/Lower Triangle of a Correlation Matrix
Quite frequently, published papers do not provide the covariance matrix; instead, a correlation matrix and $SD$s are reported separately. The `getCov()` argument `sds=` can be used to automatically rescale the correlation matrix. Because the diagonal of a correlation matrix is always 1, it is not necessary to include the diagonal values.
$$
\begin{bmatrix}
1 & & & \\
-0.023 & 1 & & \\
0.521 & 0.068 & 1 & \\
0.405 & 0.110 & 0.629 & 1
\end{bmatrix}
$$
In this case, tell `lavaan` that the diagonal entries are omitted using the `diagonal=FALSE` argument. The following syntax creates the complete covariance matrix that was used in the previous examples from the correlations and $SD$s.
```{r lower R}
getCov(x = c(-0.023, 0.521, 0.068, 0.405, 0.110, 0.629),
diagonal = FALSE, sds = SD, names = obsnames)
```
If you are wondering whether the correlations appear in the correct order in the matrix, you can first leave the $SD$s out and check that the cells of the correlation matrix are in the correct order. If everything looks correct, then you can add the $SD$s. Covariance/correlation matrices are symmetric, so it is also important to check that the lower triangle mirrors the upper triangle (i.e., Row-$x$ Column-$y$ of the matrix should contain the same value as Row-$y$ Column-$x$). The R function `isSymmetric()` can perform this check for you:
```{r isSymmetric}
isSymmetric(S)
```
## Regression Using Matrix Algebra
This section is more technical than most remaining chapters will be. The main goal of this section is to demonstrate how GLM parameters can be estimated using raw data and summary statistics, revealing how these approaches are equivalent. Familiarity with the basics of matrix algebra will remain important throughout later chapters, not only because much of SEM is expressed in matrix notation, but also because later chapters will demonstrate certain methods (e.g., standardizing an entire matrix of regression slopes) using matrix algebra.
### Linear Regression Models {#ch1-3-1}
The GLM expresses an outcome $y$ as a linear combination (weighted sum) of predictors:
```{=tex}
\begin{align*}
y_i &= \beta_0 + \beta_1 x_{i1} + \ldots + \beta_p x_{ip} &+& \varepsilon_i \\
&= \beta_0 + \sum_{j=1}^p \beta_j x_{ij} &+& \varepsilon_i \sim \mathcal{N}(0, \sigma)
\end{align*}
```
- The *deterministic* component of $y$ can be predicted or explained by $X$
- The *stochastic* component of $y$ includes all other unmodeled effects (omitted variables) and sources of error ($\varepsilon$)
This is analogous to a recipe for the outcome: How much of $x_1$ and $x_2$ do you need to recreate $y$?
The notation above uses a subscript *i* as an "index variable" for subject $i = 1, \dots, N$. There are other notations, such as the full matrix notation showing each subject's values of the outcome $y$ and predictors $j = 1, \dots, p$ in columns of the matrix $\mathbf{X}$:
```{=tex}
\begin{equation*}
\begin{bmatrix}
y_1 \\
y_2 \\
\vdots \\
y_N
\end{bmatrix} =
\begin{bmatrix}
1 & x_{1,1} & x_{1,2} & \dots & x_{1,p} \\
1 & x_{2,1} & x_{2,2} & \dots & x_{2,p} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
1 & x_{N,1} & x_{N,2} & \dots & x_{N,p}
\end{bmatrix}
\begin{bmatrix}
\beta_0 \\
\beta_1 \\
\beta_2 \\
\vdots \\
\beta_p
\end{bmatrix} +
\begin{bmatrix}
\varepsilon_1 \\
\varepsilon_2 \\
\vdots \\
\varepsilon_N
\end{bmatrix}
\end{equation*}
```
The shorthand notation simply uses bolded symbols for the matrices above: $\mathbf{y} = \mathbf{X} \mathbf{\beta} + \mathbf{\varepsilon}$
The ordinary least-squares (OLS) estimator of $\beta$ minimizes the sum of squared residuals (equivalent to maximizing $R^2$):
$$\beta = (\mathbf{X}^{'} \mathbf{X})^{-1} \mathbf{X}^{'} \mathbf{y}$$
The OLS estimator operates on the raw data: the matrix of predictors ($\mathbf{X}$) and the vector of outcomes ($\mathbf{y}$). But recall from the previous section that if the data are mean-centered, then $\mathbf{X}^{'} \mathbf{X}$ (divided by $N-1$) is the covariance matrix of the predictors. In that case, there would be no vector of ones in the first column of $\mathbf{X}$, and the first element of $\beta$ (the intercept) can be omitted because it would be zero whenever $\mathbf{X}$ and $\mathbf{y}$ are mean-centered. The syntax examples below calculate $\mathbf{\Sigma}_\mathbf{X}$ both manually and using the `cov()` function to demonstrate their equivalence.
```{r cov matrix X}
## Manual calculation
n <- nrow(dat) - 1 # N - 1: the denominator of a sample (co)variance
Xc <- scale(dat[c("MOOD","NFC","POS")],
center = TRUE, scale = FALSE)
t(Xc) %*% Xc / n
## Automated function
cov(dat[c("MOOD","NFC","POS")])
```
Likewise, $\mathbf{X}^{'} \mathbf{y}$ (divided by $N-1$) captures the covariances of the predictors with the outcome:
```{r cov Xy}
## Manual calculation
t(Xc) %*% dat$ATT / n # centering ATT is unnecessary: each column of Xc sums to zero
## Automated function
cov(x = dat[c("MOOD","NFC","POS")],
y = dat$ATT)
```
Thus, even without access to raw data, the OLS estimates of slopes can be obtained using the data's summary statistics:
```{r beta from cov}
Xcov <- cov(dat[c("MOOD","NFC","POS")])
Xy <- cov(x = dat[c("MOOD","NFC","POS")],
y = dat$ATT)
solve(Xcov) %*% Xy # calculate slopes from (co)variances
```
These slopes match the estimates from R's built-in linear-modeling function:
```{r multiple reg}
mod <- lm(ATT ~ MOOD + NFC + POS, data = dat)
coef(mod)
```
### Intercepts and Means
The `lm()` output above also provides an intercept because the data were not centered. In this case, the matrix $X$ includes a constant (all 1s) in the first column, so its covariance matrix no longer resembles the covariance matrix of the predictors alone. However, the covariation among predictors (and the outcome) is still captured, and the column of 1s effectively "partials out" the mean from the predictors and outcome.
In the simplest case without any predictors (an intercept-only model), it is easier to see how the vector of 1s captures mean-structure information. Recall the formula for an arithmetic mean is the sum of $y$ scores, divided by the number of $y$ scores: $\bar{y} = \frac{1}{N} \sum y$. For the intercept-only model, the intercept ($\beta_0$) is the mean:
- $(\mathbf{X}^{'} \mathbf{X})^{-1} = (\mathbf{1}^{'} \mathbf{1})^{-1} = \frac{1}{N}$
- $\mathbf{X}^{'} \mathbf{y} = \sum y$
The syntax example below demonstrates the equivalence:
```{r mean from OLS}
## using matrix algebra
y <- dat$ATT
ONE <- rep(1, times = length(y))
solve(t(ONE) %*% ONE) %*% t(ONE) %*% y
## fitted linear model with only an intercept
coef(lm(y ~ 1))
## calculate the mean
mean(y)
```
Finally, we will apply the OLS formula ($\beta = (\mathbf{X}^{'} \mathbf{X})^{-1} \mathbf{X}^{'} \mathbf{y}$) to the full model with uncentered raw data (i.e., intercept included):
```{r XtXiXtY}
X <- cbind(`(Intercept)` = ONE, # first column is constant
as.matrix(dat[c("MOOD","NFC","POS")]))
solve(t(X) %*% X) %*% t(X) %*% y # beta
```
This again matches the results from the `lm()` function:
```{r repeat reg}
coef(mod)
```
When we add a predictor to the model, the intercept becomes a conditional mean (i.e., the expected value when each predictor = 0) rather than the "marginal" or "grand" mean.
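To see this connection concretely, the minimal sketch below reuses the mean-centered predictor matrix `Xc` created above: when every predictor is centered, "each predictor = 0" describes the average case, so the intercept equals the grand mean (and the slopes are unchanged by centering).
```{r centered intercept}
## with mean-centered predictors, the intercept equals the grand mean of the outcome
coef( lm(dat$ATT ~ Xc) ) # slopes match the uncentered model
mean(dat$ATT)
```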
### Categorical Predictors
Recall also that groups $g = 1, \ldots, G$ can be represented using numeric codes (very often 0s and 1s), such as dummy codes. For example, a dummy code for Group $g$ indicates whether someone belongs to Group $g$ (1) or not (0). Alternative coding strategies include orthogonal contrast codes and effects coding, the latter of which is how our `MOOD` variable was coded to indicate experimental conditions:
- $+1$ = positive mood induction
- $-1$ = neutral mood (control condition)
```{r MOOD table}
table(dat$MOOD)
```
The algebra is identical, capturing how dummy codes covary with each other and with the outcome. However, the set of all $G$ dummy codes would be perfectly multicollinear with the constant used to estimate the intercept (i.e., the column of 1s is exactly equal to the sum of the dummy-coded columns). This is why we typically estimate slopes for only $G-1$ dummy codes, treating the omitted group as the reference group. Thus, each dummy code's slope represents how that group's mean differs from the reference group's mean.
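For example, a brief sketch using the `mood.f` factor created earlier (with R's default treatment/dummy coding, which takes the first level, `neutral`, as the reference group):
```{r dummy coding demo}
coef( lm(POS ~ mood.f, data = dat) ) # intercept = neutral-group mean; slope = mean difference
tapply(dat$POS, dat$mood.f, mean)    # group means for comparison
```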
#### Regression with Group-Specific Intercepts
Rather than choosing a baseline group, we can omit the intercept from the model by dropping the vector of 1s from $X$. This prevents multicollinearity among the dummy codes and the constant, so `lm()` will include all $G$ dummy codes (not $G-1$).
```{r no intercept}
dat$mood.f <- factor(dat$MOOD, levels = c(-1, 1),
labels = c("neutral","positive"))
coef( lm(POS ~ -1 + mood.f, data = dat) )
```
Essentially, this model gives each group its own intercept, which is also the group mean when there are no other predictors in the model.
```{r group means}
aggregate(POS ~ mood.f, data = dat, FUN = mean)
```
Later in this course, you will learn about *multigroup SEM*, which is when the same SEM is fitted separately (but simultaneously) to 2 or more groups. Multigroup SEM is analogously advantageous, but (unlike in the GLM) variances can also differ across groups, so we do not need to assume homoskedasticity when using multigroup SEM.
Slopes can also be estimated separately per group when the intercept is dropped from the model. This is specified by multiplying the predictor by each dummy code.
```{r group slopes}
## no "main/simple effect" of NFC without a reference group
mod1 <- lm(POS ~ -1 + mood.f + mood.f:NFC, data = dat)
coef(mod1)
```
This is equivalent to estimating the effect of `NFC` separately in each group.
```{r separate groups}
mod.neu <- lm(POS ~ NFC, data = dat,
subset = dat$mood.f == "neutral")
coef(mod.neu)
mod.pos <- lm(POS ~ NFC, data = dat,
subset = dat$mood.f == "positive")
coef(mod.pos)
```
We will discuss (dis)advantages of the multigroup approach in the following chapter, after introducing regression models as SEM.
## Summary
This chapter introduced the idea of analyzing mean and covariance structures, which partition descriptive statistics (means and (co)variances) into meaningful components. This enables researchers to impose a hypothesized structure on why variables are related to each other. The next chapter will introduce the SEM framework, which was designed for researchers to specify such theoretical structures as statistical models, and fit them to data in order to estimate and test parameters of interest.
## References {-}
Bollen, K. A. (1989). *Structural equations with latent variables*. Wiley. <https://doi.org/10.1002/9781118619179>
Rosseel, Y. (2012). `lavaan`: An R package for structural equation modeling. *Journal of Statistical Software, 48*(2), 1--36. <https://doi.org/10.18637/jss.v048.i02>
Schumacker, R. E., & Lomax, R. G. (2016). *A beginner's guide to structural equation modeling* (4th ed.). Lawrence Erlbaum. <https://doi.org/10.4324/9781315749105>