This means we are happy to adopt the original reduced model.

Diagnostics Summary

The main tools we will use to validate the regression assumptions are:

- Plots involving standardized residuals and/or fitted values.
- Determine which points are leverage points.
- Determine which (if any) of the data points are outliers.
- Assess the effect of each predictor variable, having adjusted for the effects of the other predictor variables, using added variable plots.
- Assess the extent of collinearity among the predictor variables using variance inflation factors.
- Examine whether the assumptions of normally distributed errors and constant error variance are reasonable.

Leverage Points and Residuals

Recall from simple linear regression that leverage points are points with extreme values of the predictor variable. In matrix form, the fitted values are,

$$
\mathbf{\hat{Y}} = \mathbf{X} \hat{\beta} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T \mathbf{Y} = \mathbf{H} \mathbf{Y},
$$

where $\mathbf{H} = \mathbf{X}(\mathbf{X}^T \mathbf{X})^{-1} \mathbf{X}^T$ (also called the hat matrix).

Let $h_{ij}$ be the $(i, j)$-entry of $\mathbf{H}$; then,

$$
\hat{Y_i} = h_{ii} Y_i + \sum_{j \neq i} h_{ij} Y_j
$$

The rule of thumb for flagging leverage points is,

$$
h_{ii} > 2 \times \text{average}(h_{ii}) = 2 \times \frac{p+1}{n}
$$

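As a minimal sketch (assuming a fitted lm object, here called m1 as in the defective-rates example later in these notes), the hat values and the rule of thumb can be checked in R as follows.

```r
# Hat values h_ii (the diagonal of the hat matrix) for a fitted model m1
h <- hatvalues(m1)

# Flag points with h_ii > 2(p + 1)/n
n <- length(h)
p <- length(coef(m1)) - 1      # number of predictors
which(h > 2 * (p + 1) / n)     # indices of potential leverage points
```
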
The residuals are defined as,

$$
\mathbf{\hat{e}} = \mathbf{Y} - \mathbf{\hat{Y}} = (\mathbf{I} - \mathbf{H}) \mathbf{Y}
$$

We can show that,

$$
\text{Var}(\mathbf{\hat{e}} | \mathbf{X}) = \sigma^2(\mathbf{I} - \mathbf{H}).
$$

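One short way to see this, using $\text{Var}(\mathbf{Y} | \mathbf{X}) = \sigma^2 \mathbf{I}$ and the fact that $\mathbf{I} - \mathbf{H}$ is symmetric and idempotent, is,

$$
\text{Var}(\mathbf{\hat{e}} | \mathbf{X})
= (\mathbf{I} - \mathbf{H}) \, \text{Var}(\mathbf{Y} | \mathbf{X}) \, (\mathbf{I} - \mathbf{H})^T
= \sigma^2 (\mathbf{I} - \mathbf{H})(\mathbf{I} - \mathbf{H})^T
= \sigma^2 (\mathbf{I} - \mathbf{H})
$$
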
It can also be shown that the standardized residuals are,

$$
r_i = \frac{\hat{e_i}}{s \sqrt{1 - h_{ii}}}
$$

where,

$$
s = \sqrt{\frac{\sum_{i=1}^n \hat{e_i}^2}{n - p - 1}}
$$

Any pattern in a residual plot indicates that an incorrect model has been fitted, but the pattern in general does not provide direct information on how the model is misspecified.

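As a sketch (again assuming the fitted model m1 from the defective-rates example below), the standardized residuals and the usual residuals-versus-fitted plot can be produced with,

```r
# Standardized residuals r_i = e_i / (s * sqrt(1 - h_ii))
r <- rstandard(m1)

# Plot standardized residuals against fitted values and look for patterns
plot(fitted(m1), r,
     xlab = "Fitted values", ylab = "Standardized residuals")
abline(h = 0, lty = 2)
```
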
Added Variable Plots

Assume we originally have the model,

$$
\mathbf{Y} = \mathbf{X} \beta + \mathbf{e}.
$$

With an additional predictor, we now consider,

$$
\mathbf{Y} = \mathbf{X} \beta + \mathbf{Z} \alpha + \mathbf{e},
$$

where,

$$
\mathbf{Z} =
\begin{bmatrix}
z_1 \newline
z_2 \newline
\vdots \newline
z_n
\end{bmatrix}
$$

The procedure is as follows (an R sketch is given after the list).

- Perform the regression
  $$
  \mathbf{Y} = \mathbf{X} \beta + \mathbf{e}
  $$
  to get the residuals $\mathbf{e}_{\mathbf{Y}.\mathbf{X}}$.
- Perform the regression
  $$
  \mathbf{Z} = \mathbf{X} \beta + \mathbf{e}
  $$
  to get the residuals $\mathbf{e}_{\mathbf{Z}.\mathbf{X}}$.
- Plot $\mathbf{e}_{\mathbf{Y}.\mathbf{X}}$ (on the $y$-axis) against $\mathbf{e}_{\mathbf{Z}.\mathbf{X}}$ (on the $x$-axis).

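A minimal sketch of this procedure in R, using the variable names from the defective-rates example below and treating Rate as the additional predictor $\mathbf{Z}$ (an illustrative choice, not specified in the notes):

```r
# Added variable plot for Rate, adjusting for Temperature and Density
e_Y.X <- residuals(lm(Defective ~ Temperature + Density))   # residuals of Y on X
e_Z.X <- residuals(lm(Rate ~ Temperature + Density))        # residuals of Z on X

plot(e_Z.X, e_Y.X,
     xlab = "Rate adjusted for the other predictors",
     ylab = "Defective adjusted for the other predictors")

# The car package produces the same plots directly: library(car); avPlots(m1)
```
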
Transforming Only $Y$ Using the Inverse Response Plot

Assume the true model is actually,

$$
Y = g(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e),
$$

so the inverse model is,

$$
g^{-1}(Y) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e
$$

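For example (an illustration, not taken from the notes), if the true model has $g = \exp$, the inverse model is obtained by taking logarithms,

$$
Y = \exp(\beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e)
\quad \Longrightarrow \quad
\log Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e,
$$

so fitting a linear model to $\log Y$ recovers the linear structure.
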
To use the inverse response plot, an important assumption is that the predictors are pairwise linearly related.

Example: Defective Rates

We want to develop a model for the number of defectives based on $x_1$: temperature, $x_2$: density, and $x_3$: production rate.

To check the pairwise linear relationships, we can use a scatterplot matrix.

pairs(~ Defective + Temperature + Density + Rate)

To get the inverse response plot, we can use the inverseResponsePlot() function from the car package.

library(car)
m1 <- lm(Defective ~ Temperature + Density + Rate)
inverseResponsePlot(m1)

Collinearity of Predictors

When highly correlated predictor variables are included, they are effectively carrying very similar information about the response variable.

Thus, it is difficult for least squares to distinguish their separate effects on the response variable.

As a result, some of the coefficients in the regression model may have the opposite sign to what is expected.

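A small simulated illustration of this behaviour (artificial data, not from the notes):

```r
# Two nearly identical predictors: least squares cannot separate their effects
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.01)        # x2 is almost a copy of x1
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

coef(lm(y ~ x1 + x2))   # individual estimates are very unstable and can even change sign
coef(lm(y ~ x1))        # dropping one predictor gives a stable combined estimate (about 4)
```
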
Consider the multiple regression model,

$$
Y = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p + e.
$$

Let $R_j^2$ be the coefficient of determination $R^2$ obtained when regressing $x_j$ on the other predictors.

Then it can be shown that,

$$
\text{Var}(\hat{\beta_j}) = \frac{1}{1 - R_j^2} \cdot \frac{\sigma^2}{(n - 1) S_{x_j}^2}
$$

The quantity $\frac{1}{1 - R_j^2}$ is called the variance inflation factor (VIF).

A rough guide for identifying a large VIF is to use the cut-off value 5.

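A sketch of how the variance inflation factors can be computed for the defective-rates model m1 above, using the vif() function from the car package:

```r
library(car)

# Variance inflation factor 1 / (1 - R_j^2) for each predictor in m1
vif(m1)

# Predictors exceeding the rough cut-off of 5
vif(m1)[vif(m1) > 5]
```
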
What do you do when collinearity exists?

- Do nothing, but be careful when interpreting the individual coefficients.
- Remove highly correlated variables (keep only one of them).