From Johnson (1988):
PCA (Principal Components Analysis) involves a mathematical procedure that transforms a set of correlated variables into a smaller set of uncorrelated variables called principal components (PCs). These PCs are linear combinations of the original variables and can be thought of as "new" variables.
- Data screening
  Plotting can help to identify outliers, but it may be difficult when there are many variables. PCA helps view the data in a smaller # of dimensions, which may make outliers easier to identify.
- Clustering
  Determine how to group like observations.
- Predict classifications
  When classifications of the observations are already known, PCA can be used to help develop ways to predict classifications.
- Regression analysis
  When independent variables are highly correlated with each other, "intercorrelation" or "multicollinearity" is said to exist. This can cause estimates of the $\hat{\beta}$'s to be unstable (meaning that from sample to sample, the $\hat{\beta}$'s have a lot of variability). Therefore, examining the $\hat{\beta}$'s to interpret how an independent variable and the dependent variable are related may not give good results.
  When variables are highly correlated, they can often be represented just as well with a smaller set of variables using PCA. However, not everyone agrees this is good to do; see Hadi and Ling (American Statistician, 1998).
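As a rough illustration of the regression point above, here is a minimal R sketch of principal components regression, assuming a hypothetical data frame `dat` with a response `y` and correlated predictors `x1`–`x4`; the collinear predictors are replaced by a few PC scores.

```r
X   <- scale(dat[, c("x1", "x2", "x3", "x4")])  # standardize the correlated predictors
pca <- prcomp(X)                                # PCA on the standardized predictors
q   <- 2                                        # keep the first q PCs
fit <- lm(dat$y ~ pca$x[, 1:q])                 # regress y on the (uncorrelated) PC scores
summary(fit)
```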
PCA is used primarily as an exploratory technique to be followed up with other analyses. The objectives of PCA are to
- Discover the true dimension of the data.
  If p-dimensional data (# of columns = p) can be represented in q < p dimensions (PC1, PC2, ..., PCq) without losing much information, then do it.
- Try to interpret the PCs ("new" variables).
  The PCs are linear combinations of the "original" variables. The weights on each of the original variables may give meaning to the PCs. For example, a weight of 0 could mean that a particular original variable is not important to the new variable. Finding interpretations for the PCs is often very difficult.
Characteristics of the PCs:
- Uncorrelated
- First principal component accounts for as much variability in the data as possible
- Each successive principal component accounts for as much of the remaining variability in the data as possible, while being uncorrelated with the earlier PCs
Let $X \sim (\mu, \Sigma)$. The first PC is $y_1 = a^{T}_{1} X$, where $a_1$ is a $p \times 1$ eigenvector corresponding to the largest eigenvalue $\lambda_1$ of $\Sigma$ (scaled so that $a^{T}_{1} a_1 = 1$).
Note that $Var[a^{T}_{1} X] = a^{T}_{1} \Sigma a_1 = \lambda_1$, the largest eigenvalue (Comp.1) from the covariance matrix. Because the variance of the first PC equals the largest eigenvalue, PC1 accounts for as much variability in the data as possible.
In general, the jth PC is $y_j = a^{T}_{j} X$, where $a_j$ is the eigenvector of $\Sigma$ corresponding to the jth largest eigenvalue $\lambda_j$, and $Var[y_j] = \lambda_j$.
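A minimal R sketch of this result, assuming a hypothetical n x p numeric data matrix `X`: the first PC direction is the eigenvector of the sample covariance matrix with the largest eigenvalue, and the variance of the resulting scores equals that eigenvalue.

```r
S  <- cov(X)                # sample covariance matrix
e  <- eigen(S)              # eigenvalues (in decreasing order) and eigenvectors
a1 <- e$vectors[, 1]        # a_1: eigenvector for the largest eigenvalue lambda_1
y1 <- as.matrix(X) %*% a1   # scores on the first PC
var(as.vector(y1))          # equals e$values[1], i.e. lambda_1
```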
Why are we concerned about explaining variability? In statistics, we often equate variability with information. The more variability you understand about a data set, the more information you know about it. Refer back to when you were introduced to ANOVA or regression methods, where explaining variability was the focus.
The PCs are orthogonal because the eigenvectors are orthogonal.
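Continuing the sketch above (`X`, `S`, and `e` as before), a quick numerical check that the eigenvectors are orthonormal and the PC scores are uncorrelated:

```r
round(crossprod(e$vectors), 10)        # eigenvectors are orthonormal: the p x p identity matrix
scores <- as.matrix(X) %*% e$vectors   # scores on all p PCs
round(cov(scores), 10)                 # diagonal matrix diag(lambda_1, ..., lambda_p)
```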
The total variance is the sum of the diagonal elements of $\Sigma$:
$\sigma_{11} + \sigma_{22} + \cdots + \sigma_{pp} = tr(\Sigma) = \lambda_1 + \lambda_2 + \cdots + \lambda_p$
This can be thought of as a measure of the total variance of the original variables.
We'd like to find the smallest # of PCs, q, such that
$\frac{\lambda_1 + \cdots + \lambda_q}{\lambda_1 + \cdots + \lambda_p}$
is close to 1, i.e., the first q PCs account for most of the total variance.
[Plot of the eigenvalues (scree plot) omitted.]
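A small R sketch of this step (same hypothetical `S` and `e` as above): compute the cumulative proportion of total variance and draw a scree plot to help pick q.

```r
lambda <- e$values                     # eigenvalues lambda_1 >= ... >= lambda_p
cumsum(lambda) / sum(lambda)           # cumulative proportion of total variance explained
plot(lambda, type = "b",
     xlab = "Component number", ylab = "Eigenvalue")  # scree plot
```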
- If the original set of variables is already uncorrelated, PCA will not help.
- PCA doesn't generally eliminate variables because the PCs are linear combinations of the original variables.
- The original variables need to be measured in the same units and have similar variances.
Remember that PCA relies heavily on examining the variances of the original variables. Larger variance variables will dominate the other variables in the analysis.
For example, suppose there are three variables x1, x2, and x3 with variances of 98, 1.9, and 0.1, respectively. A way to maximize the variance of the first PC is to have it consist primarily of x1, since x1 alone accounts for 98/(98 + 1.9 + 0.1) = 98% of the total variance.
In general, a solution to this problem is to use standardized data, e.g. scale(matrix) in R. Equivalently, use the correlation matrix P in place of the covariance matrix, because P is the covariance matrix of the standardized random variables.
PCA is most often performed using the correlation matrix P rather than the covariance matrix, to eliminate the problem caused by variables being measured on different numerical scales.
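In R this choice is just an argument, as in the sketch below (hypothetical data matrix `X`): `scale. = TRUE` in prcomp(), or `cor = TRUE` in princomp(), is equivalent to running PCA on the correlation matrix P.

```r
pc_cov  <- prcomp(X, scale. = FALSE)   # PCA on the covariance matrix
pc_cor  <- prcomp(X, scale. = TRUE)    # PCA on the correlation matrix (standardized variables)
pc_cor2 <- princomp(X, cor = TRUE)     # same choice with princomp()
```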
For each observation, we calculate the j^th principal component value or score as:
$\hat{y}_{rj} = \hat{a}^{T}_{j} z_r$ (this is for a PCA on the correlation matrix), where $\hat{y}_{rj}$ is the score on the jth PC for the rth observation and $z_r$ is the vector of standardized values for the rth observation.
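A minimal sketch of this score calculation in R (hypothetical data matrix `X`): standardize the data, take the eigenvectors of the correlation matrix, and multiply.

```r
Z      <- scale(X)               # rows are the standardized observation vectors z_r
A      <- eigen(cor(X))$vectors  # column j is the eigenvector a_j of P
scores <- Z %*% A                # entry (r, j) is the score y_rj = a_j' z_r
head(scores)
# These agree with princomp(X, cor = TRUE)$scores up to sign and the N vs. N-1 divisor.
```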
princomp output
Standard deviations:
Comp.1, Comp.2, ..., Comp.p correspond to PC1, PC2, ..., PCp.
The squared standard deviation of Comp.1 equals the first (largest) eigenvalue.
Loadings:
|           | Comp.1 |
|-----------|--------|
| variable1 | 1.xx   |
| variable2 | 1.xx   |
| variable3 | 1.xx   |
The Comp.1 column of the loadings is the corresponding eigenvector.
Note: the default calculation of princomp uses the divisor N (biased estimate) for the covariance matrix instead of N - 1 (unbiased estimate).
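A quick way to see the divisor difference (hypothetical matrix `X` with N rows): compare princomp() with prcomp(), which uses N - 1.

```r
p1 <- princomp(X)    # standard deviations based on the divisor N
p2 <- prcomp(X)      # standard deviations based on the divisor N - 1
p1$sdev / p2$sdev    # every ratio equals sqrt((N - 1) / N)
```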
Reference: Applied Multivariate Statistical Analysis lecture notes by Prof. Chris Bilder.
PC1 accounts for the most variation in the data, whereas LDA aims to maximize the separation between the groups. (Both construct new axes as linear combinations of the observed variables, with the weights given by elements of eigenvectors.)
PCA
Looks for the directions (combinations of the original variables) with the most variation in the data.
LDA
(StatQuest: https://youtu.be/azXCzI57Yfc)
Focuses on maximizing the separability among known categories. For two groups g1 and g2, we are interested in maximizing the separability between them.
With $\mu_{g1}, \mu_{g2}$ = the means of group 1 and group 2, and $s_{g1}^2, s_{g2}^2$ = the scatter (variance) within group 1 and group 2, LDA finds the axis that maximizes
$\frac{(\mu_{g1} - \mu_{g2})^2}{s_{g1}^2 + s_{g2}^2}$
i.e., we maximize the numerator (separation between the group means) and minimize the denominator (scatter within each group).
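A hedged R sketch of this ratio for a single candidate axis `w` (hypothetical matrices `X1` and `X2`, one row per observation in each group); in practice MASS::lda() finds the axis LD1 that maximizes this kind of between-group to within-group ratio.

```r
lda_ratio <- function(w, X1, X2) {
  y1 <- X1 %*% w                     # projections of group 1 onto the axis w
  y2 <- X2 %*% w                     # projections of group 2 onto the axis w
  (mean(y1) - mean(y2))^2 / (var(as.vector(y1)) + var(as.vector(y2)))
}
# e.g. library(MASS); fit <- lda(group ~ ., data = dat)  # 'dat' is hypothetical
```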
Both rank the new axes in order of importance.
PC1 accounts for the most variation in the data.
LD1 (the first new axis that LDA creates) accounts for the most variation between the categories.
If the experiment is well controlled and has worked well, we should find that replicate samples cluster closely, whilst the greatest sources of variation in the data should be between treatments/sample groups. PCA is also an incredibly useful tool for checking for outliers and batch effects.
If samples from different groups form separate clusters, this indicates that the differences between groups are larger than those within groups. The biological signal of interest is stronger than the noise (biological and technical) and can be detected.
In scRNA-seq we aim to compare cells based on their expression across genes, e.g. to identify similar transcriptomic profiles. Each gene then represents a dimension of the data.
The expression of different genes is correlated if the genes are affected by the same biological process. The information in these individual genes does not need to be stored separately; it can instead be compressed into a single dimension, e.g. an "eigengene". This compression:
- reduces the computational work in downstream analyses to only a few dimensions
- reduces noise by averaging across multiple genes to obtain a more precise representation of the patterns in the data.
In PCA, the first axis (PC1) is chosen such that it captures the greatest variance across cells in scRNA-seq. The next PC must be orthogonal to PC1 and capture the greatest remaining amount of variation, and so on.
By applying PCA to scRNA-seq data, we assume that multiple genes are affected by the same biological processes in a coordinated way, while random technical or biological noise affects each gene independently. Because more variation can be captured by considering the correlated behaviour of many genes, the top PCs are likely to represent the biological signal, and the noise is concentrated in the later PCs. The dominant factors of heterogeneity are then captured by the top PCs.
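A minimal R sketch of this idea, assuming a hypothetical matrix `logcounts` of log-normalized expression values with genes as rows and cells as columns (dedicated tools such as Seurat or scran wrap this step):

```r
pca    <- prcomp(t(logcounts), center = TRUE, scale. = FALSE)  # cells as rows
top_pc <- pca$x[, 1:20]           # keep, e.g., the first 20 PCs (assuming at least 20 exist)
plot(pca$x[, 1], pca$x[, 2],
     xlab = "PC1", ylab = "PC2")  # quick look for clusters, outliers, batch effects
```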