PCA

Principal Component Analysis (PCA)

Principal Component Analysis (Jolliffe, 2002) is primarily used to explore a single type of ‘omics data (e.g. transcriptomics, proteomics, metabolomics) and identify the largest sources of variation. PCA is a mathematical procedure that applies an orthogonal linear transformation to convert data from possibly correlated variables into uncorrelated principal components (PCs). The first principal component explains as much of the variability in the data as possible, and each subsequent PC explains as much of the remaining variability as possible. Only the PCs that explain the most variance are retained, which is why choosing the number of dimensions or components (ncomp) is crucial (see the function tune.pca below).

In mixOmics, PCA is numerically solved in two ways:

1. With singular value decomposition (SVD) of the data matrix, which is the most computationally efficient approach and is also the one adopted by most software, including the R function prcomp in the stats package.

2. With the Non-linear Iterative Partial Least Squares (NIPALS) algorithm in the case of missing values, which uses an iterative power method (see Methods: Missing values).

Both methods are implemented in the mixOmics pca function: NIPALS is used when the data contain missing values, and SVD otherwise.
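As a rough illustration of the SVD route (a minimal base R sketch, not the mixOmics implementation), the scores, loadings and explained variance can be recovered directly from the SVD of the centered data matrix:

# Minimal sketch: PCA via SVD on a complete (no missing values) matrix
X0 <- scale(matrix(rnorm(100 * 10), nrow = 100), center = TRUE, scale = FALSE)
s  <- svd(X0)
scores    <- s$u %*% diag(s$d)   # principal component scores
loadings  <- s$v                 # loading vectors (cf. object$rotation)
explained <- s$d^2 / sum(s$d^2)  # proportion of variance explained per PC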

Input data should be centered (center = TRUE) and possibly (sometimes preferably) scaled so that all variables have unit variance. Scaling is especially advised when the variance is not homogeneous across variables (scale = TRUE). By default, the function centers the variables, and the user is free to choose other options.

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene # Using one data set only
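Before deciding whether to scale, a quick look at the spread of the per-gene variances can help; this is only a simple diagnostic, not part of the mixOmics workflow:

# Very heterogeneous variances across genes suggest using scale = TRUE
summary(apply(X, 2, var))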

Choosing the optimal parameters

We can obtain as many dimensions (i.e. number of PCs) as the minimum between the number of samples and the number of variables. However, the goal is to reduce the complexity of the data and therefore to summarise the data in fewer underlying dimensions.

The number of principal components to retain (also called the number of dimensions) is therefore crucial when performing PCA. The function tune.pca will plot a barplot of the proportion of explained variance for min(n, p) principal components, where n is the number of samples and p the number of variables.

tune.pca(X, ncomp = 10, center = TRUE, scale = FALSE)
## Eigenvalues for the first 10 principal components, see object$sdev^2: 
##        PC1        PC2        PC3        PC4        PC5        PC6 
## 17.9714164  9.0792340  4.5677094  3.2043829  1.9567988  1.4686086 
##        PC7        PC8        PC9       PC10 
##  1.3281206  1.0820554  0.8434155  0.6373565 
## 
## Proportion of explained variance for the first 10 principal components, see object$explained_variance: 
##        PC1        PC2        PC3        PC4        PC5        PC6 
## 0.35684128 0.18027769 0.09069665 0.06362638 0.03885429 0.02916076 
##        PC7        PC8        PC9       PC10 
## 0.02637122 0.02148534 0.01674690 0.01265538 
## 
## Cumulative proportion explained variance for the first 10 principal components, see object$cum.var: 
##       PC1       PC2       PC3       PC4       PC5       PC6       PC7 
## 0.3568413 0.5371190 0.6278156 0.6914420 0.7302963 0.7594570 0.7858283 
##       PC8       PC9      PC10 
## 0.8073136 0.8240605 0.8367159 
## 
##  Other available components: 
##  -------------------- 
##  loading vectors: see object$rotation

(Barplot of the proportion of explained variance per principal component, as produced by tune.pca.)

Given the barplot output above, we can choose 2 to 3 principal components for the final analysis.
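Equivalently, the cumulative proportion of explained variance (the object$cum.var slot shown above) can be queried directly; the snippet below is a sketch using an arbitrary 60% threshold:

tune <- tune.pca(X, ncomp = 10, center = TRUE, scale = FALSE)
# smallest number of PCs reaching 60% of the total variance
# (returns 3 here, consistent with the cumulative values printed above)
which(tune$cum.var >= 0.6)[1]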

PCA

result <- pca(X, ncomp = 3, center = TRUE, scale = FALSE)
result
## Eigenvalues for the first 3 principal components, see object$sdev^2: 
##       PC1       PC2       PC3 
## 17.971416  9.079234  4.567709 
## 
## Proportion of explained variance for the first 3 principal components, see object$explained_variance: 
##        PC1        PC2        PC3 
## 0.35684128 0.18027769 0.09069665 
## 
## Cumulative proportion explained variance for the first 3 principal components, see object$cum.var: 
##       PC1       PC2       PC3 
## 0.3568413 0.5371190 0.6278156 
## 
##  Other available components: 
##  -------------------- 
##  loading vectors: see object$rotation
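The fitted pca object can then be passed to the usual mixOmics graphical functions, for example plotIndiv for a sample plot (here colouring samples by dose group, assuming Dose.Group is a column of liver.toxicity$treatment):

# Sample plot on the first two PCs, coloured by dose group
plotIndiv(result, comp = c(1, 2),
          group = liver.toxicity$treatment$Dose.Group,
          legend = TRUE, title = 'Liver toxicity: PCA, comp 1 - 2')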

Sparse Principal Component Analysis (sPCA)

sPCA (Shen and Huang, 2008) is based on singular value decomposition and is appropriate for large data sets. As implemented in mixOmics, ‘sparsity’ is achieved via LASSO penalisation. sPCA is useful to remove some of the non-informative variables in PCA, and can be used to investigate whether ‘tighter’ sample clusters can be obtained and which variables contribute most to each PC.

For sPCA, the number of variables to select on each PC must be specified by the user (keepX). Tuning keepX based on the amount of explained variance is difficult (the fewer the variables, including noisy variables, the less variance is explained). Since sPCA is an unsupervised, exploratory technique, we prefer to let the user select a keepX suited to the research question. The following example uses an arbitrary keepX to select the top (10, 5, 15) genes that contribute most to the variance in the data on PCs 1, 2 and 3. The function selectVar lists the variables selected on the first PC (comp = 1) and outputs their weights in the associated loading vector:

spca.result <- spca(X, ncomp = 3, center = TRUE, scale = TRUE, 
                    keepX = c(10, 5, 15))
#spca.result

selectVar(spca.result, comp = 1)$value
##                 value.var
## A_43_P16829  -0.528762748
## A_42_P680505 -0.428099280
## A_43_P20475  -0.411956357
## A_43_P11409  -0.336141163
## A_43_P21269  -0.331970291
## A_42_P814129 -0.245442052
## A_43_P20891  -0.220272590
## A_43_P20281  -0.175688144
## A_43_P14037  -0.067696801
## A_42_P751969 -0.005174183
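Because keepX = c(10, 5, 15) was requested, selectVar can also serve as a quick sanity check of how many variables were retained on each component (selectVar also returns the selected variable names in $name):

# Number of variables selected on each component;
# this should match keepX, i.e. 10, 5 and 15
sapply(1:3, function(k) length(selectVar(spca.result, comp = k)$name))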

Case study

See Case Study: PCA Multidrug for more examples and plotting options.

References

PCA

  1. Jolliffe I.T. (2002) Principal Component Analysis. Springer Series in Statistics, Springer, New York.

sparse PCA

  1. Shen H. and Huang J.Z. (2008) Sparse principal component analysis via regularized low rank matrix approximation. Journal of Multivariate Analysis 99(6), 1015–1034.

  2. Witten D.M., Tibshirani R. and Hastie T. (2009) A penalized matrix decomposition, with applications to sparse principal components and canonical correlation analysis. Biostatistics 10(3), 515–534.