library(mixOmics)
data(multidrug) # call data set in the package
X <- multidrug$ABC.trans # load gene expression data into a matrix
result.pca.multi <- pca(X) # run the method
plotIndiv(result.pca.multi) # plot the samples
plotVar(result.pca.multi) # plot the variables
?pca can be run to determine the default arguments of this function:
- ncomp = 2: Only the first two Principal Components are calculated.
- center = TRUE: The data is centered, such that all variables have a mean of 0.
- scale = FALSE: The data is not scaled. If set to TRUE, all variables will be standardised to have unit variance.

result.spca.multi <- spca(X, keepX = c(50, 30)) # run the method
plotIndiv(result.spca.multi) # plot the samples
plotVar(result.spca.multi) # plot the variables
# extract the variables used to construct the first PC
selectVar(result.spca.multi, comp = 1)$name
# depict weight assigned to each of these variables
plotLoadings(result.spca.multi, method = 'mean', contrib = 'max')
?spca can be run to determine the default arguments of this function:
- keepX = rep(ncol(X), ncomp): By default, all variables are used to compute the selected number of Principal Components.
- ncomp and center: These have the same defaults as in the pca() function. Unlike pca(), the scale parameter is set to TRUE by default.

Principal Component Analysis (PCA) [1] is primarily used to explore omics datasets and identify their largest sources of variation. The aim of PCA is to reduce the dimensionality of the input data while retaining as much information as possible, which allows for visualisation. PCA is a mathematical procedure that constructs new, orthogonal axes which are linear combinations of the original axes. These new axes are the Principal Components (PCs). PCs are ordered by decreasing explained variance. Hence, the first PC always captures the most variance from the original data, and each subsequent PC captures no more than the one before it.
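The ordering and orthogonality described above can be checked numerically. The following is a minimal sketch that assumes base R's prcomp() as a stand-in for pca() and uses the built-in USArrests data purely for illustration; the variable names are hypothetical.

```r
# Base R sketch: prcomp() is used here in place of mixOmics::pca()
X <- as.matrix(USArrests)            # built-in dataset: 50 samples x 4 variables
fit <- prcomp(X, center = TRUE, scale. = FALSE)

# Variance captured by each PC, as a proportion of the total variance
prop_var <- fit$sdev^2 / sum(fit$sdev^2)
print(round(prop_var, 3))

# The proportions never increase from one PC to the next
is_ordered <- all(diff(prop_var) <= 0)

# The new axes are orthogonal: cross-products of the loading vectors
# should be (numerically) the identity matrix
gram <- crossprod(fit$rotation)
```

Here is_ordered should be TRUE and gram should be close to the identity matrix, which reflects the two properties named in the text.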
sPCA [2] is based on singular value decomposition and is appropriate
for dealing with large data sets where not all variables are likely to
be equally important. As implemented in mixOmics,
‘sparsity’ is achieved via LASSO penalisation, such that PCs are no
longer a linear combination of all original variables - just a subset
containing the ‘best’ (information rich) variables. sPCA can be used to
investigate whether ‘tighter’ sample clusters can be obtained as
redundant and non-discriminatory variables are not included.
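To make the idea of a sparse loading vector concrete, the sketch below computes a single sparse PC in base R by alternating between a soft-thresholding (LASSO-type) step and a score update. It is a simplified illustration under stated assumptions, not the mixOmics implementation: the function names soft_threshold and sparse_pc1, the argument keep (loosely analogous to one entry of keepX), and the mtcars data are all hypothetical choices.

```r
# Soft-thresholding: shrink towards zero and set small entries exactly to zero
soft_threshold <- function(z, lambda) sign(z) * pmax(abs(z) - lambda, 0)

# Hypothetical helper: first sparse PC with `keep` non-zero loadings
sparse_pc1 <- function(X, keep, n_iter = 100) {
  stopifnot(keep >= 1, keep < ncol(X))
  X <- scale(X, center = TRUE, scale = TRUE)
  u <- svd(X, nu = 1, nv = 0)$u[, 1]          # start from the ordinary first PC
  for (i in seq_len(n_iter)) {
    z <- drop(crossprod(X, u))                # unpenalised loadings, X'u
    # pick the penalty so that only the `keep` largest |z| survive
    lambda <- sort(abs(z), decreasing = TRUE)[keep + 1]
    v <- soft_threshold(z, lambda)
    xv <- drop(X %*% v)
    u <- xv / sqrt(sum(xv^2))                 # updated, unit-norm scores
  }
  v / sqrt(sum(v^2))                          # unit-norm sparse loading vector
}

loading <- sparse_pc1(as.matrix(mtcars), keep = 4)
selected <- names(loading)[loading != 0]      # variables that build this PC
print(selected)
```

Only the variables in selected contribute to the component; every other variable has a loading of exactly zero, which is the sense in which the PC is built from a subset of the variables.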
This is an unsupervised, exploratory method which seeks to reduce the dimensionality of the data whilst retaining as much of the original information as possible. It yields principal components (PCs), which are linear combinations of the original dataset’s features. The weight of each feature in a given PC is defined by that PC’s corresponding loading vector.
The data is projected onto the PCs using the loading vectors to determine each sample’s new position in the subspace spanned by the PCs. PCs are calculated to maximise the captured variance, meaning that as successive PCs are produced, each explains no more variance than the one before.
The original (centred) dataset can be reconstructed by multiplying the matrix of principal components by the transpose of the matrix of ‘loading vectors’ (the weights assigned to the variables). The reconstruction is exact when all PCs are retained and approximate otherwise.
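The projection and reconstruction steps can be written out in a few lines. This is a minimal base R sketch using svd() on the built-in USArrests data, chosen only for illustration; the variable names are hypothetical and the code is not taken from mixOmics.

```r
X  <- as.matrix(USArrests)
Xc <- scale(X, center = TRUE, scale = FALSE)   # centre each variable

s <- svd(Xc)
P <- s$v                    # loading vectors, one column per PC
scores <- Xc %*% P          # samples projected onto the PCs

# With all PCs, scores %*% t(P) recovers the centred data
X_full <- scores %*% t(P)

# With only the first two PCs, the result is a rank-2 approximation
X_rank2 <- scores[, 1:2] %*% t(P[, 1:2])

err_full  <- max(abs(Xc - X_full))    # numerically zero
err_rank2 <- max(abs(Xc - X_rank2))   # non-zero: some information is discarded
```

Comparing err_full with err_rank2 shows the trade-off: dropping PCs reduces dimensionality at the cost of an inexact reconstruction.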
In mixOmics, (s)PCA is numerically solved in two ways (both are embedded within pca() and spca()):
1. Singular value decomposition (SVD) of the data matrix. This is the most computationally efficient method and is also adopted by most software (including the R function prcomp within the stats package). SVD is suitable for data containing no missing values.
2. In the case of missing values, Non-linear Iterative Partial Least Squares (NIPALS) can be utilised. This method is less efficient but more robust and accurate.
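The two approaches can be compared on a small example. The sketch below is a simplified base R version of NIPALS for the first component only, written for illustration and not taken from mixOmics; the function name nipals_pc1 and its arguments are hypothetical. Missing entries are skipped in every sum, which is why the iteration still works where svd() cannot be applied.

```r
# Hypothetical helper: first principal component via NIPALS, skipping NAs
nipals_pc1 <- function(X, max_iter = 500, tol = 1e-9) {
  obs <- (!is.na(X)) * 1            # 1 where a value is observed, 0 where missing
  X0 <- X
  X0[obs == 0] <- 0                 # zeros drop out of the sums below
  t_vec <- X0[, which.max(colSums(X0^2))]   # start from the most variable column
  for (i in seq_len(max_iter)) {
    # loadings: regress each column on the scores, observed entries only
    p <- drop(crossprod(X0, t_vec)) / drop(crossprod(obs, t_vec^2))
    p <- p / sqrt(sum(p^2))
    # scores: regress each row on the loadings, observed entries only
    t_new <- drop(X0 %*% p) / drop(obs %*% p^2)
    converged <- sum((t_new - t_vec)^2) < tol * sum(t_new^2)
    t_vec <- t_new
    if (converged) break
  }
  list(scores = t_vec, loadings = p)
}

X <- scale(as.matrix(USArrests), center = TRUE, scale = FALSE)

# Complete data: NIPALS and SVD give the same first loading vector (up to sign)
fit_complete <- nipals_pc1(X)
v1 <- svd(X)$v[, 1]
agreement <- abs(sum(fit_complete$loadings * v1))   # |cosine| between the two

# Data with missing values: svd() cannot be applied, NIPALS still returns scores
X_na <- X
X_na[c(3, 17, 40), 2] <- NA
fit_na <- nipals_pc1(X_na)
```

On complete data, agreement should be very close to 1, so the iterative route costs extra computation without changing the answer; its benefit appears only once values are missing.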