library(mixOmics) data(liver.toxicity) # Call data set in the package X <- liver.toxicity$gene # Load gene expression data into a matrix
results <- ipca(X) # run the method plotIndiv(results) # plot the samples plotVar(results) # plot the variables
?ipca can be run to determine the default arguments of this function:
- Number of components (
ncomp = 2): Only the first two Principal Components are calculated .
- Scaling (
scale = FALSE): The data is not scaled. If set to
TRUE, all variables will be standardised to have unit variance.
results <- sipca(X) # run the method plotIndiv(results) # plot the samples plotVar(results) # plot the variables # extract the variables used to construct the first IPC selectVar(results, comp = 1)$name # depict weight assigned to each of these variables plotLoadings(results, method = 'mean', contrib = 'max')
?sipca can be run to determine the default arguments of this function:
- Number of components (
ncomp = 3): Only the first three Principal Components are calculated .
- Number of variables to calculate components (
keepX = [50, 50, 50]): This parameter defaults to a list, of
ncomplength, containing 50 variables for each component. The best 50 variables will be used to construct each component.
Independent Principal Component Analysis (IPCA)
In some case studies, Principal Component Analysis (PCA) has limitations, such as:
PCA assumes that gene expression follows a multivariate normal distribution. Recent studies suggest that not all omics data can be assumed to follow a Gaussian distribution. For instance, microarray gene expression seems to follow a super-Gaussian distribution.
PCA decomposes the data based on the maximization of its variance. In some cases, the biological question may not be related to the highest variance in the data.
Independent Component Analysis (ICA) is a process where novel components are extracted from the data, not to maximise explained variance, but to denoise and reduce the impacts of artefacts [2, 3]. Components produced by ICA contain no overlapping information.
ICA and PCA can be combined into Independent Principal Component Analysis (IPCA) . IPCA combines the strengths of each component analysis method.
Sparse Independent Principal Component Analysis (sIPCA)
sIPCA applies the same framework that sPCA applies to PCA, but onto IPCA. This allows for Principal Components to be formed using a subset of optimally selected variables during the PCA portion of the IPCA algorithm.
Principles of (s)IPCA
IPCA is a non-parametric form of dimension reduction. It combines the goals of ICA and PCA such that it searches for components that capture the most variance across features which have had their noise reduced. The components yielded by IPCA (Independant Principle Components – IPCs) are non-overlapping and non-Gaussian. This is extremely useful in contexts where distributions such as the Super-Gaussian are expected (eg. microbiome data). In most cases where it is appropriate to use, IPCA outperforms both ICA and PCA by summarising the data better or requiring less components to summarise the data to the same degree.
The algorithm of IPCA is as follows:
- The original data matrix is centered (by default).
- PCA is used to reduce the dimensions of the data and generate the loading vectors. If sIPCA is being used, only specific variables are included in loading vectors.
- ICA (FastICA) is implemented on the loading vectors to reduce noise and yields independent loading vectors.
- The centered data matrix is projected on the independent loading vectors to obtain the Independent Principal Components (IPCs).
See Case Study: IPCA Liver toxicity for more examples and plotting options.