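library(mixOmics) # load the package
data(stemcells) # example multi-study gene expression data
X <- stemcells$gene # expression matrix (samples x genes)
Y <- stemcells$celltype # class label (cell type) of each sample
study <- stemcells$study # study membership of each sample
stem.mint.plsda <- mint.plsda(X, Y, study = study) # 1. run the method
plotIndiv(stem.mint.plsda) # 2. plot the samples
plotVar(stem.mint.plsda) # 3. plot the variables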
?mint.plsda can be run to determine all default arguments of this function:
ncomp = 2: The first two PLS-DA components are calculated.
scale = TRUE: Each feature is standardised to zero mean and unit variance.
study: This does not have a default and needs to be explicitly specified. It is a factor indicating which study each sample is drawn from.
stem.mint.splsda <- mint.splsda(X, Y, study = study,
                                keepX = c(10, 5)) # 1. run the method, keeping 10 and 5 features on components 1 and 2
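plotIndiv(stem.mint.splsda) # 2. plot the samples
plotVar(stem.mint.splsda) # 3. plot the variables
selectVar(stem.mint.splsda, comp = 1)$name # 4. names of the features selected on component 1
plotLoadings(stem.mint.splsda, method = 'mean', contrib = 'max') # 5. loading weights of the selected features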
?mint.splsda can be run to determine all default arguments of this function:
The defaults listed above for mint.plsda apply here as well.
keepX: If unspecified, all features of the original dataframes will be used. Here, 10 and 5 features were arbitrarily selected.
The Multivariate INTegrative method, MINT, is a set of functions focused on the horizontal integration (or P-integration) of datasets, that is, datasets measured on the same P variables (e.g. the same set of genes). Within the mixOmics package, there are multi-group (a.k.a. multiple-study) variants of the PCA, (s)PLS and (s)PLS-DA methods.
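To make the idea of P-integration concrete, the sketch below stacks two small toy studies on their shared features and builds the matching study factor. It uses base R only; the objects `study_a`, `study_b`, `X_toy` and `study_toy` are hypothetical names invented for illustration and are not part of mixOmics or the stemcells data.

```r
# Two hypothetical studies: rows are samples, columns are genes.
study_a <- matrix(1:6, nrow = 2,
                  dimnames = list(c("a1", "a2"), c("g1", "g2", "g3")))
study_b <- matrix(7:12, nrow = 3,
                  dimnames = list(c("b1", "b2", "b3"), c("g2", "g3")))
studies <- list(A = study_a, B = study_b)

# Keep only the features measured in every study (the shared P variables).
shared <- Reduce(intersect, lapply(studies, colnames))

# Stack the samples of all studies on the shared features.
X_toy <- do.call(rbind, lapply(studies, function(m) m[, shared, drop = FALSE]))

# Record which study each stacked sample came from.
study_toy <- factor(rep(names(studies),
                        times = vapply(studies, nrow, integer(1))))

dim(X_toy)        # samples x shared features
table(study_toy)  # number of samples per study
```

A matrix and study factor assembled this way have the same shape as the `X` and `study` objects passed to the MINT functions.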
While the studies to be integrated are homogeneous in the type of omics data used and the features measured, there are systematic differences between datasets, arising from differences in the timing and geographical location of each study. When the variation between datasets outweighs the variation within each dataset, an analysis that ignores these differences may yield spurious results.
MINT methods differ from their standard counterparts in that they control for batch effects prior to undergoing their normal procedures. Hence, the inter-dataset variation can be appropriately accounted for. Integrating studies in this way not only increases effective sample size and statistical power, but also enables the sharing of data across research communities and the re-use of existing data deposited in public databases.
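As a rough illustration of what accounting for a study effect can look like, the sketch below standardises each feature within each study on simulated data. This is a simplified base R illustration under stated assumptions, not the package's internal implementation; `toy_study`, `toy_X` and `scale_within_study` are hypothetical names introduced here.

```r
set.seed(42)

# Simulated data: 2 studies of 20 samples, with a study-specific shift
# standing in for a batch effect.
toy_study <- factor(rep(c("A", "B"), each = 20))
shift <- ifelse(toy_study == "B", 5, 0)
toy_X <- cbind(gene1 = rnorm(40) + shift,
               gene2 = rnorm(40) - shift)

# Centre and scale every feature separately within each study.
scale_within_study <- function(X, study) {
  X_scaled <- X
  for (s in levels(study)) {
    rows <- study == s
    X_scaled[rows, ] <- scale(X[rows, , drop = FALSE])
  }
  X_scaled
}

toy_X_adj <- scale_within_study(toy_X, toy_study)

# Per-study feature means before and after the adjustment.
round(apply(toy_X, 2, function(col) tapply(col, toy_study, mean)), 2)
round(apply(toy_X_adj, 2, function(col) tapply(col, toy_study, mean)), 2)
```

Before the adjustment the per-study means differ by roughly the simulated shift; afterwards each feature has mean zero and unit variance within every study, so the study offset no longer dominates the pooled data.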
Previously, sPLS-DA was extended to combine independent transcriptomics studies and to identify a gene signature defining human Mesenchymal Stromal Cells (MSC) [1]. This is a topical question in stem cell biology, as MSCs are a poorly defined group of stromal cells despite their increasingly recognized clinical importance.
In that first study, 84 highly curated public gene expression datasets, representing 125 MSC and 510 non-MSC samples and spanning 13 different microarray platforms, were integrated. YuGene normalisation [2] was used in combination with an improved sPLS-DA. Extensive subsampling was used to avoid overfitting and to ensure a robust and reproducible gene signature.
The resulting platform-agnostic signature of 16 genes gave a classification accuracy of 97.8% on the training set and 93.5% on an external test set (187 MSC and 474 non-MSC). The MSC molecular signature predictor is available in the Stemformatics web resource, and an R package, bootsPLS, is also available on CRAN. The molecular signature has brought novel insights into the origin and function of MSC, and it can be considered a more accurate alternative to current immunophenotyping methods.
There are two overarching frameworks in the application of MINT in mixOmics:
The supervised approach: The aim is to classify novel samples and identify a set of discriminative markers leading to accurate class prediction on an external test set. In this context, there is a data matrix (X) which is used to explain the vector indicating class membership of each sample (y).
The unsupervised approach: Here, the aim is the integration of two data matrices (X and Y) in order to identify correlated variables (analogous to canonical PLS analysis) or variables from X which best explain Y (analogous to regression PLS analysis).
Both of these approaches have sparse and non-sparse variants.
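For the supervised framework, prediction on an external test set can be sketched by holding out one study of the stemcells data, training on the remaining studies, and predicting the held-out samples. The choice of study "3" as the test set, the `keepX` values and the object names `test.idx`, `train.model`, `test.pred` and `pred.class` are arbitrary assumptions made for illustration, not a tuned or recommended setup.

```r
library(mixOmics)
data(stemcells)
X <- stemcells$gene
Y <- stemcells$celltype
study <- stemcells$study

# Hold out one study to act as an external test set (arbitrary choice).
test.idx <- which(study == "3")

# Train a sparse MINT PLS-DA model on the remaining studies.
train.model <- mint.splsda(X[-test.idx, ], Y[-test.idx],
                           study = droplevels(study[-test.idx]),
                           ncomp = 2, keepX = c(10, 5))

# Predict the class of the held-out samples.
test.pred <- predict(train.model, newdata = X[test.idx, ],
                     dist = "centroids.dist",
                     study.test = factor(study[test.idx]))

# Predicted classes using both components, compared with the truth.
pred.class <- test.pred$class$centroids.dist[, 2]
table(truth = Y[test.idx], predicted = pred.class)
```

The resulting table cross-tabulates true against predicted cell types for the held-out study, which gives a first impression of how well the signature transfers to data from a study the model has not seen.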
FIGURE 1: Visualisation of the two different frameworks within mixOmics. The subscripts (1, ..., M or 1, ..., L) denote the set of different studies being integrated. P, Q and n denote the number of predictive features, response features and samples respectively.