The plotIndiv()
function shows the relationship and
similaries between samples. The samples are represented as points in a
two (or three) dimensional subspace that is spanned by latent variables
yeilded from the multivariate models. These plots allow for the
clustering of samples to be evaluated.
When using this function, points can be coloured and titled. In unsupervised cases, this can lead to false assumptions about the clustering of the samples. It is good practice to inspect a plot with no colours or symbols to gain an unbiased perspective on the similarity of points within the dataset(s).
library(mixOmics)
data(nutrimouse)
There are many different parameters which can be input into
plotIndiv()
to control the visualisation. Here, the primary
few will be explained in detail.
While it may seem obvious, this parameter is worth explicitly noting.
All figures should have an appropriate title so readers can easily
identify what is being shown in a given plot. It will default to
"plotIndiv"
. A single string is passed into this parameter
and examples of its use can be seen throughout this page.
When integrating two data sets, the plotIndiv()
function
enables the representation of the samples into a specific projection
space:
rep.space = X-variate
,rep.space = Y-variate
,rep.space = XY-variate
.Figure 1a shows the default case when no argument for
rep.space
is provided. It will show each
X- and Y- space separately. Figure 1b
shows what the corresponding plot looks like if the XY
subspace is used.
X <- nutrimouse$lipid # set lipid concentrations as X dataset
Y <- nutrimouse$gene # set genetic expression as Y dataset
pls.nutri <- pls(X, Y, ncomp = 2) # undergo PLS regression
# plot in separate subspaces
plotIndiv(pls.nutri, title = 'Figure 1a: PLS on lipid and gene data')
plotIndiv(pls.nutri, # plot in joint subspace
rep.space = 'XY-variate',
title = 'Figure 1b: PLS on nutrimouse lipid and gene data')
FIGURE 1: Samples plots from PLS regression on the nutrimouse data to depict the differences between projecting them into individual X and Y spaces, or the averaged XY subspace
Due to R containing multiple different plotting packages, the desired
package to be used can be set using the style
parameter. It
defaults to using 'ggplot2'
, but 'lattice'
and
'graphics'
can also be used. Figure 2 shows off the
differences between each of these styles.
An interactable 3D plot can also be produced by setting
style = 3d
. This requires the rgl
package to
be installed.
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'ggplot2')
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'lattice')
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'graphics')
FIGURE 2: Default plots using the three different 2D plotting packages.
Ellipse-like confidence regions can be plotted around specific sample
groups of interest (Murdoch and Chow, 1996). In the unsupervised or
regression methods, the argument group must be specified to indicate the
samples to be included in each ellipse. In the supervised methods, the
samples are assigned by default to the outcome of interest that is
specified in the method. Figure 4 shows what these ellipses look like at
a 95% confidence level. This level can be set manually using the
ellipse.level
parameter.
plotIndiv(pls.nutri, group = nutrimouse$genotype,
rep.space = 'XY-variate',
ellipse = TRUE, # plot using the ellipses
legend = TRUE)
FIGURE 3: Sample plot of PLS regression on nutrimouse data to depict the use of the confidence ellipses.
Here, the PCA methodology is used on the nutrimouse
lipid concentration data to exemplify the use of this function in the
context of a single dataframe. From Figure 4, one could look at
clustering by genotype or by diet.
pca.lipid <- pca(X, ncomp = 2) # undergo basic PCA
plotIndiv(pca.lipid, group = nutrimouse$diet, # plot samples in PC space
pch = nutrimouse$genotype,
legend = TRUE, legend.title = 'Diet',
legend.title.pch = 'Genotype',
title = 'PCA on nutrimouse lipid data')
FIGURE 4: Sample plot from PCA on nutrimouse data. This plot can be used for clustering evaluation and to gain a better idea of the general structure of the data.
When dealing with only one predictive dataset in a supervised
context, the plotIndiv()
function can be combined with the
background.predict()
function to provide a meaningful
visualisation of how the model has been trained, and how it will
generalise to new data points. Figure 5 shows this on the
nutrimouse
dataset, where the lipid concentration data is
used as the predictor.
Y <- nutrimouse$genotype
splsda.nutri <- splsda(X, Y, ncomp = 2) # undergo basic sPLS-DA
# calculate the prediction background using the mahalanobis distance metric
background.mahal <- background.predict(splsda.nutri,
comp.predicted = 2,
dist = 'mahalanobis.dist')
# plot the sample plot
plotIndiv(splsda.nutri, pch = nutrimouse$genotype,
legend = TRUE, legend.title = 'Diet',
legend.title.pch = 'Genotype',
title = 'sPLS-DA on nutrimouse lipid data',
background = background.mahal)
FIGURE 5: Sample plot from sPLS-DA on nutrimouse lipid data. Includes a prediction background to show the classes that would be assigned to novel data points given their values on the first two latent components.
Unsupervised methods such as CCA or PLS integrate two datasets. This
results in pairs of novel components, where one from each pair belongs
to each dataset. plotIndiv()
is useful for understanding
the relationship structure between the two datasets. In these scenarios,
the subspace (X, Y or
XY) in which the samples are to be projected must be
selected. Figure 6 hows the samples from the nutrimouse
dataset projected onto components from the lipid concentration and
genetic expression data.
plotIndiv(pls.nutri, group = nutrimouse$diet,
pch = nutrimouse$genotype,
legend = TRUE, legend.title = 'Diet',
legend.title.pch = 'Genotype',
title = 'PLS on lipid and gene data')
FIGURE 6: Samples plots of PLS regression on nutrimouse data. These plots could be used to determine the similarities and differences between the two inputted datasets.
DIABLO, or multiblock (s)PLS-DA,
also integrate two datasets, but uses their information in order to
classify novel samples. Once again, plotIndiv()
provides
information on the relationship between the inputted dataframes, or
blocks. Figure 7 shows the degree of agreement between the
different blocks and the discriminative ability of each data set. This
example uses the breast.TCGA
dataset, as this contains 3
datasets that can be visualised.
Y <- nutrimouse$gene # set the Y dataframe to the genetic expression data
# undergo rCCA using the ridge regularisation method
rcca.res <- rcc(X, Y, ncomp = 3, method = 'ridge',
lambda1 = 0.064, lambda2 = 0.008)
plotIndiv(rcca.res, group = nutrimouse$genotype, ind.names = FALSE, # plot samples
legend = TRUE, title = 'rCCA on nutrimouse data')
FIGURE 7: Sample plots from rCCA on nutrimouse data.
Using a P-integrated framework, the independent studies can be
plotted individually or all together. The study
parameter
controls this. Including all.partial
will plot all studies
(as can be seen in Figure 8), where as using a specific number (eg.
"2"
) will just plot that study specifically. Figure 8 makes
use of a multigroup sPLS-DA analysis on the
stemcells
data.
data(stemcells) # extract stem cells data
mint.res <- mint.splsda(X = stemcells$gene, # undergo multigroup
Y = stemcells$celltype,# sPLS-DA
ncomp = 2,
keepX = c(10, 5),
study = stemcells$study) # specify studies to be used
# plot just the second study
#plotIndiv(mint.res, study = "2")
# plot study-specific outputs for all studies
plotIndiv(mint.res, study = "all.partial", legend = TRUE)
FIGURE 8: Sample plots from a multigroup sPLS-DA on the stem cells dataset. Projection of samples onto each study’s latent components is depicted. This aids in evaluating the similarity between each dataset.