plotIndiv() – Sample Plot
The plotIndiv() function shows the relationships and similarities between samples. The samples are represented as points in a two- (or three-) dimensional subspace spanned by the latent variables yielded by the multivariate models. These plots allow the clustering of samples to be evaluated.
When using this function, points can be coloured and titled. In unsupervised cases, this can lead to false assumptions about the clustering of the samples. It is good practice to inspect a plot with no colours or symbols to gain an unbiased perspective on the similarity of points within the dataset(s).
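As a minimal sketch of this practice, the following assumes the mixOmics package is installed and uses its bundled nutrimouse data. The object name pca.unbiased is illustrative. No group, col or pch argument is passed, so every sample is drawn with the same colour and symbol.

```r
library(mixOmics)
data(nutrimouse)

# PCA on the lipid concentration data
pca.unbiased <- pca(nutrimouse$lipid, ncomp = 2)

# no group, col or pch supplied: all samples share one colour and symbol
plotIndiv(pca.unbiased,
          ind.names = FALSE,   # draw points rather than sample names
          title = 'PCA on nutrimouse lipid data, no grouping')
```

Grouping information can then be added in a second plot once the ungrouped structure has been inspected.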
There are many different parameters which can be input into
plotIndiv() to control the visualisation. Here, the primary few will be explained in detail.
While it may seem obvious, the title parameter is worth explicitly noting. All figures should have an appropriate title so readers can easily identify what is being shown in a given plot. It will default to
"plotIndiv". A single string is passed into this parameter and examples of its use can be seen throughout this page.
When integrating two data sets, the
plotIndiv() function enables the samples to be represented in a specific projection space:
- The space spanned by the components associated with the X data set, using the argument
rep.space = 'X-variate',
- The space spanned by the components associated with the Y data set, using the argument
rep.space = 'Y-variate',
- The space spanned by the mean of the components associated with the X and Y data sets, using the argument
rep.space = 'XY-variate'.
Figure 1a shows the default case when no argument for
rep.space is provided. It will show each X- and Y- space separately. Figure 1b shows what the corresponding plot looks like if the XY subspace is used.
library(mixOmics) # load the package
data(nutrimouse)  # load the nutrimouse data

X <- nutrimouse$lipid # set lipid concentrations as X dataset
Y <- nutrimouse$gene  # set genetic expression as Y dataset
pls.nutri <- pls(X, Y, ncomp = 2) # run PLS regression

# plot in separate subspaces
plotIndiv(pls.nutri,
          title = 'Figure 1a: PLS on lipid and gene data')

# plot in joint subspace
plotIndiv(pls.nutri,
          rep.space = 'XY-variate',
          title = 'Figure 1b: PLS on nutrimouse lipid and gene data')
FIGURE 1: Sample plots from PLS regression on the nutrimouse data, depicting the difference between projecting the samples into the individual X and Y spaces and into the averaged XY subspace.
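The single-space options listed above can be sketched in the same way. This assumes the mixOmics package and its nutrimouse data. The object name pls.single is illustrative.

```r
library(mixOmics)
data(nutrimouse)

pls.single <- pls(nutrimouse$lipid, nutrimouse$gene, ncomp = 2)

# samples projected onto the components of the X (lipid) data set only
plotIndiv(pls.single,
          rep.space = 'X-variate',
          title = 'PLS on nutrimouse data, X-variate space')

# samples projected onto the components of the Y (gene) data set only
plotIndiv(pls.single,
          rep.space = 'Y-variate',
          title = 'PLS on nutrimouse data, Y-variate space')
```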
Because R has several plotting packages, the one to be used can be set with the
style parameter. It defaults to 'ggplot2', but 'lattice' and
'graphics' can also be used. Figure 2 shows the differences between these styles.
An interactive 3D plot can also be produced by setting
style = '3d'. This requires the
rgl package to be installed.
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'ggplot2')
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'lattice')
plotIndiv(pls.nutri, rep.space = 'XY-variate', style = 'graphics')
FIGURE 2: Default plots using the three different 2D plotting packages.
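The 3D style mentioned above might be used as in the following sketch. A 3D plot needs three components, so the model is refitted here with ncomp = 3. The object name pls.nutri3 is illustrative, and the plot is only attempted when the rgl package is available.

```r
library(mixOmics)
data(nutrimouse)

# a 3D sample plot needs at least three components in the model
pls.nutri3 <- pls(nutrimouse$lipid, nutrimouse$gene, ncomp = 3)

# only attempt the 3D plot when rgl is installed
if (requireNamespace('rgl', quietly = TRUE)) {
  plotIndiv(pls.nutri3,
            comp = c(1, 2, 3),          # the three components to display
            rep.space = 'XY-variate',
            style = '3d')
}
```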
Ellipse-like confidence regions can be plotted around specific sample groups of interest (Murdoch and Chow, 1996). In the unsupervised or regression methods, the argument group must be specified to indicate the samples to be included in each ellipse. In the supervised methods, the samples are assigned by default to the outcome of interest that is specified in the method. Figure 3 shows what these ellipses look like at a 95% confidence level. This level can be set manually using the ellipse.level parameter.
plotIndiv(pls.nutri,
          group = nutrimouse$genotype,
          rep.space = 'XY-variate',
          ellipse = TRUE, # plot using the ellipses
          legend = TRUE)
FIGURE 3: Sample plot of PLS regression on nutrimouse data to depict the use of the confidence ellipses.
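To illustrate a non-default confidence level, the sketch below draws 80% ellipses through the ellipse.level argument of plotIndiv(). It assumes the mixOmics package and its nutrimouse data. The object name pls.ellipse and the 0.80 value are illustrative choices.

```r
library(mixOmics)
data(nutrimouse)

pls.ellipse <- pls(nutrimouse$lipid, nutrimouse$gene, ncomp = 2)

plotIndiv(pls.ellipse,
          group = nutrimouse$genotype,  # samples grouped per ellipse
          rep.space = 'XY-variate',
          ellipse = TRUE,
          ellipse.level = 0.80,         # 80% instead of the default 95%
          legend = TRUE,
          title = 'PLS on nutrimouse data, 80% ellipses')
```

Lower levels give tighter ellipses, which can make overlap between groups look smaller than it is at 95%.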
plotIndiv() in Unsupervised Single Omics
Here, the PCA methodology is used on the
nutrimouse lipid concentration data to exemplify the use of this function in the context of a single dataframe. From Figure 4, one could look at clustering by genotype or by diet.
pca.lipid <- pca(X, ncomp = 2) # undergo basic PCA

# plot samples in PC space
plotIndiv(pca.lipid,
          group = nutrimouse$diet,
          pch = nutrimouse$genotype,
          legend = TRUE, legend.title = 'Diet',
          legend.title.pch = 'Genotype',
          title = 'PCA on nutrimouse lipid data')
FIGURE 4: Sample plot from PCA on nutrimouse data. This plot can be used for clustering evaluation and to gain a better idea of the general structure of the data.
plotIndiv() in Supervised Single Omics
When dealing with only one predictive dataset in a supervised context, the
plotIndiv() function can be combined with the
background.predict() function to provide a meaningful visualisation of how the model has been trained, and how it will generalise to new data points. Figure 5 shows this on the
nutrimouse dataset, where the lipid concentration data is used as the predictor.
Y <- nutrimouse$genotype
splsda.nutri <- splsda(X, Y, ncomp = 2) # undergo basic sPLS-DA

# calculate the prediction background using the mahalanobis distance metric
background.mahal <- background.predict(splsda.nutri,
                                       comp.predicted = 2,
                                       dist = 'mahalanobis.dist')

# plot the sample plot
plotIndiv(splsda.nutri,
          pch = nutrimouse$genotype,
          legend = TRUE, legend.title = 'Diet',
          legend.title.pch = 'Genotype',
          title = 'sPLS-DA on nutrimouse lipid data',
          background = background.mahal)
FIGURE 5: Sample plot from sPLS-DA on nutrimouse lipid data. Includes a prediction background to show the classes that would be assigned to novel data points given their values on the first two latent components.
plotIndiv() in Multi-Omics
Unsupervised methods such as CCA or PLS integrate two datasets. This results in pairs of novel components, with one component of each pair belonging to each dataset.
plotIndiv() is useful for understanding the relationship structure between the two datasets. In these scenarios, the subspace (X, Y or XY) in which the samples are to be projected must be selected. Figure 6 shows the samples from the
nutrimouse dataset projected onto components from the lipid concentration and genetic expression data.
plotIndiv(pls.nutri,
          group = nutrimouse$diet,
          pch = nutrimouse$genotype,
          legend = TRUE, legend.title = 'Diet',
          legend.title.pch = 'Genotype',
          title = 'PLS on lipid and gene data')
FIGURE 6: Sample plots of PLS regression on nutrimouse data. These plots could be used to determine the similarities and differences between the two input datasets.
plotIndiv() in an N-Integration Framework
DIABLO, or multiblock (s)PLS-DA, also integrates two or more datasets, but uses their information to classify novel samples. Once again,
plotIndiv() provides information on the relationship between the inputted dataframes, or blocks. Figure 7 shows the degree of agreement between the different blocks and the discriminative ability of each data set. This example uses the
breast.TCGA dataset, as this contains 3 datasets that can be visualised.
Y <- nutrimouse$gene # set the Y dataframe to the genetic expression data

# undergo rCCA using the ridge regularisation method
rcca.res <- rcc(X, Y, ncomp = 3, method = 'ridge',
                lambda1 = 0.064, lambda2 = 0.008)

# plot samples
plotIndiv(rcca.res,
          group = nutrimouse$genotype,
          ind.names = FALSE,
          legend = TRUE,
          title = 'rCCA on nutrimouse data')
FIGURE 7: Sample plots from rCCA on nutrimouse data.
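For the breast.TCGA data mentioned above, a multiblock sample plot could be sketched as follows. This is an illustration under assumptions rather than the code behind any figure on this page. It uses block.plsda(), the non-sparse variant, to avoid choosing keepX values, and it takes the three training blocks bundled with mixOmics. The names blocks, subtype and diablo.sketch are illustrative.

```r
library(mixOmics)
data(breast.TCGA)

# three training blocks measured on the same samples
blocks <- list(mirna   = breast.TCGA$data.train$mirna,
               mrna    = breast.TCGA$data.train$mrna,
               protein = breast.TCGA$data.train$protein)
subtype <- breast.TCGA$data.train$subtype

# multiblock PLS-DA without variable selection
diablo.sketch <- block.plsda(blocks, subtype, ncomp = 2)

# one panel per block, samples coloured by subtype
plotIndiv(diablo.sketch,
          ind.names = FALSE,
          legend = TRUE,
          title = 'Multiblock PLS-DA on breast.TCGA')
```

Comparing the panels gives a first impression of how well each block separates the subtypes and whether the blocks agree with one another.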
plotIndiv() in P-Integration Framework
Using a P-integration framework, the independent studies can be plotted individually or all together. The
study parameter controls this. Setting it to
"all.partial" will plot all studies (as can be seen in Figure 8), whereas using a specific study name (e.g.
"2") will plot only that study. Figure 8 makes use of a multigroup sPLS-DA analysis on the stemcells dataset.
data(stemcells) # extract stem cells data

# undergo multigroup sPLS-DA
mint.res <- mint.splsda(X = stemcells$gene,
                        Y = stemcells$celltype,
                        ncomp = 2, keepX = c(10, 5),
                        study = stemcells$study) # specify studies to be used

# plot just the second study
# plotIndiv(mint.res, study = "2")

# plot study-specific outputs for all studies
plotIndiv(mint.res, study = "all.partial", legend = TRUE)
FIGURE 8: Sample plots from a multigroup sPLS-DA on the stem cells dataset. Projection of samples onto each study’s latent components is depicted. This aids in evaluating the similarity between each dataset.
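The other values of the study parameter can be sketched in the same way. The example below assumes the mixOmics package and its stemcells data. It plots the global sample plot, with all studies projected together, using study = 'global', and then a single study. The object name mint.sketch is illustrative.

```r
library(mixOmics)
data(stemcells)

mint.sketch <- mint.splsda(X = stemcells$gene,
                           Y = stemcells$celltype,
                           ncomp = 2, keepX = c(10, 5),
                           study = stemcells$study)

# all studies projected onto the global components
plotIndiv(mint.sketch, study = 'global', legend = TRUE,
          title = 'MINT sPLS-DA, global components')

# only the samples of the study labelled "2"
plotIndiv(mint.sketch, study = '2', legend = TRUE)
```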
Refer to the following case studies for a more in-depth look at interpreting the output of the plotIndiv() function:
- PCA – Multidrug
- IPCA – Liver Toxicity
- rCCA – Nutrimouse
- sPLS – Liver Toxicity
- sPLS-DA – SRBCT
- Multilevel – Vac18