Koren Bodysites Case Study

Case Study of MixMC sPLS-DA with Koren Bodysite data

The mixMC framework is one that is specifically built for microbial datasets and will be used here on the Koren Bodysite dataset yielded by Koren et al, 2011 [1]. A sPLS-DA methodology will be employed in order to predict the bodysite a given sample was drawn from based on the OTU data (Operational Taxonomic Unit). The model will select the optimal set of OTUs to perform this prediction.

For background information on the mixMC or sPLS-DA methods, refer to the MixMC Method Page and sPLS-DA Method Page.

R script

The R script used for all the analysis in this case study is available here.

To begin

Load the latest version of mixOmics. Note that the seed is set such that all plots can be reproduced. This should not be included in proper use of these functions.

library(mixOmics) # import the mixOmics library

set.seed(5249) # for reproducibility, remove for normal use

The data

The Koren bodysite dataset derives from an examination of the link between oral, gut and plaque microbial communities in patients with atherosclerosis (vs. controls). While the original methodology of this study involved repeated measures, for the sake of simplicity it will be considered as non-repeated for this case study. It is assumed that the data are offset and pre-filtered, as described in mixMC pre-processing steps.

The mixOmics Koren dataset is accessed via Koren.16S and contains the following:

  • Koren.16S$data.TSS (continuous matrix): 43 rows and 980 columns. The prefiltered
    normalised data using Total Sum Scaling normalisation.
  • Koren.16S$data.raw (continuous matrix): 43 rows and 980 columns. The prefiltered raw
    count OTU data which include a 1 offset (i.e. no 0 values).
  • Koren.16S$taxonomy (categorical matrix): 980 rows and 7 columns. Contains the taxonomy (ie. Phylum, … Genus, Species) of each OTU.
  • Koren.16S$indiv (categorical matrix): 43 rows and 22 columns. Contains all the sample meta data recorded.
  • Koren.16S$bodysite (categorical vector): factor of length 43 indicating the bodysite with levels arterial plaque, saliva and stool.

The raw OTU data will be used as predictors (X dataframe) for the bodysite (Y vector). The dimensions of the predictors is confirmed and the distribution of the response vector is observed.

data("Koren.16S") # extract the microbial data
X <- Koren.16S$data.raw # set the raw OTU data as the predictor dataframe
Y <- Koren.16S$bodysite # set the bodysite class as the response vector

dim(X) # check dimensions of predictors
## [1]  43 980
summary(Y) # check distribution of response
## arterial plaque          saliva           stool 
##              13              15              15

Initial Analysis

Preliminary Analysis with PCA

The first exploratory step involves using PCA (unsupervised analysis) to observe the general structure and clustering of the data to aid in later analysis. As this data are compositional by nature, a centered log ratio (CLR) transformation needs to be undergone in order to reduce the likelihood of spurious results. This can be done by using the logratio parameter in the PCA function.

Here, a PCA with a sufficiently large number of components (ncomp = 10) is generated to choose the final reduced dimension of the model. Using the ‘elbow method’ in Figure 1, it seems that two components will be more than sufficient for a PCA model.

koren.pca = pca(X, ncomp = 10, logratio = 'CLR') # undergo PCA with 10 comps

plot(koren.pca) # plot explained variance

plot of chunk unnamed-chunk-3

FIGURE 1: Bar plot of the proportion of explained variance by each principal component yielded from a PCA.

Below, the samples can be seen projected onto the first two Principal Components (Figure 2). The separation of each bodysite can be clearly seen, even just across these two components (account for a total of 37% of the OTU variation). It seems that the first component is very good at defining the arterial plaque bodysite against the other two bodysites while the second discriminates between the saliva and stool bodysites very well.

koren.pca = pca(X, ncomp = 2, logratio = 'CLR') # undergo PCA with 2 components

plotIndiv(koren.pca, # plot samples projected onto PCs
          ind.names = FALSE, # not showing sample names
          group = Y, # color according to Y
          legend = TRUE,
          title = 'Koren OTUs, PCA Comps 1&2')

plot of chunk unnamed-chunk-4

FIGURE 2: Sample plots from PCA performed on the Koren OTU data. Samples are projected into the space spanned by the first two components.

Initial sPLS-DA model

The mixMC framework uses the sPLS-DA multivariate analysis from mixOmics [3]. Hence, the next step involves generating a basic PLS-DA model such that it can be tuned and then evaluated. In many cases, the maximum number of components needed is k-1 where k is the number of categories within the outcome vector (y) – which in this case is 3. Once again, the logratio parameter is used here to ensure that the OTU data are transformed into an Euclidean space.

basic.koren.plsda = plsda(X, Y, logratio = 'CLR', 
                          ncomp = nlevels(Y))

Tuning sPLS-DA

Selecting the number of components

The ncomp Parameter

To set a baseline from which to compare a tuned model, the performance of the basic model is assessed via the perf() function. Here, a 5-fold, 10-repeat design is utilised. To obtain a more reliable estimation of the error rate, the number of repeats should be increased (between 50 to 100). Figure 3 shows the error rate as more components are added to the model (for all three distance metrics). The Balanced Error Rate (BER) is more appropriate to use here due to the unbalanced response vector (arterial plaque has less samples than the other two bodysites). Across all three distance metrics, a model using three components had the lowest error rate. Hence, it seems that three components is appropriate.

# assess the performance of the sPLS-DA model using repeated CV
basic.koren.perf.plsda = perf(basic.koren.plsda,  
                              validation = 'Mfold', 
                              folds = 5, nrepeat = 10, 
                              progressBar = FALSE)

# extract the optimal component number
optimal.ncomp <- basic.koren.perf.plsda$choice.ncomp["BER", "max.dist"] 

plot(basic.koren.perf.plsda, overlay = 'measure', sd=TRUE) # plot this tuning

plot of chunk unnamed-chunk-6

FIGURE 3: Classification error rates for the basic sPLS-DA model on the Koren OTU data. Includes the standard and balanced error rates across all three distance metrics.

Selecting the number of variables

The keepX Parameter

With the suggestion that ncomp = 3 is the optimal choice, the number of features to use for each of these components needs to be determined (keepX). This will be done via the tune.splsda() function. Again, a 5-fold, 10-repeat cross validation process is used but in reality these values should be higher. Figure 4 below shows that the error rate is minimised when 2 components are included in the sPLS-DA, whereas the third component seems to add noise. The diamonds indicate the optimal keepX variables to select on each component based on the balanced error rate.

Note that the selected ncomp from the tune.splsda() and perf() functions are different. As the sparse model (sPLS-DA) will be used, ncomp = 2 is the more appropriate choice.

grid.keepX = c(seq(5,150, 5))

koren.tune.splsda = tune.splsda(X, Y,
                          ncomp = optimal.ncomp, # use optimal component number
                          logratio = 'CLR', # transform data to euclidean space
                          test.keepX = grid.keepX,
                          validation = c('Mfold'),
                          folds = 5, nrepeat = 10, # use repeated CV
                          dist = 'max.dist', # maximum distance as metric
                          progressBar = FALSE)

# extract the optimal component number and optimal feature count per component
optimal.keepX = koren.tune.splsda$choice.keepX[1:2] 
optimal.ncomp = koren.tune.splsda$choice.ncomp$ncomp 

plot(koren.tune.splsda) # plot this tuning

plot of chunk unnamed-chunk-7

FIGURE 4: Tuning keepX for the sPLS-DA performed on the Koren OTU data. Each coloured line represents the balanced error rate (y-axis) per component across all tested keepX values (x-axis) with the standard deviation based on the repeated cross-validation folds.

Final Model

Following this tuning, the final sPLS-DA model can be constructed using these optimised values.

koren.splsda = splsda(X,  Y, logratio= "CLR", # form final sPLS-DA model
                      ncomp = optimal.ncomp, 
                      keepX = optimal.keepX)

Plots

Sample Plots

The sample plot found in Figure 5 depicts the projection of the samples onto the first two components of the sPLS-DA model. There is distinct separation of each bodysite as shown by the lack of any overlap of the 95% confidence ellipses.

plotIndiv(koren.splsda,
          comp = c(1,2),
          ind.names = FALSE,
          ellipse = TRUE, # include confidence ellipses
          legend = TRUE,
          legend.title = "Bodysite",
          title = 'Koren OTUs, sPLS-DA Comps 1&2')

plot of chunk unnamed-chunk-9

FIGURE 5: Sample plots from sPLS-DA performed on the Koren OTU data. Samples are projected into the space spanned by the first two components.

Another way to visualise the similarity of samples is through the use of a clustered image map (CIM). Figure 6 clearly shows which OTUs are associated with a given bodysite. This plot also depicts the sparsity commonly seen in microbial datasets.

cim(koren.splsda,
    comp = c(1,2),
    row.sideColors = color.mixo(Y), # colour rows based on bodysite
    legend = list(legend = c(levels(Y))),
    title = 'Clustered Image Map of Koren Bodysite data')

newplot

FIGURE 6: Clsutered Image Map of the Koren OTU data after sPLS-DA modelling. Only the keepX selected feature for components 1 and 2 are shown, with the colour of each cell depicting the raw OTU value after a CLR transformation.

Variable Plots

Interpretting Figure 7 below in conjunction with Figure 5 provides key insights into what OTUs are responsible for identifying each bodysite. Note that cutoff = 0.7 such that any feature with a vector length less than 0.7 is not shown. Three distinct clusters of OTUs can be seen – one of each seemingly corresponding to one of the three bodysites. The feature found by itself (4391625) could be interpretted as either:

  • being important in the identification of both the arterial plaque and saliva bodysites, or
  • being important in the identification that a sample does not belong to the stool bodysite.
plotVar(koren.splsda,
        comp = c(1,2),
        cutoff = 0.7, rad.in = 0.7,
        title = 'Koren OTUs, Correlation Circle Plot Comps 1&2')

plot of chunk unnamed-chunk-11

FIGURE 7: Correlation circle plot representing the OTUs selected by sPLS-DA performed on the Koren OTU data. Only the OTUs selected by sPLS-DA are shown in components 1 and 2. Cutoff of 0.7 used

The network plot seen in Figure 8 shows furthers the understanding of which OTUs are contributing to each bodysite. For each bodysite, there is a grouping of OTUs which identify that bodysite and none other. There are also two small clusters which contain OTUs which are found at two of the three bodysites, showing a similarity in the microbial communities of these areas. Interestingly, the saliva and arterial plaque bodysites share no common OTUs (at cutoff = 0.5).

network(koren.splsda,
        cutoff = 0.5,
        color.node = c("orange","lightblue"))

newplot

FIGURE 8: Relevance Network graph of the Koren OTUs selected by sPLS-DA on this dataset.

Evaluation of sPLS-DA

The mixOmics package also contains the ability to assess the classification performance of the sPLS-DA model that was constructed via the perf() function once again. The mean error rates per component and the type of distance are output. It can be beneficial to increase the number of repeats for more accurate estimations. It is clear from the below output that adding the second component drastically decreases the error rate.

# for reproducible results for this code, remove for your own code
set.seed(5249)  

# evaluate classification, using repeated CV and maximum distance as metric
koren.perf.splsda = perf(koren.splsda, validation = 'Mfold', 
                         folds = 5, nrepeat = 10, 
                         progressBar = FALSE, dist = 'max.dist') 

koren.perf.splsda$error.rate
## $overall
##        max.dist
## comp1 0.3116279
## comp2 0.0000000
## 
## $BER
##        max.dist
## comp1 0.3381197
## comp2 0.0000000

OTU Selection

The sPLS-DA selects the most discriminative OTUs that best characterize each body site. The below loading plots (Figures 9a&b) display the abundance of each OTU and in which body site they are the most abundant for each sPLS-DA component. Viewing these bar plots in combination with Figures 5 and 7 aid in understanding the similarity between bodysites. For both components, the 20 highest contributing features are depicted.

OTUs selected on the first component are mostly highly abundant in stool samples based on the mean of each OTU per body site. All OTUs seleced on the second component are strongly associated with the two other body sites. A positive contribution to the second component indicates association with the saliva bodysite while a negative value is indicative of the arterial plaque site.

plotLoadings(koren.splsda, comp = 1, 
             method = 'mean', contrib = 'max',  
             size.name = 0.8, legend = FALSE,  
             ndisplay = 20,
             title = "(a) Loadings of first component")

plotLoadings(koren.splsda, comp = 2, 
             method = 'mean', contrib = 'max',   
             size.name = 0.7,
             ndisplay = 20,
             title = "(b) Loadings of second comp.")

plot of chunk unnamed-chunk-14plot of chunk unnamed-chunk-14

FIGURE 9: The loading values of the top 20 contributing OTUs to the first (a) and second (b) components of a sPLS-DA undergone on the Koren OTU dataset. Each bar is coloured based on which bodysite had the maximum, mean value of that OTU.

To take this a step further, the stability of each OTU on these components can be assessed via the output of the perf() function. The below values (between 0 and 1) indicate the proportion of models (during the repeated cross validation) that used that given OTU as a contributor to the first sPLS-DA component. Those with high stabilities are likely to be the most important to defining a certain component.

# determine which OTUs were selected
selected.OTU.comp1 = selectVar(koren.splsda, comp = 1)$name 

# display the stability values of these OTUs
koren.perf.splsda$features$stable[[1]][selected.OTU.comp1] 
## 
## 1855954 4325533 4403632  752354  363064  323473  178557  179665  178462 4346374 4396298  353782  178385 
##    1.00    1.00    1.00    1.00    1.00    1.00    1.00    1.00    0.98    0.94    0.98    0.96    0.94 
##  211191 4468234 4391625  178627 4381553  188935  197698  181047  182271  191355 3889756  177702  237444 
##    0.96    0.94    0.98    0.98    0.96    0.96    0.94    0.94    0.92    0.94    0.98    0.92    0.94 
##  190990  197481  772282  177211  186133  174300 4381054  181155  180585  189282  178056  180572  843808 
##    0.94    0.96    0.96    0.94    0.92    0.92    0.94    0.96    0.92    0.92    0.96    0.94    0.92 
##  183532  198490  364824  182595 4422456  355307  192438  189403 3756485 1813954  198118  310284  343812 
##    0.96    0.92    0.92    0.92    0.98    0.92    0.92    0.90    0.86    0.94    0.86    0.92    0.90 
##  180903  337765  173883  174036 2714949  196902  182799  186606  191797  847228  364563  174862  340761 
##    0.94    0.84    0.92    0.82    0.86    0.90    0.86    0.82    0.88    0.92    0.86    0.74    0.84 
##  361266  189677  180090  178799 4421273  183157  196210  165118  989234  195185  178474 2951214  189760 
##    0.86    0.78    0.86    0.80    0.82    0.82    0.80    0.82    0.86    0.86    0.76    0.76    0.80 
##  554436  186891  342487  358914 4341534  365256 2442708  184021  182491  175751  178686  178517  165935 
##    0.84    0.78    0.76    0.78    0.74    0.78    0.62    0.72    0.74    0.76    0.60    0.84    0.72 
## 4468466  177859  179297  310608  197505  192698  185159  190577  192462  146586  178465  182986  176034 
##    0.80    0.72    0.66    0.74    0.74    0.70    0.64    0.64    0.62    0.56    0.60    0.56    0.48 
##  177905  190607  197510  188556  192218  179526  176318  353173  185339  180658 4127460 4424408  198439 
##    0.48    0.50    0.62    0.52    0.56    0.48    0.50    0.54    0.56    0.58    0.52    0.48    0.44 
##  351927  178386  358410  179601  322258  370772  180307  186392 
##    0.52    0.50    0.54    0.50    0.56    0.56    0.50    0.48

More information on Plots

For a more in depth explanation of how to use and interpret the plots seen, refer to the following pages:

References

  1. Koren, O., Knights, D., Gonzalez, A., Waldron, L., Segata, N., Knight, R., Huttenhower, C. and Ley, R.E., 2013. A guide to enterotypes across the human body: meta-analysis of microbial community structures in human microbiome datasets. PLoS Comput Biol, 9(1), p.e1002863
  2. Lê Cao KA, Costello ME, Lakis VA, Bartolo F, Chua XY, et al. (2016) MixMC: A Multivariate Statistical Framework to Gain Insight into Microbial Communities. PLOS ONE 11(8): e0160169
  3. Lê Cao, K.-A., Boitard, S., Besse, P.: Sparse PLS Discriminant Analysis: biologically relevant feature selection and graphical displays for multiclass problems. BMC bioinformatics 12(1), 253 (2011)
  4. Rohart F, Gautier B, Singh A, Lê Cao K-A (2017). mixOmics: an R package for ‘omics feature selection and multiple data integration.