Multilevel Analysis
A multilevel, multivariate approach was developed in order to be able to accurately assess the complex structures of repeated measurement design data (which commonly use multiple different assays). This is where different treatments are applied onto the same subjects, or the same subjects were recorded at numerous time points. The issue with this type of data is that the variation between individuals may dwarf the variation that exists between the treatment groups, such that samples from a given individual will cluster together and provide little insight to the biological differences between the treatment groups.
The “multilevel” methodology was developed in collaboration with Dr. B. Liquet. Its implementation within mixOmics
allows for both supervised and unsupervised frameworks, as well as one- or two- factor frameworks, to utilise the multilevel methodology. This procedure has been shown to markedly increase the quality of feature selection and/or classification accuracy [1].
Many different functions (pls()
, spls()
, plsda()
, splsda()
, pca()
and spca()
) within the package contain the multilevel
argument. In all these cases, the withinVariation()
is called internally to extract the desired variation from the original dataframe. withinVariation()
can be called manually, as will be outlined below.
Requirements for a Multilevel Analysis
library(mixOmics) # import the mixOmics library
data(vac18) # extract the vac18 data
There are two main requirements for a multilevel analysis to be applicable and useful. Firstly, the data must have a repeated design. This may be across different time points (eg. prior- and post-treatment) or across different body sites (eg. microbiome samples from different organs within the one individual). This will be represented by the sample IDs of the dataset being repeated more than once. This can be seen for the vac18
study included in the package:
# the first row is the sample IDs, the second row is the corresponding frequency of each ID
summary(as.factor(vac18$sample))
## 1 2 3 4 5 6 7 8 9 10 11 12 ## 4 4 2 4 3 3 4 3 4 4 4 3
Note that this methodology works in unbalanced design scenarios (the frequency of sample IDs is not uniform).
Secondly, this method will be beneficial when the individual variation is significantly greater than the repeated measure variation. As mentioned above, this is when samples from the same individual are clustered more tightly compared to samples from the same group (ie. treatment or time). In this case, multilevel decomposition will reveal subtle differences which would otherwise be masked by the individual variation.
Below in Figure 1, a PCA plot on the original dataframe and a PCA plot on the multilevel decomposed dataframe are depicted. Looking at the original dataframe, samples from the same individual (shown by the number of each point) can be seen with high proximity and the treatment groups (shown by the colour of the point) seem to overlap considerably. This is the scenario where multilevel analysis will be advantageous to utilise. The decomposed dataframe separates the treatment groups much better as the samples from the same individual are less clustered.
# undergo pca and plot samples without any multilevel decomposition
pca.result <- pca(vac18$genes, scale = TRUE)
plotIndiv(pca.result,
ind.names = vac18$sample,
group = vac18$stimulation,
title = 'Figure 1a: PCA on VAC18 data')
# undergo pca and plot samples with multilevel decomposition
pca.result <- pca(vac18$genes, multilevel = vac18$sample, scale = TRUE)
plotIndiv(pca.result,
ind.names = vac18$sample,
group = vac18$stimulation,
legend = TRUE,
legend.title = "Treatment",
title = 'Figure 1b: Multilevel PCA on VAC18 data')
FIGURE 1: PCA plots of the original vac18 gene expression data and the multilevel decomposition on the same dataset
Figure 1b can be yielded by using the withinVariation()
function manually.
X <- vac18$genes # extract dataframe
design <- data.frame(sample = vac18$sample) # set multilevel design using sample IDs for each instance
Xw <- withinVariation(X, design) # decompose the dataframe
pca.result <- pca(Xw, scale = TRUE) # apply pca to decomposed dataframe
plotIndiv(pca.result, ind.names = vac18$sample, group = vac18$stimulation) # plot samples
Difference between decomposition methods
As mentioned directly above, the multilevel decomposition of a data frame can be achieved through the use of the withinVariation()
function prior to model building or by passing in the multilevel
parameter during model building. In some cases, these are essentially equivalent. This is not always the case however. In regards to classification models particularly, the resulting performance may vary quite drastically.
The tuning functions for methods such as splsda()
require that each of the inputted samples are independent of one another. Through use of multilevel
, tune.multilevel()
ensures that all other samples from the same same individual are removed during the cross-validation procedure – hence all samples would be independent. This is not the case when using the withinVariation()
function manually – though its action should make the samples effectively independent.
The primary take away from this disclaimer is that while use of the multilevel
parameter as part of tuning is simpler and usually results in a better model, this is not guaranteed. Also, using multilevel
is on average considerably slower than withinVariation()
. In the case where performance is suboptimal, the use of withinVariation()
should be explored.
Case study
See the Case Study: Multilevel Vac18 and Case Study: Multilevel Liver Toxicity for more tuning details and plotting options.