We provide a MINT analysis example that combines six transcriptomics studies. A preprint of the method is available.
Latest mixOmics version
Don't forget to update to the latest mixOmics version.
The data we analyse here are a small subset of the original data from our manuscript (in which we integrated 15 transcriptomics studies and the expression levels of 13,313 transcripts). The aim is to identify a gene signature discriminating the 3 classes of stem cells (Fibroblast, hESC and hiPS).
data(stemcells)
?stemcells

# create the combined data set X
X = stemcells$gene
dim(X) # 167 400

# the outcome vector Y:
Y = stemcells$celltype
length(Y) # 167
summary(Y)
# Fibroblast       hESC       hiPS
#         30         37         58

# the vector indicating each independent study
study = stemcells$study
# number of samples per study:
summary(study)
#  1  2  3  4
# 38 51 21 15
Running a MINT PLS-DA
As a first step we can run a MINT PLS-DA model where no feature selection is performed, with 2 components (usually we set ncomp = K-1, where K is the number of classes to discriminate).
mint.plsda.res = mint.plsda(X = X, Y = Y, study = study, ncomp = 2)
mint.plsda.res # tells us the useful outputs / graphical functions we can use
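The K-1 rule of thumb mentioned above can also be computed directly from the outcome factor, rather than hard-coding ncomp. A minimal sketch, reusing the Y vector defined earlier:

```r
# derive the number of components from the number of classes (K - 1 rule of thumb)
K <- nlevels(factor(Y)) # 3 cell types: Fibroblast, hESC, hiPS
ncomp <- K - 1          # 2 components
```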
Let’s have a look at the sample plot: are we able to discriminate the three groups by including all those transcripts in the model?
plotIndiv(mint.plsda.res, legend = TRUE, title = 'MINT PLS-DA', subtitle = 'No variable selection')
Well, this is not too bad for a start – Fibroblasts are separated on the first PLS-DA component, while the second component seems to further discriminate hESC and hiPS. From a biological perspective, we know that Fibroblasts are well-characterised and easier to classify than hESC and hiPS, which are both pluripotent cell types.
Running MINT sparse PLS-DA
We now run a sparse MINT PLS-DA model, where we specify 2 components, and a selection of 10 transcripts on component 1 and 15 transcripts on component 2 (tuning and performance evaluation can be performed as shown in the last section).
mint.splsda.res = mint.splsda(X = X, Y = Y, study = study, ncomp = 2,
                              keepX = c(10, 15))
mint.splsda.res # lists useful functions that can be used with a MINT object

# the selectVar function outputs the transcripts selected on the first component,
# along with their loading weight values
selectVar(mint.splsda.res, comp = 1)$value
#                   value.var
# ENSG00000181449 -0.57327132
# ENSG00000123080  0.44978896
# ENSG00000110721 -0.42901548
# ENSG00000176485 -0.31402002
# ENSG00000184697 -0.24522534
# ENSG00000102935 -0.23993515
# ENSG00000198542  0.18561136
# ENSG00000117676 -0.13163620
# ENSG00000163661  0.12458874
# ENSG00000112210  0.03715721
MINT sample plots
We can have a look at the global components (i.e. calculated on the combined data set X):
plotIndiv(mint.splsda.res, study = 'global', legend = TRUE, title = 'MINT sPLS-DA', subtitle = 'Global')
We observe an improvement in the separation on the first component (Fibroblast vs. the others) compared to the MINT PLS-DA model. The hESC and hiPS separation is not clear-cut (which makes sense biologically speaking).
We can also plot study-specific outputs using the partial components internal to the MINT sPLS-DA method.
plotIndiv(mint.splsda.res, study = 'all.partial', legend = TRUE, title = 'MINT sPLS-DA')
The study-specific outputs are useful to visualise which study might make the integrative analysis go 'wrong'.
# to visualise only a specific study (not shown here)
plotIndiv(mint.splsda.res, study = '2', legend = TRUE)
The plotArrow function displays the centroid of each study, and how each sample relates to its own study centroid.
plotArrow(mint.splsda.res, legend = TRUE)
MINT variable plots
Let’s now have a look at the molecular signature selected with MINT sPLS-DA. Our favourite correlation circle plot highlights the contribution of each selected transcript to each component (transcripts close to the large circle contribute the most). See also more examples here.
plotVar(mint.splsda.res, cex = 4)
The Clustered Image Map (CIM) is a heatmap representing the molecular signature selected on both components.
Note: you can use the argument comp to only represent the variables selected on a given component. Have a look at ?cim for more options on cim, or here.
# display only the transcripts selected on comp 1
cim(mint.splsda.res, comp = 1)
There is also a relevant network option to visualise the relationships between the selected variables and each outcome category. The argument comp specifies the component on which the variables were selected (a single integer only).
# uncomment the following if you experience margin issues with RStudio
# jpeg(file = 'network.jpeg')
network(mint.splsda.res, color.node = c(color.mixo(1), color.mixo(2)),
        comp = 1, shape.node = c("rectangle", "circle"),
        color.edge = color.jet(50), lty.edge = "solid", lwd.edge = 2,
        show.edge.labels = FALSE, interactive = FALSE)
# dev.off()
For more details on how to interpret a correlation circle plot, a Clustered Image Map or a network, see our pedagogical examples in this article.
The plotLoadings function displays the loading weights of each selected variable on each component, either globally (using the global loading vector) or per study. The colour indicates the class in which the variable has the maximum level of expression (contrib = 'max'), on average (method = 'mean', the default) or using the median (method = 'median').
# global loadings: not shown
plotLoadings(mint.splsda.res, study = 'global', contrib = 'max')

# study-specific partial loadings
plotLoadings(mint.splsda.res, study = 'all.partial', contrib = 'max')
The study-specific partial plotLoadings output (shown above) is a good way to check that the selected variables are assigned similar loading weights in each study (i.e. there is good agreement between all studies for these selected variables).
Performance of MINT sPLS-DA
The performance of the MINT model is assessed using 'Leave-one-group-out' cross-validation (LOGOCV): one study is set aside as an independent test set to predict on, while the training is performed on the remaining studies, based on a molecular signature of keepX = c(10, 15) as specified in our mint.splsda.res model.
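The perf function below implements LOGOCV internally. As a rough sketch of what this protocol does (assuming, as per the mixOmics documentation, that the predict method for MINT objects accepts a study.test argument), it could be written as:

```r
# hypothetical sketch of LOGOCV: each study in turn serves as the test set,
# reusing X, Y and study from above; slot names assumed from mixOmics' predict()
for (s in levels(study)) {
  train <- study != s
  fit <- mint.splsda(X[train, ], Y[train], study = factor(study[train]),
                     ncomp = 2, keepX = c(10, 15))
  pred <- predict(fit, newdata = X[!train, ], study.test = factor(study[!train]))
  # classification error on the held-out study, using predictions at component 2
  err <- mean(pred$class$max.dist[, 2] != Y[!train])
  cat("held-out study", s, "- error rate:", round(err, 3), "\n")
}
```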
The perf function outputs several criteria, including the classification error rate and the Balanced Error Rate (BER), which accounts for an unbalanced number of samples per class. The prediction scores and the stability of the variables selected across folds can also be obtained.
perf.mint = perf(mint.splsda.res, progressBar = FALSE)
perf.mint

perf.mint$global.error
# $BER
#          max.dist centroids.dist mahalanobis.dist
# comp 1 0.5074557      0.2295744        0.2295744
# comp 2 0.1709434      0.1483380        0.2025474
#
# $overall
#        max.dist centroids.dist mahalanobis.dist
# comp 1    0.552          0.256            0.256
# comp 2    0.176          0.152            0.232
#
# $error.per.class
# $error.per.class$max.dist
#                comp 1     comp 2
# Fibroblast 0.0000000 0.03333333
# hESC       0.9189189 0.32432432
# hiPS       0.6034483 0.15517241
Choosing the best keepX in MINT sPLS-DA
The tune function performs LOGOCV for the different values of test.keepX provided for each component. Based on the mean classification error rate, it outputs the optimal keepX to include in the final model.
tune.mint = tune(X = X, Y = Y, study = study, ncomp = 2,
                 test.keepX = c(5, 10, 15, 20),
                 method = 'mint.splsda', progressBar = FALSE)
tune.mint # outputs the mean error rate per component and per tested keepX value

tune.mint$error.rate
#       comp 1    comp 2
# 5  0.3930556 0.3902778
# 10 0.3930556 0.3937500
# 15 0.3930556 0.3423611
# 20 0.3930556 0.3277778

# plot the error rates depending on the component
plot(tune.mint)

# outputs the names of the optimal selected variables
tune.mint$choice.keepX.constraint

# outputs the optimal keepX for a final MINT model
tune.mint$choice.keepX
# $`comp 1`
# [1] 5
#
# $`comp 2`
# [1] 20
For this example, the optimal keepX was 5 on the first component and 20 on the second component, leading to a mean classification error rate of 32.78%.
Note: a finer test.keepX grid could be tested. Needless to say, the tuning step should really come first in the MINT analysis!
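Once tuned, the final MINT sPLS-DA model can be refitted with the chosen keepX values. A minimal sketch, reusing the X, Y, study and tune.mint objects from above:

```r
# refit the final model with the tuned keepX values (5 and 20 in this example);
# unlist() handles choice.keepX being returned as either a list or a vector
optimal.keepX <- unlist(tune.mint$choice.keepX)
final.mint <- mint.splsda(X = X, Y = Y, study = study, ncomp = 2,
                          keepX = optimal.keepX)
plotIndiv(final.mint, study = 'global', legend = TRUE,
          title = 'Final MINT sPLS-DA')
```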