Stemcells example

P-integration across independent studies with MINT

Here, we illustrate a supervised MINT analysis on a subset of stem cells transcriptomics data, see ?stemcells. The full data set was analysed in [1].

Load the latest version of mixOmics (check the version!).

library(mixOmics)

Rscript

The R script is available at this link (R script and RData)

#Data
In our original study we integrated 15 transcriptomics microarray stem cells datasets to classify three types of human cells: human Fibroblasts (Fib), human Embryonic Stem Cells (hESC) and human induced Pluripotent Stem Cells (hiPSC). Here we illustrate only a subset of those data, including 4 transcriptomics studies (125 samples in total) and the levels of expression of 400 transcripts. The data were obtained from the Stemformatics database (\url{www.stemformatics.org}, [2]).
There exists a biological hierarchy among the three cell types. On the one hand, differences between pluripotent (hiPSC and hESC) and non-pluripotent cells (Fib) are well-characterised and are expected to contribute to the main biological variation. On the other hand, hiPSC are genetically reprogrammed to behave like hESC and both cell types are commonly assumed to be alike. However, differences have been reported in the literature [3-5]. The MINT analysis presented below addresses both subclassification problems in a single analysis.

We first load the data from mixOmics and set up the categorical outcome y and the study membership:

data(stemcells)

#the combined data set X
X = stemcells$gene
dim(X) 
## [1] 125 400
# the outcome vector Y:  
Y = stemcells$celltype 
length(Y) 
## [1] 125
summary(Y)
## Fibroblast       hESC       hiPS 
##         30         37         58
# the vector indicating each independent study
study = stemcells$study
# number of samples per study:
summary(study)
##  1  2  3  4 
## 38 51 21 15
# experimental design
table(Y,study)
##             study
## Y             1  2  3  4
##   Fibroblast  6 18  3  3
##   hESC       20  3  8  6
##   hiPS       12 30 10  6

MINT PLS-DA

We first perform a MINT PLS-DA with all variables included in the model and ncomp = 5 components. The perf function is used to estimate the performance of the model and choose the optimal number of components in $choice.ncomp for our final model.

Elapsed running time is reported in seconds.

mint.plsda.res.perf = mint.plsda(X = X, Y = Y, study = study, ncomp = 5)

set.seed(2543)  # for reproducible result in this example
perf.mint.plsda.cell <- perf(mint.plsda.res.perf, validation = "Mfold", folds = 5, 
                  progressBar = FALSE, auc = TRUE) 
plot(perf.mint.plsda.cell, col = color.mixo(5:7))

plot of chunk unnamed-chunk-4

Based on the performance plot above, ncomp = 1 seems the achieve the best performance for both maximum and centroids distances in terms of BER (see supplemental information from [1] for more details about the prediction distances).

Additional numerical outputs such as the BER and overall error rates per component, and the error rates per class and per prediction distance can be output.

perf.mint.plsda.cell$global.error
## $BER
##         max.dist centroids.dist mahalanobis.dist
## comp 1 0.5189500      0.3504194        0.3504194
## comp 2 0.3738739      0.3370612        0.3797763
## comp 3 0.3796210      0.3828829        0.3558662
## comp 4 0.3886300      0.3821062        0.4016051
## comp 5 0.3616030      0.3616030        0.3050637
## 
## $overall
##        max.dist centroids.dist mahalanobis.dist
## comp 1    0.568          0.456            0.456
## comp 2    0.416          0.392            0.456
## comp 3    0.424          0.424            0.376
## comp 4    0.432          0.432            0.488
## comp 5    0.408          0.408            0.352
## 
## $error.rate.class
## $error.rate.class$max.dist
##               comp 1    comp 2    comp 3    comp 4    comp 5
## Fibroblast 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## hESC       0.9189189 0.6216216 0.6216216 0.6486486 0.5675676
## hiPS       0.6379310 0.5000000 0.5172414 0.5172414 0.5172414
## 
## $error.rate.class$centroids.dist
##               comp 1    comp 2    comp 3    comp 4    comp 5
## Fibroblast 0.0000000 0.0000000 0.0000000 0.0000000 0.0000000
## hESC       0.1891892 0.4594595 0.6486486 0.5945946 0.5675676
## hiPS       0.8620690 0.5517241 0.5000000 0.5517241 0.5172414
## 
## $error.rate.class$mahalanobis.dist
##               comp 1    comp 2     comp 3     comp 4    comp 5
## Fibroblast 0.0000000 0.0000000 0.06666667 0.03333333 0.0000000
## hESC       0.1891892 0.4324324 0.62162162 0.37837838 0.4324324
## hiPS       0.8620690 0.7068966 0.37931034 0.79310345 0.4827586

The optimal number of components:

perf.mint.plsda.cell$choice.ncomp
##         max.dist centroids.dist mahalanobis.dist
## overall        1              1                1
## BER            1              1                1

Despite having to choose 1, 1, 1, 1, 1, 1 component, we run a final MINT PLS-DA model for ncomp = 2 in order to obtain 2D graphical outputs.

mint.plsda.res = mint.plsda(X = X, Y = Y, study = study, ncomp = 2)
#mint.plsda.res # lists the different functions
plotIndiv(mint.plsda.res, legend = TRUE, title = 'MINT PLS-DA', 
          subtitle = 'stem cell study', ellipse = T)

plot of chunk unnamed-chunk-7

The sample plot shows that fibroblasts are separated on the first MINT PLS-DA component. We observe that while deemed not crucial for an optimal discrimination, the second component seems to discriminate hESC and hiPSC.

MINT sparse PLS-DA

The MINT PLS-DA model is built on all 400 genes in X, many of which may be uninformative to characterise the different classes. The MINT sPLS-DA analysis aims to identify a small subset of genes that best discriminate the classes.

Choice of parameters

We first tune the keepX parameter using the tune function for a MINT object. The function performs Leave-One-Group-Out-Cross-Validation (LOGOCV) for different values of test.keepX provided on each component, therefore no repeated CV is needed. Based on the mean classification error rate (overall error rate or BER), it will output the optimal number of component ncomp and the optimal keepX to be included in the final model.

The tuning takes about 35s to run.

tune.mint = tune(X = X, Y = Y, study = study, ncomp = 2, test.keepX = seq(1, 100, 1), 
method = 'mint.splsda', dist = "max.dist", progressBar = FALSE)

# tune.mint   # lists the different types of outputs

# mean error rate per component and per tested keepX value
# tune.mint$error.rate
# optimal number of components
tune.mint$choice.ncomp #tune.mint$choice.ncomp # tell us again than ncomp=1 is sufficient
## $ncomp
## [1] 1
## 
## $values
##      comp 1    comp 2
## 1 0.3888889 0.4555556
## 2 0.3333333 0.2666667
## 3 0.4333333 0.4916667
## 4 0.3888889 0.2222222
# optimal keepX
tune.mint$choice.keepX
## comp 1 comp 2 
##      6     16
plot(tune.mint, col = color.jet(2))

plot of chunk unnamed-chunk-9

Final MINT sPLS-DA model and outputs

Following the tuning results, our final MINT sPLS-DA model is as follows. Note that we still chose a model with 2 components in order to obtain 2D graphics.

mint.splsda.res = mint.splsda(X = X, Y = Y, study = study, ncomp = 2,  
                              keepX = tune.mint$choice.keepX)

#mint.splsda.res   # lists useful functions that can be used with a MINT object

The selectVar function outputs the selected transcripts on the first component along with their loading weight value. We consider variables as important in the model when their absolute loading weight value is high. In addition to this output, we could compare the stability of the selected features across studies (see Example in the PLS-DA analysis in [1]).

selectVar(mint.splsda.res, comp = 1)
## $name
## [1] "ENSG00000181449" "ENSG00000123080" "ENSG00000110721" "ENSG00000176485"
## [5] "ENSG00000184697" "ENSG00000102935"
## 
## $value
##                   value.var
## ENSG00000181449 -0.70533253
## ENSG00000123080  0.48230598
## ENSG00000110721 -0.44201007
## ENSG00000176485 -0.23247562
## ENSG00000184697 -0.10494626
## ENSG00000102935 -0.09723903
## 
## $comp
## [1] 1

Sample plots

The samples can be projected on the global components:

plotIndiv(mint.splsda.res, study = 'global', legend = TRUE, title = 'MINT sPLS-DA', 
          subtitle = 'Global', ellipse=T)

plot of chunk unnamed-chunk-12

Or, alternatively using the MINT PLS-component from each study. This options allows us to examine each study individually.

plotIndiv(mint.splsda.res, study = 'all.partial',  title = 'MINT sPLS-DA', 
          subtitle = paste("Study",1:4))

plot of chunk unnamed-chunk-13

Additionally, the plotArrow function displays each sample from the X-components (start of the arrow) to the Y-components (end of the arrow):

plotArrow(mint.splsda.res)

plot of chunk unnamed-chunk-14

Correlation circle plots

We can have a look at our molecular signature selected with MINT sPLS-DA. Our favourite correlation circle plot highlights the contribution of each selected transcripts to each component (close to the large circle). See also more examples here.

plotVar(mint.splsda.res, cex = 4)

plot of chunk unnamed-chunk-15

CIM

A Clustered Image Map including the final gene signature is plotted (default values to Euclidian distance and Complete linkage). Specific component can be also be chosen.

cim(mint.splsda.res, comp = 1, margins=c(10,5), 
    row.sideColors = color.mixo(as.numeric(Y)), row.names = FALSE,
    title = "MINT sPLS-DA, component 1")