FAQ – mixOmics

If your issue is not found on this page, check our Discussion forum or our GitHub issue page. And if you still cannot find an answer, post a question on our Discussion forum!

The following questions are organised as such:

mixOmics methods
Data Input
Pre-processing
Parameter Tuning
Numerical Outputs
Graphical Outputs
Miscellaneous

mixOmics methods

What’s the difference between (s)PLS and rCCA ?

Yes, sPLS has two principal modes: regression and canonical. With canonical, the X (e.g. genomics) and the Y dataframe (e.g. metabolics) are used in a bidirectional manner to explain one another, i.e. the analysis is symmetrical, also called unsupervised. Similar to sPLS canonical mode, rCCA models a bidirectional relationship to uncover the correlation structure between the two data sets. The main difference is that sPLS looks for components which maximise the covariance between datasets, whilst rCCA maximises their correlation. As the number of variables is likely to be much larger than the number of samples, you will have to choose the regularization parameters for rCCA.

If a unidirectional relationship is desired i.e. a supervised analysis where X explains Y, look at using the regression mode of the sPLS method.

Useful links:
(s)PLS Case Study
rCCA Methods page
rCCA Case Study
(s)PLS Methods page

Can mixOmics perform an analysis for 3 (or more) datasets at one time or should I perform a pairwise analysis of two datasets?

Yes – this type of methodology is refered to as multiblock within the mixOmics package. If your datasets (of different omics types) are measured across the same samples, the N-integration framework will be very powerful. If they are of the same omics type and measure the same features, the P-integration framework is applicable.

Useful links:
N-integration Methods Page (N-integration)
MINT Methods Page (P-integration)

I have groups of samples, can I use omics to discriminate the groups?

Yes, you can use -DA methods if you are interested in discriminating classes (i.e. groups of samples). The sparse versions of -DA methods are useful to identify discriminative features. Current -DA methods are (s)PLS-DA, block (s)PLS-DA, MINT (s)PLS-DA and MINT block (s)PLS-DA.

Useful links:
(s)PLS-DA Case Study
(s)PLS-DA Methods Page

Data Input

How should my input data be formatted?

The input data should be in the data frame format, with samples (N) in rows and variables (P) in columns. Each data frame should include one type of omics, e.g. if you want to integrate transcriptomics and metabolomics, you will have two dataframes which you can connect in a list.

Can I combine SNP data with gene and/or metabolite expression data?

Genotype data like SNP data can be used with mixOmics but is of limited use. This is because SNPs have a very small, additive effect on phenotype. This means sparse methods to select most discriminative SNPs would not be useful. We recommend calculating polygenic response using other data available from consortiums and then use this information as your phenotype to integrate the genotype data with other modalities like gene/metabolite data. Although its worth giving mixOmics a try, you may need to consider other tools if you are working with genotype data.

Can I combine microbiome data with gene and/or metabolite expression data?

Microbiome data can be used within the mixOmics framework. Microbiome data needs to be transformed because it is compositional, and filtered because it is sparse (i.e. has many zeros). The mixMC method within mixOmics provides tools to perform this pre-processing.

Useful links:
mixMC Preprocessing

Pre-processing

Do I need to normalise my data?

Yes, in nearly all cases the data should be normalised and possibly pre-processed beforehand. mixOmics will accept any type of data (raw, normalised). Be aware that PLS methods will centre and scale the variables by default. For the PCA methods, users will need to specify center = TRUE and scale = TRUE. rCCA takes the data as is (i.e. no transformation).

Do I need to filter my data?

We recommend filtering large datasets to keep at most 10,000 variables per modality. Filtering can include removing variables with very low overall expression levels or variables with low variance. The mixOmics function nearZeroVar can be used to identify variables with low variance to remove them. Alternatively the near.zero.var parameter can be used during model building.

What should I do with missing values?

If you have variables with large amounts of missing values (NAs), consider removing them. If there are only a few missing values (less than 20% of total values in the dataset) they can be imputed (i.e. estimated).

mixOmics methods such as (s)PLS, (s)PLS-DA and (s)PCA use the NIPALS algorithm to impute missing values. rCCA can handle missing values when the cross validation approach is used. If the shrinkage method is being applied, imputation outside of mixOmics is required.

Useful links:
Missing Values

Parameter Tuning

What does ‘overfitting’ mean and how can I avoid it?

Overfitting happens when a model learns the training data too well, and takes into account noise and irrelevant details to help it make predictions. Overfitted models perform poorly when tested on external test data. The mixOmics tuning and model fitting framework helps to avoid overfitting. This framework involves: tuning to choose the right number of components and (in the case of sparse models) variables, perform model assessment using cross-validation and use the model to make predictions on unseen test data.

Useful links:
Performance assessment and parameter tuning workflow

When undergoing rCCA, what are the default values of the regularisation parameters?

When using the “ridge” method in rCCA, the lambda1 and lambda2 are tuned externally (via the tune.rcc() function). There are passed as parameters to rcc(). They do have default values (internally default to 0), but these are not relevant as then there will be no penalisation. These values are strongly dependant on the data itself and should always be specifically tuned.
Useful links:
rCCA Methods page
rCCA Case Study

I’m looking at the output of tune.splsda(). What distance metric should I use and should I use the overall or balanced error rate?

This question is an extremely important one. As would be expected, it is entirely case dependent so there is no universal rule. However, there are a few markers we can use to guide our decision making. We will consider a sPLS-DA methodology for the remainder of this answer.

In regards to what distance metric to use, this paper provides a detailed breakdown. The key points is that generally the centroid based methods (centroids.dist and mahalanobis.dist) lead to better predictions. This is particularly true in complex classification problems and N-integration problems – and in these contexts the Mahalanobis distance usually performs the best. The maximum distance is great in simpler scenarios and when interpretability is important.

When deciding which error rate (overall or balanced) to use, it is dependent on the distribution of classes in the response vector. If it is unbalanced, such that each factor of the response variable has different counts, the balanced error rate (BER) is more appropriate. BER calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. Therefore, contrary to ER, BER is less biased towards majority classes during the evaluation.

Useful links:
(s)PLS-DA Case Study
(s)PLS-DA Methods Page

When undergoing feature count tuning, how should I select my grid of keepX values?

Unfortunately, this common question does not have a definitive answer and is very dependent on the context. Using an iterative tuning process will save time in the long run. In other words, don’t use a keepX grid that contains every value between 1 and 300 (seq(1, 300, 1)). Start with a lower ‘resolution’ and increase this in later iterations. For example, run the first round of tuning using seq(10, 300, 20) such that only every twentieth value after 10 is tried. Lets say that for the first component, the optimal value was 170. From here, run the tuning process again, but this time use a smaller range with a higher granularity (ie. smaller interval), like seq(120, 220, 5). Once this has selected a value (eg. 180) run one last tuning with seq(170, 190, 1).

This iterative process can be avoided depending on the research question. If the aim is to identify a small set of predictors, then focus the grid around lower values (eg. seq(1, 20, 1)).

Numerical Outputs

I have used plotVar() and would like the correlation coordinate for a specific variable. How do I do this?

The below code will provide the answer. By indexing the output matrix (simMat) as follows, the exact value can be determined: simMat[desired.X.feature, desired.Y.feature].

nutri.res <- rcc(X, Y, ncomp = 3, lambda1 = 0.064, lambda2 = 0.008) # run your analysis

# calculate sum of each samples' position on each variate across the two datasets
bisect = nutri.res$variates$X + nutri.res$variates$Y

# find correlation between original X dataframe and each summed component
cord.X = cor(nutri.res$X, bisect, use = "pairwise")
# find correlation between original Y dataframe and each summed component
cord.Y = cor(nutri.res$Y, bisect, use = "pairwise")

# place all correlation values into a parsable matrix
simMat = as.matrix(cord.X %*% t(cord.Y))

plotVar() Graphics Page

How can I access the predicted classes of the test samples when using predict() on the result of plsda()/splsda()?

The output object contains the class attribute. For each distance metric used, within class there will be a dataframe correpsonding to that metric. The rows of this dataframe correspond to the test samples while each column represents the predictions made for each sample using models of increasing component count (up to the W inputted ncomp).

The following call was used (derives from the sPLS-DA case study (1)):

# train the model
train.splsda.srbct <- splsda(X.train, Y.train, ncomp = 3, keepX = c(25,25,25))

# use the model on the Xtest set
predict.splsda.srbct <- predict(train.splsda.srbct, X.test, dist = "mahalanobis.dist")

predict.splsda.srbct$class$mahalanobis.dist

##         comp1 comp2 comp3
## EWS.T3  "EWS" "EWS" "EWS"
## EWS.T6  "RMS" "EWS" "EWS"
## EWS.T13 "RMS" "EWS" "EWS"
## EWS.C2  "NB"  "RMS" "EWS"
## EWS.C11 "EWS" "EWS" "EWS"
## BL.C2   "BL"  "BL"  "BL" 
## RMS.C6  "EWS" "RMS" "RMS"
## RMS.C8  "NB"  "RMS" "RMS"
## RMS.C11 "NB"  "RMS" "RMS"
## RMS.T1  "RMS" "RMS" "RMS"
## RMS.T4  "RMS" "RMS" "RMS"
## RMS.T10 "NB"  "EWS" "RMS"
## RMS.T11 "RMS" "EWS" "RMS"

Useful links:

(s)PLS-DA Case Study

Why does the explained_variance output from methods such as PLS-DA not decrease across components like PCA?

PCA maximises the variance of the component related to the inputted dataframe in an iterative process. Hence, as more components are formed, each accounts for less of the total variance. Other methods within mixOmics (e.g. (s)PLS-DA) form their components via different means (e.g. via the maximisation of covariance between components to best separate classes). Therefore, PCA is the only method where a consistent decrease in captured variance across components is guaranteed.

Useful links:

How to extract the names of variables selected when performing sPLS?

In the future, a more explicit and robust way to extract the selected features will be implemented by the mixOmics team. Until then, there are two simple ways to achieve this.

Firstly is the use of the plotLoadings() function. It won’t plot any features with loading values equal to 0. In contexts where the number of features used (keepX and keepY) is fairly low, the selected features can easily be read off this plot.

Secondly, the below code can be used.

comp = 1 # select which component you are inspecting
which(spls.object$loadings$X[, comp] != 0) # show all features with non-zero loading values

To view the exact loading values for each feature in decreasing order:

features.to.view <- 10 # how many features do you want to look at
loadingsX1 = abs(spls.object$loadings$X[, comp]) # extract the absolute loading values
sort(loadingsX1, decreasing = TRUE)[1:features.to.view] # sort them and print out the desired quantity

Useful links:

(s)PLS Methods Page

Graphical Outputs

How can I interpret the various plots provided by mixOmics?

This question is dependent on the type of analysis being undergone as well as the biological question under inspection. In the case studies on this site, there are brief interpretations and explanations on how to draw conclusions from the various plots mixOmics offers. Also going through the functionality of the graphics themselves will aid in determining how to interpret them.

For a deeper understanding of how to interpret a given plot, refer to this paper.

Useful links:

(s)PLS Case Study
Graphics page
(s)PCA Case Study
(s)PLS-DA Case Study
rCCA Case Study

What is the difference between the cutoff parameter used in plotVar() and network()?

In plotVar(), variables with at least one coordinate (in absolute value) on any component (axes of the correlation circle plot) above of the cutoff are plotted.

In network(), a similarity measure between each X and Y variable is calculated. Variable pairs with a similarity measure (in absolute value) above the threshold are plotted.

The way to think about the difference between these two is that in network(), the cutoff controls which pairs of features are shown based on their association with one another. In plotVar(), it is the correlation of each feature and the components which cutoff controls.

Useful links:
network() Graphics Page
plotVar() Graphics Page

Regarding the cim() function, can I change the distances used?

The clustering method and distance used as part of this clustering can both be adjusted via the clust.method and dist.method parameters of the cim() function.

clust.method takes a character vector of length two (for rows and columns respectively). The default value is "complete". All the values that hclust() accepts are usable as part of cim() (see here for more information).

dist.method also takes a character vector of length two. It defaults to "euclidean" but can take any value which is supported by the dist() function (see here for more information).

Useful links:
cim() Graphics Page

Regarding cim() function, how can I get the identity of the objects clustered together?

After you have created your cim object, the dendrogram of the columns and the rows can be visualised via plot(cim.object$ddc) and plot(cim.object$ddr) respectively. The order in which the features are clustered can be accessed via: as.hclust(cim.object$ddr)$order.

Useful links:
cim() Graphics Page

How can I export a relevance network to Cytoscape?

Once the network plot has been produced, it should be exported to graphml format. Note that the network object has an attribute ($gR) which is a graph object designed for Cytoscape use. This does require loading of the igraph package. To export the graph, use the code below (changing the object and file name):

write.graph(network.result$gR, file = "network_graph.graphml", format = "graphml")

This network information file could then nicely be opened in Cytoscape via:

File -> Import -> Network (Multiple File Types)

Note: be sure that within Cytoscape the GraphML reader plugin is installed (should be available @ Plugins -> Manage Plugins; select from Network and Attribute I/O); otherwise download from here.

This answer was provided by Guido Hooiveld (thanks!)

Useful links:

network() Graphics Page

Miscellaneous

How do I cite mixOmics?

It is dependent on context, but see here to determine which paper to cite.

Is there a way by which gene and metabolite connection information can be incorporated in the mixOmics analysis?

All the mixOmics methods are entirely data-driven and do not include any a priori information. Despite this, the network() and cim() functions can provide key insights into the relationships between datasets, ie. between genetic and metabolic data.

Useful links:
network() Graphics Page
cim() Graphics Page

Why does plsda() return an error message even though I am using the same code as indicated in the tutorial or the help file?

This is highly likely due to a clash with another package that also provides a plsda() function (caret or muma for instance; the spls package provides a splsda() function) that was loaded before mixOmics. You can either unload these packages using the function detach() or use the command mixOmics::plsda(...) to be force the plsda() function from mixOmics to be run.

Why are the outputs of the mixOmics (s)PLS-DA method different to the output of packages such asDiscriMiner, SIMCA and ropls?

Users have noted that the R^{2}, Q^{2}, loading values and error rates of (s)PLS-DA models are different when using these different packages compared to the mixOmics version. In the case of SIMCA, it is hard to tell as they have not publicly released the algorithm they use. The algorithm proposed by Wold et al., 2001 is what is used by mixOmics. This also involves norming the X and dummy encoded Y dataframes.

Irrespective of the algorithm used by these packages, there are going to be inherent differences induced by differing folds during the cross-validation procedure and the prediction formula used by these packages. Also, we are unaware of whether these packages centre and scale the data (as mixOmics does).