# FAQ

The process of compiling the most frequent questions and appending them to this section is an on-going effort. If however your issue is not found on this page, check our Discussion forum or our GitHub issue page. And if you still cannot find an answer, post a question on our Discussion forum!

The following questions are organised as such:

- Data Input
- Pre-processing
- Parameter Tuning
- Numerical Outputs
- Graphical Outputs
- Miscellaneous

**Data Input**

**If I am using the (s)PLS framework to integrate my data, is the rCCA framework also applicable?**

Yes, sPLS has two principal modes: regression and canonical. With the latter, the **X** (eg. genomics) and the **Y** dataframe (eg. metabolics) are used in a bidirectional manner to explain one another. Similar to sPLS canonical mode, rCCA models a bidirectional relationship to uncover the correlation structure between the two data sets. The primary difference is that sPLS seeks for components which maximise the covariance between datasets rather than their correlation. As the number of variables is likely to be much larger than the number of samples, you will have to choose the regularization parameters.

If a unidirectional relationship is desired, look at using the regression mode of the sPLS methodology.

Useful links:

**Can **`mixOmics`

perform an analysis for 3 (or more) datasets at one time or should I perform a pairwise analysis of two datasets?

`mixOmics`

perform an analysis for 3 (or more) datasets at one time or should I perform a pairwise analysis of two datasets?It sure can. This type of methodology is refered to as **multiblock** within the `mixOmics`

package.

If your datasets (of different omics types) are measured across the same *samples*, the N-integration framework will be very powerful. If they are of the same omics type and measure the same *features*, the P-integration framework is applicable.

Useful links:

**Can **`mixOmics`

handle multiclass data?

`mixOmics`

handle multiclass data?Yes in the case of a classification context, you can use PLS-DA if you are interested in discriminating the classes. The sparse version sPLS-DA is useful to identify discriminative features. The other approaches are purely unsupervised and therefore do not take into account the classes of the samples.

Useful links:

**Can it combine SNP data with gene and/or metabolite expression data?**

As long as the data are measured on the same matching samples, it should work.

Useful links:

**Pre-processing**

**Should I normalize the data before using **`mixOmics`

?

`mixOmics`

?Yes, in nearly all cases the data should be normalized and possibly pre-processed beforehand. `mixOmics`

will accept any type of data (raw, normalized). Be aware that PLS methods will center and scale the variables by default. For the PCA methods, users will need to specify `center = TRUE`

and `scale = TRUE`

. rCCA takes the data as such (i.e. no transformation).

**What should I do with missing values?**

`mixOmics`

methods such as **(s)PLS**, **(s)PLS-DA** and **(s)PCA** use the **NIPALS** algorithm to handle missing values (ie. **NA**s). **rCCA** can handle missing values when the cross validation approach is used. If the shrinkage method is being applied, user derived imputation is required.

Useful links:

**Parameter Tuning**

**When undergoing rCCA, what are the default values of the regularisation parameters?**

When using the “ridge” method in rCCA, the \(\lambda1\) and \(\lambda2\) are tuned externally (via the `tune.rcc()`

function). There are passed as parameters to `rcc()`

. They do have default values (internally default to `0`

), but these are not relevant as then there will be no penalisation. These values are strongly dependant on the data itself and should always be specifically tuned.

Useful links:

**When running **`rcc()`

, I have run into the following error:

`rcc()`

, I have run into the following error:`chol.default(Cxx) : the leading minor of order 30 is not positive definite`

This is mostly likely to occur when encountering singular matrices, where the total number of variables from both data sets is much larger than the number of samples. We suggest using regularized CCA instead (i.e. with regularization parameters \(\lambda1\) and \(\lambda2\) greater than 0) or (s)PLS in canonical mode.

Useful links:

**I'm looking at the output of **`tune.splsda()`

. What distance metric should I use and should I use the overall or balanced error rate?

`tune.splsda()`

. What distance metric should I use and should I use the overall or balanced error rate?This question is an extremely important one. As would be expected, it is entirely case dependent so there is no universal rule. However, there are a few markers we can use to guide our decision making. We will consider a sPLS-DA methodology for the remainder of this answer.

In regards to what distance metric to use, this paper provides a detailed breakdown. The key points is that generally the centroid based methods (`centroids.dist`

and `mahalanobis.dist`

) lead to better predictions. This is particularly true in complex classification problems and N-integration problems – and in these contexts the Mahalanobis distance usually performs the best. The maximum distance is great in simpler scenarios and when interpretability is important.

When deciding which error rate (overall or balanced) to use, it is dependent on the distribution of classes in the response vector. If it is unbalanced, such that each factor of the response variable has different counts, the balanced error rate (BER) is more appropriate. BER calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. Therefore, contrary to ER, BER is less biased towards majority classes during the evaluation.

Useful links:

**When undergoing feature count tuning, how should I select my grid of **`keepX`

values?

`keepX`

values?Unfortunately, this common question does not have a definitive answer and is very dependent on the context. Using an iterative tuning process will save time in the long run. In other words, don't use a `keepX`

grid that contains every value between 1 and 300 (`seq(1, 300, 1)`

). Start with a lower 'resolution' and increase this in later iterations. For example, run the first round of tuning using `seq(10, 300, 20)`

such that only every twentieth value after 10 is tried. Lets say that for the first component, the optimal value was `170`

. From here, run the tuning process again, but this time use a smaller range with a higher granularity (ie. smaller interval), like `seq(120, 220, 5)`

. Once this has selected a value (eg. `180`

) run one last tuning with `seq(170, 190, 1)`

.

This iterative process can be avoided depending on the research question. If the aim is to identify a small set of predictors, then focus the grid around lower values (eg. `seq(1, 20, 1)`

).

**Numerical Outputs**

**I have used **`plotVar()`

and would like the correlation coordinate for a specific variable. How do I do this?

`plotVar()`

and would like the correlation coordinate for a specific variable. How do I do this?The below code will provide the answer. By indexing the output matrix (`simMat`

) as follows, the exact value can be determined: `simMat[desired.X.feature, desired.Y.feature]`

.

```
nutri.res <- rcc(X, Y, ncomp = 3, lambda1 = 0.064, lambda2 = 0.008) # run your analysis
# calculate sum of each samples' position on each variate across the two datasets
bisect = nutri.res$variates$X + nutri.res$variates$Y
# find correlation between original X dataframe and each summed component
cord.X = cor(nutri.res$X, bisect, use = "pairwise")
# find correlation between original Y dataframe and each summed component
cord.Y = cor(nutri.res$Y, bisect, use = "pairwise")
# place all correlation values into a parsable matrix
simMat = as.matrix(cord.X %*% t(cord.Y))
```

Useful links:

**How can I access the predicted classes of the test samples when using **`predict()`

on the result of `plsda()`

/`splsda()`

?

`predict()`

on the result of `plsda()`

/`splsda()`

?The output object contains the `class`

attribute. For each distance metric used, within `class`

there will be a dataframe correpsonding to that metric. The rows of this dataframe correspond to the test samples while each column represents the predictions made for each sample using models of increasing component count (up to theW inputted `ncomp`

).

The following call was used (derives from the sPLS-DA case study (*1*)):

```
# train the model
train.splsda.srbct <- splsda(X.train, Y.train, ncomp = 3, keepX = c(25,25,25))
# use the model on the Xtest set
predict.splsda.srbct <- predict(train.splsda.srbct, X.test, dist = "mahalanobis.dist")
```

```
predict.splsda.srbct$class$mahalanobis.dist
```

## comp1 comp2 comp3 ## EWS.T3 "EWS" "EWS" "EWS" ## EWS.T6 "RMS" "EWS" "EWS" ## EWS.T13 "RMS" "EWS" "EWS" ## EWS.C2 "NB" "RMS" "EWS" ## EWS.C11 "EWS" "EWS" "EWS" ## BL.C2 "BL" "BL" "BL" ## RMS.C6 "EWS" "RMS" "RMS" ## RMS.C8 "NB" "RMS" "RMS" ## RMS.C11 "NB" "RMS" "RMS" ## RMS.T1 "RMS" "RMS" "RMS" ## RMS.T4 "RMS" "RMS" "RMS" ## RMS.T10 "NB" "EWS" "RMS" ## RMS.T11 "RMS" "EWS" "RMS"

Useful links:

**Why does the **`explained_variance`

output from methods such as PLS-DA not decrease across components like PCA?

`explained_variance`

output from methods such as PLS-DA not decrease across components like PCA?PCA maximises the variance of the component related to the inputted dataframe in an iterative process. Hence, as more components are formed, each accounts for less of the total variance. Other methods within `mixOmics`

(eg. (s)PLS-DA) form their components via different means (eg. via the maximisation of covariance between components to best separate classes). Therefore, PCA is the only method where a consistent decrease in captured variance across components is *guaranteed*.

Useful links:

**How to extract the names of variables selected when performing sPLS?**

In the future, a more explicit and robust way to extract the selected features will be implemented by the `mixOmics`

team. Until then, there are two simple ways to achieve this.

Firstly is the use of the `plotLoadings()`

function. It won't plot any features with loading values equal to 0. In contexts where the number of features used (`keepX`

and `keepY`

) is fairly low, the selected features can easily be read off this plot.

Secondly, the below code can be used.

```
comp = 1 # select which component you are inspecting
which(spls.object$loadings$X[, comp] != 0) # show all features with non-zero loading values
```

To view the exact loading values for each feature in decreasing order:

```
features.to.view <- 10 # how many features do you want to look at
loadingsX1 = abs(spls.object$loadings$X[, comp]) # extract the absolute loading values
sort(loadingsX1, decreasing = TRUE)[1:features.to.view] # sort them and print out the desired quantity
```

Useful links:

**Graphical Outputs**

**How can I interpret the various plots provided by **`mixOmics`

?

`mixOmics`

?This question is dependent on the type of analysis being undergone as well as the biological question under inspection. In the case studies on this site (*2-5*), there are brief interpretations and explanations on how to draw conclusions from the various plots `mixOmics`

offers. Also going through the functionality of the graphcics themselves will aid in determining how to interpret them (*1*).

For a deeper understanding of how to interpret a given plot, refer to this paper.

Useful links:

- 1. Graphics page
- 2. (s)PCA Case Study
- 3. (s)PLS-DA Case Study
- 4. rCCA Case Study
- 5. (s)PLS Case Study

**What is the difference between the **`cutoff`

parameter used in `plotVar()`

and `network()`

?

`cutoff`

parameter used in `plotVar()`

and `network()`

?In `plotVar()`

, variables with at least one coordinate (in absolute value) on any component (axes of the correlation circle plot) above of the cutoff are plotted.

In `network()`

, a similarity measure between each **X** and **Y** variable is calculated. Variable pairs with a similarity measure (in absolute value) above the threshold are plotted.

The way to think about the difference between these two is that in `network()`

, the `cutoff`

controls which pairs of features are shown based on their association with one another. In `plotVar()`

, it is the correlation of each feature and the components which `cutoff`

controls.

Useful links:

**Regarding **`cim()`

function, can I change the distances used?

`cim()`

function, can I change the distances used?The clustering method and distance used as part of this clustering can both be adjusted via the `clust.method`

and `dist.method`

parameters of the `cim()`

function.

`clust.method`

takes a character vector of length two (for rows and columns respectively). The default value is `"complete"`

. All the values that `hclust()`

accepts are usable as part of `cim()`

(see here for more information).

`dist.method`

also takes a character vector of length two. It defaults to `"euclidean"`

but can take any value which is supported by the `dist()`

function (see here for more information).

Useful links:

**Regarding cim() function, how can I get the identity of the objects clustered together?**

After you have created your `cim`

object, the dendrogram of the columns and the rows can be visualised via `plot(cim.object$ddc)`

and `plot(cim.object$ddr)`

respectively. The order in which the features are clustered can be accessed via: `as.hclust(cim.object$ddr)$order`

.

Useful links:

**How can I export a relevance network to Cytoscape?**

Once the network plot has been produced, it should be exported to `graphml`

format. Note that the `network`

object has an attribute (`$gR`

) which is a graph object designed for Cytoscape use. This does require loading of the `igraph`

package. To export the graph, use the code below (changing the object and file name):

```
write.graph(network.result$gR, file = "network_graph.graphml", format = "graphml")
```

This network information file could then nicely be opened in Cytoscape via:

- File -> Import -> Network (Multiple File Types)

Note: be sure that within Cytoscape the GraphML reader plugin is installed (should be available @ Plugins -> Manage Plugins; select from Network and Attribute I/O); otherwise download from here.

*This answer was provided by Guido Hooiveld (thanks!)*

Useful links:

**Miscellaneous**

**How do I cite mixOmics?**

It is dependent on context, but see here to determine which paper to cite.

**Is there a way by which gene and metabolite connection information can be incorporated in the **`mixOmics`

analysis?

`mixOmics`

analysis?All the `mixOmics`

methods are entirely data-driven and do not include any a priori information. Despite this, the `network()`

and `cim()`

functions can provide key insights into the relationships between datasets, ie. between genetic and metabolic data.

Useful links:

**Why does **`plsda()`

return an error message even though I am using the same code as indicated in the tutorial or the help file?

`plsda()`

return an error message even though I am using the same code as indicated in the tutorial or the help file?This is highly likely due to a clash with another package that also provides a `plsda()`

function (`caret`

or `muma`

for instance; the `spls`

package provides a `splsda()`

function) that was loaded before `mixOmics`

. You can either unload these packages using the function `detach()`

or use the command `mixOmics::plsda(...)`

to be force the `plsda()`

function from `mixOmics`

to be run.

**What does the error **`system is computationally singular`

mean when undergoing (s)PLS-DA?

`system is computationally singular`

mean when undergoing (s)PLS-DA?There are a few different causes of this error and is context dependent. A few of the most likely sources are:

- A large degree of sparsity within your data, such that there is a large number of missing or zero values. One suggestion would be to remove all features that have more than a certain threshold (eg. 20%) of zero/missing value. If this does not work, refer to the Missing Values page on the site (link directly below) on how to impute these values manually.
- You are trying to create a (s)PLS-DA model with too many components. This is likely to result in large, empty matrices. Try reducing the number of components within your model.
- The features you are inputting have a high degree of multi-collinearity such that they are extremely similar in their distributions. Before generating the model, assess your features for their collinearity and remove features which are extremely similar to others.

Useful links:

**Why are the outputs of the **`mixOmics`

(s)PLS-DA method different to the output of packages such as`DiscriMiner`

, `SIMCA`

and `ropls`

?

`mixOmics`

(s)PLS-DA method different to the output of packages such as`DiscriMiner`

, `SIMCA`

and `ropls`

?Users have noted that the \(R^{2}\), \(Q^{2}\), loading values and error rates of (s)PLS-DA models are different when using these different packages compared to the `mixOmics`

version. In the case of `SIMCA`

, it is hard to tell as they have not publicly released the algorithm they use. The algorithm proposed by Wold et al., 2001 is what is used by `mixOmics`

. This also involves norming the **X** and dummy encoded **Y** dataframes.

Irrespective of the algorithm used by these packages, there are going to be inherent differences induced by differing folds during the cross-validation procedure and the prediction formula used by these packages. Also, we are unaware of whether these packages centre and scale the data (as `mixOmics`

does).