NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analysis; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.
Install development version of mixOmics:
devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")
Once you have chosen your statistical model, you need to tune the parameters of the model (Step 2 in ‘Performance Assessment and Hyperparameter Tuning’). The tune() function uses cross-validation to aid the choice of parameters for model building.
Currently, the tune() function can be applied to the models listed in Figure 1 below.
We are currently working on expanding the functionality of tune() to also work on other models, such as block (s)PLS and other MINT variants. For now, the parameters used in these other models have to be chosen by the user without tuning. To see which parameters you need to input for these models, look at the help pages, e.g. by running ?block.pls.
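For example, here is a minimal sketch of fitting a block PLS model with user-chosen parameters, using the breast.TCGA example data shipped with mixOmics (the data set and the choice of ncomp = 2 here are purely illustrative):
library(mixOmics)
# example multi-block data included in mixOmics
data(breast.TCGA)
X <- list(mRNA = breast.TCGA$data.train$mrna,
          miRNA = breast.TCGA$data.train$mirna)
Y <- breast.TCGA$data.train$protein
# ncomp is chosen by the user here, as block.pls cannot yet be tuned
block.pls.res <- block.pls(X, Y, ncomp = 2)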
Which parameters you need to tune depends on your model. For all models (except sPCA and rcc) you need to decide how many components to include. For sparse models you need to choose how many variables to keep, and for classification (i.e. -DA) models you need to choose which distance metric to use.
FIGURE 1: Table showing which parameters (ncomp, dist and keepX/keepY) need to be tuned for which models. Note this table only includes methods whose parameters can be tuned using tune().
All three parameters (components, variables and distance metrics) can be tuned using the function tune(). tune() runs cross-validation (as described in ‘Performance Assessment and Hyperparameter Tuning’) to identify the parameter combinations which give the best performing model. It is possible to tune all three parameters simultaneously, however this results in the generation and testing of many models, which can be computationally intensive and take a long time. Instead, we suggest first tuning the number of components (and optionally the distance metric), and then tuning the number of variables to keep.
Tuning the number of components for non-sparse methods is simply done by calling tune() on the input X and Y data and choosing how many components to test with ncomp.
As described in ‘Performance Assessment and Hyperparameter Tuning’, there are parameters you can use to control how cross-validation is run; here we have arbitrarily chosen to use 5 folds and to repeat the CV 3 times.
Here are some examples of how to tune the number of components to keep, and in the case of classification models also the distance metric to use. Note that tuning can be done by calling the specific tune function directly (e.g. tune.pls()) or by calling the tune() wrapper and specifying the method (e.g. tune(method = "pls")).
PLS (regression model) example
library(mixOmics)
# load data
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# run tuning on 5 components, may take a few seconds
tune.res <- tune.pls(X, Y, ncomp = 5,
folds = 5, nrepeat = 3,
seed = 12) # for reproducibility with example, remove for own analysis
# plot output
plot(tune.res)
FIGURE 2: Performance assessment metric MSEP plotted for each output Y variable across 5 components for PLS method. The plot suggests that adding more than 1 component does not significantly improve model accuracy.
PLS-DA (classification model) example
# load data
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- as.factor(breast.tumors$sample$treatment)
# run tuning on 5 components, may take a few seconds
tune.res <- tune(X, Y, ncomp = 5,
method = "plsda",
folds = 5, nrepeat = 3,
dist = "all",
seed = 12)
# plot output
plot(tune.res)
FIGURE 3: Classification error rate plotted across 5 components and 3 distance metrics for PLS-DA method. The plot suggests that adding more than 1 component does not significantly improve model accuracy, and that the distance metric max.dist is most accurate at one component.
For classification models, as well as plotting the performance metrics across multiple components, tune() calculates the optimal number of components per distance metric, which can be extracted using tune.res$choice.ncomp. Here, as suggested by the plot, the optimal number of components is 1. The plot also suggests that the best distance metric to use is max.dist, as it gives the lowest classification error rate for the 1 component model. See here for more information on the three different distance metrics used in mixOmics.
tune.res$choice.ncomp
## max.dist centroids.dist mahalanobis.dist
## overall 1 1 1
## BER 1 1 1
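These tuned values can then be carried forward into the final model. A minimal sketch (the choice of ncomp = 1 simply follows the tuning output above):
# sketch: fit the final PLS-DA model with the tuned number of components
final.plsda <- plsda(X, Y, ncomp = 1)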
When tuning sparse models (e.g. sPLS-DA), the same procedure is used, however the number of variables to test (test.keepX) should be set to NULL to tune only the number of components. Note that when test.keepX is set to NULL, tune() will print a message to state that it is only tuning for the number of components.
sPLS-DA (classification model) example
# run tuning on 5 components, may take a few seconds
tune.res <- tune(X, Y, ncomp = 5,
method = "splsda",
folds = 5, nrepeat = 3,
test.keepX = NULL,
dist = "all",
seed = 12)
## [1] "test.keepX set to NULL, tuning only for number of components..."
# print optimal ncomp
tune.res$choice.ncomp
## max.dist centroids.dist mahalanobis.dist
## overall 1 1 1
## BER 1 1 1
For sparse methods, lasso penalisation is applied to reduce the loading values of some features to zero, therefore only retaining features which are relevant to the model. The tune() function can be used to identify the optimal number of features to retain. By inputting a grid of variable numbers, e.g. test.keepX = c(5, 10, 15), cross-validation is performed on each number of variables for each component. The tune() function returns the number of variables to select that achieves the best predictive performance for each of the components that are progressively added to the model, while favouring a small signature.
As shown above, the optimal number of components for the example sPLS-DA model is 1 and the optimal distance metric is max.dist. These optimised parameters can now be used to reduce the number of models that need to be tested when choosing the number of variables to keep. Here we will increase ncomp to 2 for plotting reasons.
sPLS-DA (classification model) example
# run tune on three different keepX values (may take a few seconds)
tune.res <- tune(method = "splsda", X, Y,
ncomp = 2,
dist = "max.dist",
folds = 5, nrepeat = 3,
test.keepX = c(5, 10, 15),
seed = 12)
# plot the resulting output per test.keepX value
plot(tune.res)
FIGURE 4: Balanced error rate plotted across 2 components using 5, 10 and 15 X variables for the sPLS-DA method. The plot suggests that the sPLS-DA models with one component are more accurate than those with 2 components across all numbers of selected features. For the model made with 1 component, keeping 10 features produces the most accurate model; for the model made with 2 components, keeping 15 features produces the most accurate model.
Like ncomp, the optimal number of variables to keep (choice.keepX) can be extracted from the tune() output. In agreement with the plot above, this is 10 for component 1 and 15 for component 2.
tune.res$choice.keepX
## comp1 comp2
## 10 15
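These tuned values can be passed directly to the final model. A minimal sketch, simply reusing the tuning results from above:
# sketch: fit the final sPLS-DA model with the tuned keepX values
final.splsda <- splsda(X, Y, ncomp = 2,
                       keepX = tune.res$choice.keepX)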
sPLS (regression model) example
For sPLS models, both the number of X and Y variables can be tuned. In this case, the optimal combination of keepX and keepY is identified.
# set up data
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# tune across a grid of test.keepX and test.keepY for component 1
tune.res <- tune.spls(X, Y, ncomp = 1,
test.keepX = c(5, 10, 15),
test.keepY = c(3, 6, 8),
folds = 5, nrepeat = 3)
# print the optimal keepX and keepY values for component 1
tune.res$choice.keepX
## comp1
## 5
tune.res$choice.keepY
## comp1
## 3
# plot the performance assessment for each keepX and keepY combination
plot(tune.res)
FIGURE 5: Tuning plot showing correlation coefficients between t and u components (a measure of performance) of the sPLS model across different numbers of selected X and Y variables. The higher the correlation coefficients, the better the model performed. The size of the dots corresponds to the mean correlation coefficient across repeats and the colour to the standard deviation. The coordinate outlined with a green square represents the optimal combination of keepX and keepY, which is 5 X variables and 3 Y variables.
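As with sPLS-DA, the tuned values can be fed into the final model. A minimal sketch using the results above:
# sketch: fit the final sPLS model with the tuned keepX and keepY values
final.spls <- spls(X, Y, ncomp = 1,
                   keepX = tune.res$choice.keepX,
                   keepY = tune.res$choice.keepY)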
Choosing test.keepX and test.keepY values
Any number of test.keepX and test.keepY values can be input for tuning. The more values input, the more models will be built and tested, increasing computational time. When choosing which numbers of X and Y variables to test, consider the type of results expected for follow-up analyses. If we are interested in identifying biomarkers of disease, a small signature is preferable for knock-out wet-lab validation (small keepX and keepY). If we are interested in obtaining a better understanding of the pathways playing a role in a biological system, and plan to validate the signature with gene ontology, then a larger signature might be preferable (larger keepX and keepY). We advise starting with a coarse grid of parameter values as a preliminary run, then refining the grid as deemed appropriate.
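As an illustration of this coarse-then-fine strategy, here is a sketch using tune.splsda() on the classification data from above (all grid values are arbitrary examples):
# preliminary run over a coarse grid of keepX values
coarse <- tune.splsda(X, Y, ncomp = 2,
                      test.keepX = c(5, 25, 50, 100),
                      folds = 5, nrepeat = 3)
# refine around the selected value, e.g. if 25 was chosen on component 1
fine <- tune.splsda(X, Y, ncomp = 2,
                    test.keepX = seq(15, 35, by = 5),
                    folds = 5, nrepeat = 3)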
tune(): PCA and CCA
The tune.pca() function can be used to calculate and visualise the explained variance for each component.
# load data
data(liver.toxicity)
# run tune.pca to calculate explained variance
tune.res <- tune.pca(liver.toxicity$gene, center = TRUE, scale = TRUE)
# plot the explained variance for each principal component
plot(tune.res)
FIGURE 6: Bar chart showing the explained variance of the principal components. The first component explains the most variance and subsequent components explain smaller and smaller proportions of the variance.
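No single 'optimal' number of components is returned here; a common approach is to pick the number of components at the elbow of the plot and fit the final PCA, as in this sketch (the choice of ncomp = 3 is purely illustrative):
# sketch: run the final PCA with a number of components chosen from the plot
pca.res <- pca(liver.toxicity$gene, ncomp = 3, center = TRUE, scale = TRUE)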
In the tune.rcc() function, a coarse grid of possible values for λ1 and λ2 is input to assess every possible pair of parameters instead of keepX. The default grid is defined with 5 equally-spaced discretisation points of the interval [0.001, 1] for each dimension, but this can be changed.
# load data
data(nutrimouse)
X <- nutrimouse$lipid
Y <- nutrimouse$gene
# run tuning on CCA model using default grid values
tune.res <- tune.rcc(X, Y,
grid1 = seq(0.001, 1, length = 5),
grid2 = seq(0.001, 1, length = 5))
# print optimal lambda values
tune.res$opt.lambda1
## [1] 0.25075
tune.res$opt.lambda2
## [1] 0.25075
# plot output
plot(tune.res)
FIGURE 7: Heatmap representing the first canonical correlation on the left-out sets using cross-validation for all tested combinations of Lambda1 and Lambda2. The highest cross-validation score corresponds to the optimal Lambda1 and Lambda2 values.
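The tuned penalties can then be used to fit the final rCCA model. A minimal sketch (ncomp = 2 is an illustrative choice):
# sketch: fit the final rCCA model with the tuned lambda values
rcc.res <- rcc(X, Y, ncomp = 2,
               lambda1 = tune.res$opt.lambda1,
               lambda2 = tune.res$opt.lambda2)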