NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analysis; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.
Install development version of mixOmics:
devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")
Once you have chosen your statistical model, you need to tune the parameters of the model (Step 2 in ‘Performance Assessment and Hyperparameter Tuning’). The tune() function uses cross-validation to aid the choice of parameters for model building.
Currently, the tune() function can be applied to the models listed in Figure 1 below.
We are currently working on expanding the functionality of tune() to also work on other models, such as block (s)PLS and other MINT variants. For now, the parameters used in these other models have to be chosen by the user without tuning. To see which parameters you need to input for these models, look at the help pages, e.g. by running ?block.pls.
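For example, here is a minimal sketch of fitting a block PLS model with user-chosen parameters, using the breast.TCGA example data shipped with mixOmics (the data set and the choice of ncomp = 2 here are purely illustrative):
library(mixOmics)
# example multi-block data included in mixOmics
data(breast.TCGA)
X <- list(mRNA = breast.TCGA$data.train$mrna,
          miRNA = breast.TCGA$data.train$mirna)
Y <- breast.TCGA$data.train$protein
# ncomp is chosen by the user here, as block.pls cannot yet be tuned
block.pls.res <- block.pls(X, Y, ncomp = 2)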
Which parameters you need to tune depends on your model. For all models (except sPCA and rcc) you need to decide how many components to include. For sparse models you need to choose how many variables to keep, and for classification (i.e. -DA) models you need to choose which distance metric to use.
FIGURE 1: Table showing which parameters (ncomp, dist and keepX/keepY) need to be tuned for which models. Note this table only includes methods whose parameters can be tuned using tune().
All three parameters (components, variables and distance metrics) can be tuned using the function tune(). tune() runs cross-validation (as described in ‘Performance Assessment and Hyperparameter Tuning’) to identify the parameter combinations which give the best performing model. It is possible to tune all three parameters simultaneously, however this results in the generation and testing of many models, which can be computationally intensive and take a long time. Instead, we suggest first tuning the number of components (and optionally the distance metric), and then tuning the number of variables to keep.
Tuning the number of components for non-sparse methods is simply done by calling tune() on the input X and Y data and choosing how many components to test with ncomp.
As described in ‘Performance Assessment and Hyperparameter Tuning’, there are parameters you can use to control how cross-validation is run; here we have arbitrarily chosen to use 5 folds and to repeat the CV 3 times.
Here are some examples of how to tune the number of components to keep, and in the case of classification models also the distance metric to use. Note that tuning can be done by calling the specific tune function directly (e.g. tune.pls()) or by calling the tune() wrapper and specifying the method (e.g. tune(method = "pls")).
PLS (regression model) example
library(mixOmics)
# load data
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# run tuning on 5 components, may take a few seconds
tune.res <- tune.pls(X, Y, ncomp = 5,
folds = 5, nrepeat = 3,
seed = 12) # for reproducibility with example, remove for own analysis
# plot output
plot(tune.res)
FIGURE 2: Performance assessment metric MSEP plotted for each output Y variable across 5 components for PLS method. The plot suggests that adding more than 1 component does not significantly improve model accuracy.
PLS-DA (classification model) example
# load data
data(breast.tumors)
X <- breast.tumors$gene.exp
Y <- as.factor(breast.tumors$sample$treatment)
# run tuning on 5 components, may take a few seconds
tune.res <- tune(X, Y, ncomp = 5,
method = "plsda",
folds = 5, nrepeat = 3,
dist = "all",
seed = 12)
# plot output
plot(tune.res)
FIGURE 3: Classification error rate plotted across 5 components and 3 distance metrics for PLS-DA method. The plot suggests that adding more than 1 component does not significantly improve model accuracy, and that the distance metric max.dist is most accurate at one component.
For classification models, as well as plotting the performance metrics across multiple components, tune() calculates the optimal number of components per distance metric, which can be extracted using tune.res$choice.ncomp. Here, as suggested by the plot, the optimal number of components is 1. The plot also suggests that the best distance metric to use is max.dist, as it gives the lowest classification error rate for the 1 component model. See here for more information on the three different distance metrics used in mixOmics.
tune.res$choice.ncomp
## max.dist centroids.dist mahalanobis.dist
## overall 1 1 1
## BER 1 1 1
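These tuned values can then be carried forward into the final model. A minimal sketch (the choice of ncomp = 1 simply follows the tuning output above):
# sketch: fit the final PLS-DA model with the tuned number of components
final.plsda <- plsda(X, Y, ncomp = 1)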
When tuning sparse models (e.g. sPLS-DA), the same procedure is used, however the number of variables to test (test.keepX) should be set to NULL to tune only the number of components. Note that when test.keepX is set to NULL, tune() will print a message to state that it is only tuning for the number of components.
sPLS-DA (classification model) example
# run tuning on 5 components, may take a few seconds
tune.res <- tune(X, Y, ncomp = 5,
method = "splsda",
folds = 5, nrepeat = 3,
test.keepX = NULL,
dist = "all",
seed = 12)
## [1] "test.keepX set to NULL, tuning only for number of components..."
# print optimal ncomp
tune.res$choice.ncomp
## max.dist centroids.dist mahalanobis.dist
## overall 1 1 1
## BER 1 1 1
For sparse methods, lasso penalisation is applied to reduce the loading values of some features to zero, therefore only retaining features which are relevant to the model. The tune() function can be used to identify the optimal number of features to retain. By inputting a grid of variable numbers, e.g. test.keepX = c(5, 10, 15), cross-validation is performed on each number of variables for each component. The tune() function returns the number of variables to select that achieves the best predictive performance for each of the components that are progressively added to the model, while favouring a small signature.
As shown above, the optimal number of components for the example sPLS-DA model is 1 and the optimal distance metric is max.dist. These optimised parameters can now be used to reduce the number of models that need to be tested when choosing the number of variables to keep. Here we will increase ncomp to 2 for plotting reasons.
sPLS-DA (classification model) example
# run tune on three different keepX values (may take a few seconds)
tune.res <- tune(method = "splsda", X, Y,
ncomp = 2,
dist = "max.dist",
folds = 5, nrepeat = 3,
test.keepX = c(5, 10, 15),
seed = 12)
# plot the resulting output per test.keepX value
plot(tune.res)
FIGURE 4: Balanced error rate plotted across 2 components using 5, 10 and 15 X variables for the sPLS-DA method. The plot suggests that the sPLS-DA models with one component are more accurate than those with 2 components across all numbers of selected features. For the model made with 1 component, keeping 10 features produces the most accurate model; for the model made with 2 components, keeping 15 features produces the most accurate model.
Like ncomp, the optimal number of variables to keep (choice.keepX) can be extracted from the tune() output. In agreement with the plot above, this is 10 for component 1 and 15 for component 2.
tune.res$choice.keepX
## comp1 comp2
## 10 15
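These tuned values can be passed directly to the final model. A minimal sketch, simply reusing the tuning results from above:
# sketch: fit the final sPLS-DA model with the tuned keepX values
final.splsda <- splsda(X, Y, ncomp = 2,
                       keepX = tune.res$choice.keepX)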
sPLS (regression model) example
For sPLS models, both the number of X and Y variables can be tuned. In this case, the optimal combination of keepX and keepY is identified.
# set up data
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# tune across a grid of test.keepX and test.keepY for component 1
tune.res <- tune.spls(X, Y, ncomp = 1,
test.keepX = c(5, 10, 15),
test.keepY = c(3, 6, 8),
folds = 5, nrepeat = 3)
# print the optimal keepX and keepY values for component 1
tune.res$choice.keepX
## comp1
## 5
tune.res$choice.keepY
## comp1
## 3
# plot the performance assessment for each keepX and keepY combination
plot(tune.res)
FIGURE 5: Tuning plot showing correlation coefficients between t and u components (a measure of performance) of the sPLS model across different numbers of selected X and Y variables. The higher the correlation coefficients, the better the model performed. The size of the dots corresponds to the mean correlation coefficient across repeats and the colour to the standard deviation. The coordinate outlined with a green square represents the optimal combination of keepX and keepY, which is 5 X variables and 3 Y variables.
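As with sPLS-DA, the tuned values can be fed into the final model. A minimal sketch using the results above:
# sketch: fit the final sPLS model with the tuned keepX and keepY values
final.spls <- spls(X, Y, ncomp = 1,
                   keepX = tune.res$choice.keepX,
                   keepY = tune.res$choice.keepY)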
Choosing test.keepX and test.keepY values
Any number of test.keepX and test.keepY values can be input for tuning. The more values input, the more models will be built and tested, increasing computational time. When choosing which numbers of X and Y variables to test, consider the type of results expected for follow-up analyses. If we are interested in identifying biomarkers of disease, a small signature is preferable for knock-out wet-lab validation (small keepX and keepY). If we are interested in obtaining a better understanding of the pathways playing a role in a biological system, and plan to validate the signature with gene ontology, then a larger signature might be preferable (larger keepX and keepY). We advise starting with a coarse grid of parameter values as a preliminary run, then refining the grid as deemed appropriate.
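As an illustration of this coarse-then-fine strategy, here is a sketch using tune.splsda() on the classification data from above (all grid values are arbitrary examples):
# preliminary run over a coarse grid of keepX values
coarse <- tune.splsda(X, Y, ncomp = 2,
                      test.keepX = c(5, 25, 50, 100),
                      folds = 5, nrepeat = 3)
# refine around the selected value, e.g. if 25 was chosen on component 1
fine <- tune.splsda(X, Y, ncomp = 2,
                    test.keepX = seq(15, 35, by = 5),
                    folds = 5, nrepeat = 3)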
tune(): PCA and CCA
The tune.pca() function can be used to calculate and visualise the explained variance for each component.
# load data
data(liver.toxicity)
# run tune.pca to calculate explained variance
tune.res <- tune.pca(liver.toxicity$gene, center = TRUE, scale = TRUE)
# plot the explained variance for each principal component
plot(tune.res)
FIGURE 6: Bar chart showing the explained variance of the principal components. The first component explains the most variance and subsequent components explain smaller and smaller proportions of the variance.
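No single 'optimal' number of components is returned here; a common approach is to pick the number of components at the elbow of the plot and fit the final PCA, as in this sketch (the choice of ncomp = 3 is purely illustrative):
# sketch: run the final PCA with a number of components chosen from the plot
pca.res <- pca(liver.toxicity$gene, ncomp = 3, center = TRUE, scale = TRUE)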
In the tune.rcc() function, a coarse grid of possible values for λ1 and λ2 is input to assess every possible pair of parameters instead of keepX. The default grid is defined with 5 equally-spaced discretisation points of the interval [0.001, 1] for each dimension, but this can be changed.
# load data
data(nutrimouse)
X <- nutrimouse$lipid
Y <- nutrimouse$gene
# run tuning on CCA model using default grid values
tune.res <- tune.rcc(X, Y,
grid1 = seq(0.001, 1, length = 5),
grid2 = seq(0.001, 1, length = 5))
# print optimal lambda values
tune.res$opt.lambda1
## [1] 0.25075
tune.res$opt.lambda2
## [1] 0.25075
# plot output
plot(tune.res)
FIGURE 7: Heatmap representing the first canonical correlation on the left-out sets using cross-validation for all tested combinations of Lambda1 and Lambda2. The highest cross-validation score corresponds to the optimal Lambda1 and Lambda2 values.
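The tuned penalties can then be used to fit the final rCCA model. A minimal sketch (ncomp = 2 is an illustrative choice):
# sketch: fit the final rCCA model with the tuned lambda values
rcc.res <- rcc(X, Y, ncomp = 2,
               lambda1 = tune.res$opt.lambda1,
               lambda2 = tune.res$opt.lambda2)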