NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analysis; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.
Install development version of mixOmics:
devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")
For all types of data and statistical models, we recommend following these steps to ensure you have created the best possible statistical model:
1. Decide which statistical model is appropriate for your question and data (see Selecting your Method)
2. Tune the model’s parameters (using tune())
3. Build your optimised model
4. If you are using a supervised model, assess the performance of your optimised model (using perf.assess(), auroc() and predict())
Step (2) and step (4) are highly connected, because in order to tune your model’s parameters you need to be able to test how good your model is (i.e. assess its performance). This page will give an overview of the theory of how model performance assessment works and the key concepts that underpin the functions tune(), perf.assess(), auroc() and predict().
For more practical details on how to perform parameter tuning (Step 2) click here and for how to assess your final model (Step 4) click here.
All supervised models are designed to predict outcome Y based on input X. For example, in the sPLS-DA case study using the SRBCT dataset the input X data is gene expression, which is used to predict Y which is the tumour class.
library(mixOmics)
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the tumour class labels as the Y vector
dim(X) # check the dimensions of the X dataframe
## [1] 63 2308
summary(Y) # check the distribution of class labels
## EWS BL NB RMS
## 23 8 12 20
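The performance assessment examples further down this page refer back to a basic sPLS-DA model built on this data. The ncomp and keepX values below are arbitrary placeholders used for illustration only; in practice they would be chosen with tune() (see the tuning page).
# build a basic sPLS-DA model on the SRBCT data;
# ncomp and keepX are placeholder values chosen for illustration only
srbct.splsda <- splsda(X, Y, ncomp = 3, keepX = c(50, 50, 50))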
To check how well these models perform, we measure how accurately the model predicts outcome Y based on input X. This can be done using either 1) test data or 2) cross-validation.
To assess the performance of a supervised model, you start by training it on a labelled training dataset where both input (X) and outputs (Y) are known. After training, the model is tested on a separate dataset called a test dataset, which also has known X and Y values. By feeding only the X inputs from the test data into the model, it generates predictions for Y, which are then compared to the actual Y values. The accuracy of these predictions shows how well the model can generalize to new, unseen data.
Using the SRBCT example, you first use training data to train a model to predict tumour class based on gene expression. You then take a new dataset the model hasn’t seen before and feed in its gene expression data to see which tumour classes the model predicts. Finally, you check whether the predicted tumour classes are accurate.
FIGURE 1: Basic process of how statistical models are assessed, illustrated here with a classification-type model, e.g. PLS-DA, where Y is categorical. Training X and Y data are used to train a model. Performance assessment of the model is achieved using test X and Y data. The test X data is fed into the model, which predicts corresponding Y values; these can then be compared to the ‘true’ test Y values.
In mixOmics, predicting test Y output from test X input data can be performed using the function predict().
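Below is a minimal sketch of this train/test workflow on the SRBCT data. The split into training and test samples is an arbitrary random split made here purely for illustration.
set.seed(42) # illustrative train/test split (arbitrary, for demonstration only)
test.idx <- sample(1:nrow(X), 15)
X.train <- X[-test.idx, ]; Y.train <- Y[-test.idx]
X.test <- X[test.idx, ]; Y.test <- Y[test.idx]
# train a PLS-DA model on the training data only
train.plsda <- plsda(X.train, Y.train, ncomp = 3)
# predict the tumour class of the test samples from their gene expression
test.predict <- predict(train.plsda, newdata = X.test, dist = "max.dist")
# confusion matrix of predicted vs true classes, using all 3 components
table(factor(test.predict$class$max.dist[, 3], levels = levels(Y)), Y.test)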
In many cases we don’t have an independent test dataset, so instead we artificially create one from the training dataset using cross-validation. There are different kinds of cross-validation, but all methods split up the data so that a portion of it is used as the training data and the rest is used as test data.
In mixOmics, the function perf.assess() uses cross-validation to assess the performance of your final model, and tune() uses cross-validation to assess the performance of models with different parameters to help choose the optimal parameters for a model.
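As a hedged sketch of what tuning looks like in practice: the call below mirrors the interface of tune.splsda() in the stable release (test.keepX, validation, folds, nrepeat, dist, measure); the exact arguments accepted by the development tune() may differ, and the test.keepX grid is an arbitrary example.
# tune the number of variables kept per component via cross-validation
tune.res <- tune(method = "splsda", X = X, Y = Y, ncomp = 3,
                 test.keepX = c(5, 10, 25, 50, 100),
                 validation = "Mfold", folds = 4, nrepeat = 10,
                 dist = "max.dist", measure = "BER")
tune.res$choice.keepX # keepX values selected by cross-validation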
In mixOmics, two types of cross-validation are used: repeated M-fold (Mfold) or ‘leave-one-out’ (loo).
In the repeated M-fold method, the data is divided into equally sized folds, with one fold used as the test data and the rest as training data. This process is repeated multiple times to account for the randomness in data partitioning, ensuring a reliable measure of model performance. Mfold cross-validation can be customised using the arguments nrepeat and folds (see Figure 2 and below).
FIGURE 2: Schematic showing how Mfold cross-validation works when folds = 4 and nrepeat = 3. The data is represented as a square which is split into 4 folds, each of which is used in turn as the test data with the remaining three folds used as training data. The partitioning of the data into folds is repeated 3 times. Note that the partition of test and training subsets in the data is shown here as organised squares and rectangles for visual clarity, but in reality this partition is random. The randomness is why the cross-validation approach is repeated multiple times to achieve a stable output.
In leave-one-out cross-validation, each sample (or each study, for MINT models) in turn is set aside as the test data. The model is trained on the remaining data and then applied to the left-out test set. To assess the model’s performance, the correlation between the components from the test data and the full data is calculated.
For MINT-(s)PLS-DA models, cross-validation can only be performed using loo; for other models you can decide whether to run Mfold or loo cross-validation. loo is recommended for small numbers of samples (<10) and Mfold for larger data sets.
We advise choosing a fold number such that each fold contains at least 5-6 samples (e.g. if you have 20 samples, do not choose more than 4 folds). 10-fold cross-validation is often used for large studies, whereas folds = 3 to 5 is used for smaller studies.
nrepeat dictates how many times the cross-validation process is repeated; larger nrepeat values will increase computational time. We advise including between 50-100 repeats in the final stage of evaluation. Note that the randomness of cross-validation can be fixed by setting the seed parameter, however this is not advised as a substitute for increasing nrepeat.
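For example, a final assessment of the sPLS-DA model built above could look like the call below. This sketch assumes perf.assess() takes the same cross-validation arguments as the long-standing perf() function (validation, folds, nrepeat, seed); the exact structure of the returned object may differ in the development version.
# assess the final model with repeated M-fold cross-validation;
# 4 folds keeps at least 5-6 of the 63 SRBCT samples in each fold,
# and nrepeat = 50 follows the advice above for a final evaluation
perf.res <- perf.assess(srbct.splsda,
                        validation = "Mfold", folds = 4, nrepeat = 50,
                        seed = 12) # for reproducibility only, not a substitute for more repeats
perf.res # inspect the cross-validated error rates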
Whether model performance is assessed via test data or cross-validation, there are several metrics that can be examined.
For classification problems (i.e. -DA models), performance assessment outputs include overall misclassification error rate (ER) and the Balanced Error Rate (BER). BER is appropriate for cases with an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. Therefore, contrary to ER, BER is less biased towards majority classes during the evaluation.
These two metrics are calculated across three different distance measures: max.dist, centroids.dist and mahalanobis.dist (see Distance Metrics).
Also used for classification problems, Area Under the Curve (AUC) can be used for performance evaluation. AUC is a commonly used measure to evaluate a classifier’s discriminative ability (the closer to 1, the better the prediction). It incorporates measures of sensitivity and specificity for every possible cut-off of the predicted outcomes. When the number of classes in the outcome is greater than two, we use the one-vs-all (other classes) comparison.
Note that ROC and AUC may not be particularly insightful, because the prediction thresholds used are based on the specific distance metric chosen, so these metrics should be used to complement other performance metrics.
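The auroc() function computes these AUC values (and plots the corresponding ROC curves) directly from a fitted -DA model; for the four SRBCT classes, each curve is a one-vs-all comparison. For example:
# ROC curves and AUC values for the sPLS-DA model, using 2 components
auc.splsda <- auroc(srbct.splsda, roc.comp = 2)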
For regression problems (i.e. (s)PLS), the evaluation of performance is not as straightforward. Currently, mixOmics uses measures of accuracy based on errors (MSEP, RMSEP), R2 and Q2 for sPLS2 models, and correlation coefficients between components and the Residual Sum of Squares (RSS) for sPLS1 models.
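As a hedged illustration of obtaining these regression measures (again assuming perf.assess() mirrors the cross-validation arguments of perf()), an sPLS2 model could be assessed on the liver.toxicity data shipped with mixOmics; the keepX and keepY values below are arbitrary placeholders.
data(liver.toxicity)
X.reg <- liver.toxicity$gene    # gene expression (input X)
Y.reg <- liver.toxicity$clinic  # continuous clinical measurements (output Y)
# sPLS2 model (multivariate Y); keepX and keepY are placeholder values
liver.spls <- spls(X.reg, Y.reg, ncomp = 2, keepX = c(50, 50), keepY = c(5, 5))
# cross-validated regression measures such as MSEP, RMSEP, R2 and Q2
perf.spls <- perf.assess(liver.spls, validation = "Mfold", folds = 5, nrepeat = 10)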