Performance Assessment and Hyperparameter Tuning

NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analyses; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.

Install the development version of mixOmics:

devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")

For all types of data and statistical models, we recommend following these steps to ensure you have created the best possible statistical model:

  1. Decide which statistical model is appropriate for your question and data (see Selecting your Method)

  2. Tune the model’s parameters (using tune())

  3. Build your optimised model

  4. If you are using a supervised model, assess the performance of your optimised model (using perf.assess(), auroc() and predict())

Steps (2) and (4) are closely connected: in order to tune your model's parameters, you need to be able to test how good your model is (i.e. assess its performance). This page gives an overview of the theory behind model performance assessment and the key concepts that underpin the functions tune(), perf.assess(), auroc() and predict().

For more practical details on how to perform parameter tuning (Step 2) click here and for how to assess your final model (Step 4) click here.

How performance assessment works

All supervised models are designed to predict outcome Y based on input X. For example, in the sPLS-DA case study using the SRBCT dataset, the input X data is gene expression, which is used to predict Y, the tumour class.

library(mixOmics)
data(srbct) # extract the small round blue cell tumour data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the tumour class labels as the Y outcome
dim(X) # check the dimensions of the X dataframe
## [1]   63 2308
summary(Y) # check the distribution of class labels
## EWS  BL  NB RMS 
##  23   8  12  20

To check how well these models perform, we measure how accurately the model predicts outcome Y based on input X. This can be done using either 1) test data or 2) cross-validation.

1) Using test data

To assess the performance of a supervised model, you start by training it on a labelled training dataset where both inputs (X) and outputs (Y) are known. After training, the model is tested on a separate dataset, called a test dataset, which also has known X and Y values. When only the X inputs from the test data are fed into the model, it generates predictions for Y, which are then compared to the actual Y values. The accuracy of these predictions shows how well the model can generalise to new, unseen data.

Using the SRBCT example, you first use training data to train a model to predict tumour class based on gene expression. You then take a new dataset the model hasn't seen before and feed its gene expression data into the model to see which tumour classes it predicts. You then check whether the predicted tumour classes are accurate.


FIGURE 1: Basic process of how statistical models are assessed, illustrated here with a classification-type model e.g. PLS-DA where Y is categorical. Training data X and Y are used to train a model. Performance assessment of the model is achieved using test X and Y data. The test X data is fed into the model, which predicts corresponding Y values; these can then be compared to the ‘true’ test Y values.


In mixOmics, predicting test Y output from test X input data can be performed using the function predict().
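
As a minimal illustrative sketch (not part of the case study itself), we can hold out a subset of the SRBCT samples as an artificial test set, train a PLS-DA model on the remaining samples, and compare the classes returned by predict() with the true classes. The 15-sample hold-out and ncomp = 2 are arbitrary choices for illustration only.

set.seed(42) # for a reproducible train/test split
test.idx <- sample(seq_len(nrow(X)), 15) # hold out 15 samples as an artificial test set
X.train <- X[-test.idx, ]
Y.train <- Y[-test.idx]
X.test <- X[test.idx, ]
Y.test <- Y[test.idx]

plsda.train <- plsda(X.train, Y.train, ncomp = 2) # train on the training data only
pred <- predict(plsda.train, newdata = X.test, dist = "max.dist")

# compare the classes predicted using both components against the true test classes
table(predicted = pred$class$max.dist[, 2], truth = Y.test)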

2) Using cross-validation

In many cases we don’t have an independent test dataset, so instead we artificially create one from the training dataset using cross-validation. There are different kinds of cross-validation, but all methods split up the data so that a portion of it is used as the training data and the rest is used as test data.

In mixOmics, the function perf.assess() uses cross-validation to assess the performance of your final model, and tune() uses cross-validation to assess the performance of models with different parameters, to help choose the optimal parameters for a model.
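
As a hedged sketch of how these two functions are called (perf.assess() is part of the development version described above, and its arguments are assumed here to follow the same validation / folds / nrepeat conventions as perf() and tune()):

final.splsda <- splsda(X, Y, ncomp = 2, keepX = c(50, 50)) # an example 'final' model

# assess the performance of this final model using cross-validation
perf.final <- perf.assess(final.splsda, validation = "Mfold", folds = 5, nrepeat = 10)

# compare candidate keepX values using cross-validation to help choose the optimal one
tune.splsda.out <- tune(method = "splsda", X = X, Y = Y, ncomp = 2,
                        test.keepX = c(10, 50, 100),
                        validation = "Mfold", folds = 5, nrepeat = 10)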

Cross-validation parameters

validation

In mixOmics, two types of cross-validation are used: repeated M-fold (Mfold) or ‘leave-one-out’ (loo).

In the repeated M-fold method, the data is divided into M equally sized folds; each fold is used in turn as the test data, with the remaining folds used as training data. This process is repeated multiple times to account for the randomness in data partitioning, ensuring a reliable measure of model performance. Mfold cross-validation can be customised using the arguments nrepeat and folds (see Figure 2 and below).


FIGURE 2: Schematic showing how Mfold cross-validation works when folds = 4 and nrepeat = 3. The data is represented as a square which is split into 4 folds, each of which is used in turn as the test data with the remaining three folds used as training data. The partitioning of the data into folds is repeated 3 times. Note that the partition of test and training subsets in the data is shown here as organised squares and rectangles for visual clarity, but in reality this partition is random. The randomness is why the cross-validation approach is repeated multiple times to achieve a stable output.

In leave-one-out cross-validation, each sample in turn is set aside as the test data. The model is trained on the remaining samples and then applied to the left-out test sample. (For MINT models, each study, rather than each sample, is left out in turn.) To assess the model's performance, the correlation between the components from the test data and those from the full data is calculated.

For MINT-(s)PLS-DA models, cross-validation can only be performed using loo; for all other models you can choose whether to run Mfold or loo cross-validation. loo is recommended for small numbers of samples (< 10), and Mfold for larger datasets.
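
As a sketch, switching between the two schemes is controlled by the validation argument (reusing the final.splsda model from the earlier sketch; under loo each sample is left out in turn, so folds does not need to be set):

perf.loo <- perf.assess(final.splsda, validation = "loo") # leave-one-out cross-validation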

folds

We advise choosing a number of folds such that each fold contains at least 5-6 samples (e.g. if you have 20 samples, do not choose more than 4 folds). 10-fold cross-validation is often used for large studies, whereas folds = 3 to 5 is typically used for smaller studies.

nrepeat

nrepeat dictates how many times the cross-validation process is repeated; larger nrepeat values will increase computational time. For the final stage of evaluation, we advise using between 50 and 100 repeats. Note that the randomness of cross-validation can be fixed by setting the seed parameter; however, this is not advised as a substitute for increasing nrepeat.
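
Putting the three parameters together, a final performance assessment might look like the sketch below (again reusing final.splsda; the seed argument is the one mentioned above, and nrepeat = 50 follows the advice for a final evaluation):

perf.final <- perf.assess(final.splsda,
                          validation = "Mfold", # repeated M-fold cross-validation
                          folds = 5,            # roughly 12-13 of the 63 SRBCT samples per fold
                          nrepeat = 50,         # repeat the random partitioning 50 times
                          seed = 42)            # fix the randomness for reproducibility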

Performance assessment outputs

Whether model performance is assessed via test data or cross-validation, there are several metrics that can be examined.

Misclassification error rates

For classification problems (i.e. -DA models), performance assessment outputs include overall misclassification error rate (ER) and the Balanced Error Rate (BER). BER is appropriate for cases with an unbalanced number of samples per class as it calculates the average proportion of wrongly classified samples in each class, weighted by the number of samples in each class. Therefore, contrary to ER, BER is less biased towards majority classes during the evaluation.

These two metrics are calculated across three different distance measures: max.dist, centroids.dist and mahalanobis.dist (see Distance Metrics).
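
To see how these error rates relate to a confusion matrix, the helper functions get.confusion_matrix() and get.BER() provided by mixOmics can be applied to predictions such as those from the test-set sketch earlier (reusing pred and Y.test from that example):

# cross-tabulate true against predicted classes (max.dist predictions, component 2)
confusion.mat <- get.confusion_matrix(truth = Y.test,
                                      predicted = pred$class$max.dist[, 2])
confusion.mat

get.BER(confusion.mat) # balanced error rate computed from this confusion matrix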

ROC and AUC

Also for classification problems, the Area Under the Curve (AUC) can be used for performance evaluation. AUC is a commonly used measure of a classifier's discriminative ability (the closer to 1, the better the prediction). It incorporates measures of sensitivity and specificity for every possible cut-off of the predicted outcomes. When the outcome has more than two classes, we use a one-vs-all comparison (each class versus all other classes).

Note that ROC and AUC may not be particularly insightful, because the prediction thresholds used are based on the specific distance metric chosen; these metrics should therefore be used to complement the other performance metrics.
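
As a brief sketch, auroc() can be applied directly to a fitted model (here the final.splsda model from earlier); roc.comp selects the component at which the ROC curves and AUC values are computed:

# ROC curves and per-class AUC values based on the first 2 components of the model
auc.splsda <- auroc(final.splsda, roc.comp = 2)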

Error rates and correlation coefficients between components

For regression problems (i.e. (s)PLS), the evaluation of performance is not as straightforward. Currently, mixOmics uses measures of accuracy based on prediction errors (MSEP, RMSEP), R2 and Q2 for sPLS2 models, and correlation coefficients between components and the Residual Sum of Squares (RSS) for sPLS1 models.
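
As a hedged sketch for the regression case, using the liver.toxicity example data shipped with mixOmics and assuming perf.assess() also handles (s)PLS models with the same cross-validation arguments (the exact names of the output slots may vary between versions):

data(liver.toxicity)
X.reg <- liver.toxicity$gene   # gene expression, the input X
Y.reg <- liver.toxicity$clinic # clinical variables, a multivariate outcome Y

spls2.model <- spls(X.reg, Y.reg, ncomp = 2, keepX = c(50, 50)) # an example sPLS2 model

# cross-validated error-based measures (MSEP, RMSEP), R2 and Q2 for the final model
perf.spls2 <- perf.assess(spls2.model, validation = "Mfold", folds = 5, nrepeat = 10)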