Assessing final model performance

NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analysis; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.

Install development version of mixOmics:

devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")

Once you have tuned your parameters and built your final model, it’s time to assess its performance and potentially use it to predict outcomes for new data.

Calculating performance metrics using perf.assess()

The perf.assess() function assesses the performance of supervised statistical methods using cross-validation. This is performed in a similar way to tune(), but instead of building numerous models with different parameters and assessing their performance, the assessment is performed only on your final model. See ‘Performance Assessment and Hyperparameter Tuning’ for more details on the parameters used to run cross-validation.

Currently, perf.assess() can be used to assess the performance of the following models built in mixOmics:

  • (s)PLS-DA
  • (s)PLS
  • block (s)PLS-DA (i.e. DIABLO)
  • MINT (s)PLS-DA

perf.assess() outputs a list of performance assessment metrics which vary depending on the model in question. For details on what each of these metrics is, refer to the help pages (?perf.assess) and ‘Performance Assessment and Hyperparameter Tuning’.
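
The elements of this list can be inspected directly once perf.assess() has been run; for instance (perf.res here is a placeholder for the object returned by perf.assess() in the examples that follow):

# inspect which performance metrics are stored in a perf.assess() output
# perf.res is a placeholder for the object returned in the examples below
names(perf.res) # names of the returned metrics
str(perf.res, max.level = 1) # overview of the structure of the output list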

Classification example

In this example, perf.assess() is used to assess the performance of the PLS-DA model in which gene expression (X) is used to classify samples into their tumour classes (Y).

# load in the mixOmics package
library(mixOmics)

# load in the data
data(srbct)
X <- srbct$gene # gene expression matrix
Y <- srbct$class # classes of tumour samples - categorical outcome of model

# build the final plsda model (assume parameters were tuned)
srbct.plsda <- plsda(X, Y, ncomp = 10)

# run performance assessment on the model using appropriate parameters, using distance metric 'max.dist'
print(nrow(X)) # 63 samples in the data
## [1] 63
perf.res <- perf.assess(srbct.plsda, dist = "max.dist",
                         validation = "Mfold", # because number of samples > 10
                         folds = 5, # this would give >12 samples per fold
                         nrepeat = 10, # for real analysis this would be increased to 50-100
                         seed = 12) # for reproducibility with example, remove for own analysis 

# print final overall error rate of model
perf.res$error.rate$overall[,'max.dist']
## [1] 0.01587302
# print final error rate per class
perf.res$error.rate.class$max.dist
##        EWS         BL         NB        RMS 
## 0.04347826 0.00000000 0.00000000 0.00000000

We can see from the outputs of perf.assess() that our final PLS-DA model performs well because it has a low overall misclassification error rate. The error rate per class shows that, across the 5 folds and 10 repeats of cross-validation, the model only misclassified samples from the EWS tumour class, whilst BL, NB and RMS tumours were always correctly identified.

As well as the overall misclassification error rate, perf.assess() calculates the Balanced Error Rate (BER), which is appropriate for cases with an unbalanced number of samples per class (for more details see ‘Introduction to Performance Assessment and Hyperparameter Tuning’). Our data does have unbalanced class sizes (see below), so BER is the more appropriate metric here; in this example the BER is slightly lower than the overall error rate.

# print distribution of samples across classes
summary(Y)
## EWS  BL  NB RMS 
##  23   8  12  20
# print final balanced error rate of model
perf.res$error.rate$BER[,'max.dist']
## [1] 0.01086957
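
Since BER is simply the average of the per-class error rates, we can verify this value directly from the per-class output shown above:

# BER is the average of the per-class error rates, which can be checked manually
mean(perf.res$error.rate.class$max.dist)
## [1] 0.01086957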

Regression example

In this example, perf.assess() is used to assess the performance of the PLS model in which gene expression (X) is used to predict 10 clinical outputs (Y).

# load data for PLS model
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic

# how many Y variables are there
length(Y)
## [1] 10
# build PLS model
liver.pls <- pls(X, Y, ncomp = 5)

# assess model performance
perf.res <- perf.assess(liver.pls, 
                        validation = "Mfold", # because number of samples > 10
                         folds = 5, # this would give >12 samples per fold
                         nrepeat = 10, # for real analysis this would be increased to 50-100
                         seed = 12) # for reproducibility with example, remove for own analysis 


# print MSEP performance metric averaged across repeats and folds
perf.res$measures$MSEP$summary
##               feature comp     mean        sd
## 1           ALB.g.dL.    5 6.168735 9.3743886
## 2           ALP.IU.L.    5 1.359264 0.7531981
## 3           ALT.IU.L.    5 2.067335 3.1881475
## 4           AST.IU.L.    5 2.075162 3.1369310
## 5          BUN.mg.dL.    5 1.374110 1.0331901
## 6  Cholesterol.mg.dL.    5 3.013613 4.8495555
## 7        Creat.mg.dL.    5 2.949239 2.6933834
## 8           SDH.IU.L.    5 4.222506 7.3596578
## 9         TBA.umol.L.    5 2.079981 2.6525451
## 10           TP.g.dL.    5 6.152539 9.7279386

Here we printed the mean squared error of prediction (MSEP) metric, although several performance assessment metrics are generated by perf.assess() and can be inspected. MSEP is calculated by taking the average of the squared differences between the actual and predicted values. This metric is commonly used in regression analysis to assess model performance; a lower MSEP value indicates better model performance.
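
As a minimal illustration of the MSEP calculation (the values below are made up purely for demonstration):

# toy illustration of the MSEP calculation on made-up values
actual    <- c(2.0, 3.5, 1.0)
predicted <- c(2.5, 3.0, 1.5)
mean((actual - predicted)^2) # average of the squared prediction errors
## [1] 0.25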

We can see that the different output variables are predicted by our PLS model with differing degrees of accuracy; for example, ALB.g.dL. and TP.g.dL. have the highest MSEP values, so they are the most likely to be inaccurately predicted by the model.
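
Several other metrics are also computed by perf.assess() for PLS models; to see which ones are available for inspection, the elements of the measures slot can be listed:

# list which performance metrics were computed by perf.assess() for the PLS model
names(perf.res$measures)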

AUC and ROC

For classification models, AUC and ROC can be used as complementary performance metrics, for more details see ‘Introduction to Performance Assessment and Hyperparameter Tuning’.

AUC values using perf.assess()

The perf.assess() function outputs area under the curve (AUC) values for classification models. AUC (or AUROC) values close to 1 indicate near-perfect classification performance, while values close to 0.5 indicate that the model cannot discriminate between the Y classes.

# Calculate performance assessment with auroc
perf.res <- perf.assess(srbct.plsda, 
                          validation = "Mfold", folds = 5, nrepeat = 10,
                          auc = TRUE) # added this argument to calculate auc

# Print the AUC values on component 10
perf.res$auc$comp10
##                 AUC.mean      AUC.sd
## EWS vs Other(s)  1.00000 0.000000000
## BL vs Other(s)   1.00000 0.000000000
## NB vs Other(s)   1.00000 0.000000000
## RMS vs Other(s)  0.99801 0.001559523

The AUC values for our PLS-DA model with 10 components are very close to 1, indicating that the model can predict tumour classes from gene expression very well.

Making AUROC plots using auroc()

You can plot Receiver Operating Characteristic (ROC) curves using the auroc() function. Here we plot ROC curves for the previously built PLS-DA model, which predicts tumour classes based on gene expression.

# Plot ROC curve for the PLS-DA model with 5 components
auc.plsda <- auroc(srbct.plsda, roc.comp = 5, print = FALSE)

FIGURE 1: ROC curve for PLS-DA model on components 1 to 5. The diagonal line represents random guessing (no discriminative power). Each coloured line shows values for one of the four tumour classes in the PLS-DA model. As the lines are superimposed, only one can be seen.

Curves which are closer to the top-left corner suggest that the model performs well. As there is little difference between the performance of the model in predicting each of the four tumour classes, the lines are superimposed. If we create the same plot for the model with fewer components, we can see that the model performs worse, and now there is a difference in discriminative ability between the different tumour classes.

# Plot ROC curve for the PLS-DA model with 1 component
auc.plsda <- auroc(srbct.plsda, roc.comp = 1, print = FALSE)

FIGURE 2: ROC curve for PLS-DA model on component 1. The diagonal line represents random guessing (no discriminative power). Each coloured line shows values for one of the four tumour classes in the PLS-DA model.

Making predictions on new data using predict()

When we have access to an external data set, prediction represents the ultimate aim: assessing in silico how well the model generalises to new data.

Currently, the predict() function can be applied to the following models:

  • (s)PLS-DA
  • (s)PLS
  • MINT (s)PLS-DA
  • block (s)PLS

Prediction is achieved by applying the predict() function to the model and inputting a test X dataset from which to predict the output Y. The predict() function also takes a dist argument; see the ‘Distance Metrics’ page for more information on choosing this.
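
As a rough sketch of a typical call (here model and X.test are placeholders for a fitted mixOmics DA model and a new data matrix containing the same variables as the training X):

# sketch of a typical predict() call; 'model' and 'X.test' are placeholders
pred <- predict(model, newdata = X.test, dist = "max.dist")
# pred$predict holds the predicted values and, for DA models,
# pred$class holds the predicted class labels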

Classification example

Here we use the same data we used previously to construct a PLS-DA model, in which gene expression X is used to predict tumour class Y. As we don’t have real test data, we split the data before building the model: the model is trained on only 50 samples, and the remaining 13 samples are kept separate so they can be used for prediction.

## Load data
data(srbct)
X <- srbct$gene # gene expression data as the X matrix
Y <- srbct$class # tumour class data (categorical) as the Y matrix

# split data into training and test data
set.seed(12) # to reproducibly subsample the data into training and test
train <- sample(1:nrow(X), 50) # randomly select 50 samples in training
test <- setdiff(1:nrow(X), train) # rest is part of the test set

# store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test,]
Y.train <- Y[train]
Y.test <- Y[test]

# check the dimensions of the training and test data
dim(X.train) # 50 samples gene expression
## [1]   50 2308
summary(Y.train) # 50 samples tumour classification
## EWS  BL  NB RMS 
##  15   7  10  18
dim(X.test) # 13 samples gene expression
## [1]   13 2308
summary(Y.test) # 13 samples tumour classification
## EWS  BL  NB RMS 
##   8   1   2   2
# train the model on the training data
train.plsda.srbct <- plsda(X.train, Y.train, ncomp = 10)

# use the model to predict Ytest from the Xtest data
# ie the tumour classes of the 13 test samples
predict.plsda.srbct <- predict(train.plsda.srbct, X.test)

# evaluate the prediction accuracy of model with 10 components
predict.comp <- predict.plsda.srbct$class$max.dist[,10]
table(factor(predict.comp, levels = levels(Y)), Y.test)
##      Y.test
##       EWS BL NB RMS
##   EWS   7  0  0   0
##   BL    0  1  0   0
##   NB    0  0  2   0
##   RMS   1  0  0   2

Here we used predict() to predict the tumour classification of 13 test samples based on their gene expression. The results of this prediction for a model with 10 components using max.dist were extracted using predict.plsda.srbct$class$max.dist[,10].

The real tumour classifications for these 13 samples are known (Y.test). Therefore we can assess how accurate these predictions are by making a classification matrix. In the classification matrix, the real Y.test values are along the columns and the predicted values along the rows.

This matrix shows that only 1 of the 13 test samples was incorrectly classified by the model: an EWS sample was predicted to be RMS.

To test how prediction accuracy changes when we reduce the number of components, we can extract the tumour class predictions for the PLS-DA model with 1 component:

# evaluate the prediction accuracy of model with 1 component
predict.comp <- predict.plsda.srbct$class$max.dist[,1]
table(factor(predict.comp, levels = levels(Y)), Y.test)
##      Y.test
##       EWS BL NB RMS
##   EWS   1  0  1   0
##   BL    0  1  1   0
##   NB    0  0  0   0
##   RMS   7  0  0   2

With the 1-component model we can see that the model performs poorly. Only 4 of the 13 test samples were correctly classified, compared to 12 samples for the model with 10 components. The problem mainly comes from EWS samples being incorrectly classified as RMS, which occurred for 7 of the test samples.
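
The number of correctly classified samples can also be counted directly from the diagonal of the confusion matrix, for example:

# count correct classifications from the confusion matrix of the 1-component model
conf.mat <- table(factor(predict.comp, levels = levels(Y)), Y.test)
sum(diag(conf.mat)) # number of correctly classified test samples
## [1] 4
sum(diag(conf.mat)) / length(Y.test) # proportion of test samples correctly classified
## [1] 0.3076923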

Regression example

Here we use a different dataset, where numerical values for exercise metrics (X) are used to predict physiological metrics (Y) using a PLS model. Similar to the classification example, we don’t have test data. Instead of splitting the data into training and test like we did before, here we artificially create two individuals to be the test data.

# Load data
data(linnerud)
X <- linnerud$exercise # numerical values for chins, situps and jumps
Y <- linnerud$physiological # numerical values for weight, waist and pulse

# Build PLS model
linn.pls <- pls(X, Y, ncomp = 2, mode = "classic")

# Create test data artificially
indiv1 <- c(200, 40, 60)
indiv2 <- c(190, 45, 45)
X.test <- rbind(indiv1, indiv2)
colnames(X.test) <- colnames(X)

# Predict the Y outcome for new X test data
predict.linn.pls <- predict(linn.pls, X.test)

# Print predicted Y values for two new test individuals
print(predict.linn.pls$predict)
## , , dim1
## 
##          Weight    Waist    Pulse
## indiv1 11.95177 7.140617 75.80745
## indiv2 21.37241 8.738121 74.69339
## 
## , , dim2
## 
##           Weight     Waist    Pulse
## indiv1 -28.98181 -3.864518 86.34437
## indiv2 -19.33882 -2.207234 85.17307

Here, based on our artificially created individuals with specific X values for ‘Chins’, ‘Situps’ and ‘Jumps’, our model has predicted the Y outcomes ‘Weight’, ‘Waist’ and ‘Pulse’. This example demonstrates how predict() can be used to output Y values for samples where we only know X. As we don’t have the ‘real’ output values, we cannot check the accuracy of these predictions.
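
If the true outcomes for these test individuals were known, prediction accuracy could be quantified in the same spirit as the MSEP metric described above. A hypothetical sketch, assuming Y.test were a matrix of known ‘Weight’, ‘Waist’ and ‘Pulse’ values for the two individuals:

# hypothetical check: if true outcomes Y.test were available for the test samples,
# the error of the 2-component predictions could be computed as
# colMeans((Y.test - predict.linn.pls$predict[, , 2])^2) # MSEP per Y variable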