NOTE: This page details an updated model assessment and parameter tuning workflow which has recently been implemented in our development version of mixOmics. We invite you to install this development version and use it in your analysis; we are currently in beta-testing mode, so we welcome any feedback! If you prefer to use the stable version of mixOmics on Bioconductor, please ignore these pages.
Install development version of mixOmics:
devtools::install_github("mixOmicsTeam/mixOmics", ref = "6.31.4")
Once you have tuned your parameters and built your final model it’s time to assess the performance of your final model and potentially use it to predict outcomes for new data.
perf.assess()
The perf.assess() function assesses the performance of supervised statistical methods using cross-validation. This is performed in a similar way to tune(), but instead of building numerous models with different parameters, the assessment is performed only on your final model. See ‘Performance Assessment and Hyperparameter Tuning’ for more details on the parameters used to run cross-validation.
Currently, perf.assess()
can be used to assess the
performance of the following models built in mixOmics:
perf.assess() outputs a list of performance assessment metrics which vary depending on the model in question. For details on what each of these metrics is, refer to the help page (?perf.assess) and ‘Performance Assessment and Hyperparameter Tuning’.
In this example, perf.assess()
is used to assess the
performance of the PLS-DA model in which gene expression
(X
) is used to classify samples into their tumour classes
(Y
).
# load in the mixOmics package
library(mixOmics)
# load in the data
data(srbct)
X <- srbct$gene # gene expression matrix
Y <- srbct$class # classes of tumour samples - categorical outcome of model
# build the final plsda model (assume parameters were tuned)
srbct.plsda <- plsda(X, Y, ncomp = 10)
# run performance assessment on the model using appropriate parameters, using distance metric 'max.dist'
print(nrow(X)) # 63 samples in the data
## [1] 63
perf.res <- perf.assess(srbct.plsda, dist = "max.dist",
validation = "Mfold", # because number of samples > 10
folds = 5, # this would give >12 samples per fold
nrepeat = 10, # for real analysis this would be increased to 50-100
seed = 12) # for reproducibility with example, remove for own analysis
# print final overall error rate of model
perf.res$error.rate$overall[,'max.dist']
## [1] 0.01587302
# print final error rate per class
perf.res$error.rate.class$max.dist
## EWS BL NB RMS
## 0.04347826 0.00000000 0.00000000 0.00000000
We can see from the outputs of perf.assess() that our final PLS-DA model performs well because it has a low overall misclassification error rate. The error rate per class shows that, during cross-validation, the model only misclassified samples from the EWS tumour class, whilst BL, NB and RMS tumours were correctly identified across the 5 folds and 10 repeats.
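The full list of metrics returned by perf.assess() for this model can be inspected directly from the returned object; a minimal exploratory sketch (the exact elements depend on the model type):
# list the elements of the performance assessment object
names(perf.res)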
As well as the overall misclassification error rate, perf.assess() calculates the Balanced Error Rate (BER), which is more appropriate for cases with an unbalanced number of samples per class (for more details see ‘Introduction to Performance Assessment and Hyperparameter Tuning’). Our data are unbalanced (see the class sizes below), so BER is the more appropriate metric here; in this case the BER is also slightly lower than the overall error rate.
# print distribution of samples across classes
summary(Y)
## EWS BL NB RMS
## 23 8 12 20
# print final balanced error rate of model
perf.res$error.rate$BER[,'max.dist']
## [1] 0.01086957
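As a quick sanity check (assuming BER is computed here as the unweighted average of the per-class error rates), the BER value can be reproduced from the per-class error rates printed earlier:
# unweighted mean of the four per-class error rates reported above;
# this reproduces the BER value printed above (~0.0109)
mean(c(0.04347826, 0, 0, 0))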
In this example, perf.assess()
is used to assess the
performance of the PLS model in which gene expression (X
)
is used to predict 10 clinical outputs (Y
).
# load data for PLS model
data(liver.toxicity)
X <- liver.toxicity$gene
Y <- liver.toxicity$clinic
# how many Y variables are there
length(Y)
## [1] 10
# build PLS model
liver.pls <- pls(X, Y, ncomp = 5)
# assess model performance
perf.res <- perf.assess(liver.pls,
validation = "Mfold", # because number of samples > 10
folds = 5, # this would give >12 samples per fold
nrepeat = 10, # for real analysis this would be increased to 50-100
seed = 12) # for reproducibility with example, remove for own analysis
# print MSEP performance metric averaged across repeats and folds
perf.res$measures$MSEP$summary
## feature comp mean sd
## 1 ALB.g.dL. 5 6.168735 9.3743886
## 2 ALP.IU.L. 5 1.359264 0.7531981
## 3 ALT.IU.L. 5 2.067335 3.1881475
## 4 AST.IU.L. 5 2.075162 3.1369310
## 5 BUN.mg.dL. 5 1.374110 1.0331901
## 6 Cholesterol.mg.dL. 5 3.013613 4.8495555
## 7 Creat.mg.dL. 5 2.949239 2.6933834
## 8 SDH.IU.L. 5 4.222506 7.3596578
## 9 TBA.umol.L. 5 2.079981 2.6525451
## 10 TP.g.dL. 5 6.152539 9.7279386
Here we printed the mean squared error of prediction (MSEP) metric, although several performance assessment metrics are generated by perf.assess() and can be inspected. MSEP is calculated by taking the average of the squared differences between the observed and predicted values of each response variable. This metric is commonly used in regression analysis to assess the performance of models, and a lower MSEP value indicates better model performance.
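As an illustration of the calculation (using made-up numbers rather than values from the model above), the MSEP for a single response variable is simply the mean of the squared differences between observed and predicted values:
# toy MSEP calculation for one response variable (illustrative values only)
observed  <- c(2.1, 3.4, 1.8, 4.0)
predicted <- c(2.0, 3.0, 2.2, 3.5)
mean((observed - predicted)^2)  # lower values indicate better predictions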
We can see that the different output variables are predicted by our PLS model with differing degrees of accuracy; for example, ALB.g.dL. and TP.g.dL. have the highest MSEP values, so they are the most likely to be inaccurately predicted by the model.
For classification models, AUC and ROC can be used as complementary performance metrics; for more details see ‘Introduction to Performance Assessment and Hyperparameter Tuning’.
perf.assess()
The perf.assess() function outputs the area under the curve (AUC) for classification models. AUC (or AUROC) values close to 1 indicate perfect classification performance, while values close to 0.5 indicate that the model cannot discriminate between the Y classes.
# Calculate performance assessment with auroc
perf.res <- perf.assess(srbct.plsda,
validation = "Mfold", folds = 5, nrepeat = 10,
auc = TRUE) # added this argument to calculate auc
# Print the AUC values on component 10
perf.res$auc$comp10
## AUC.mean AUC.sd
## EWS vs Other(s) 1.00000 0.000000000
## BL vs Other(s) 1.00000 0.000000000
## NB vs Other(s) 1.00000 0.000000000
## RMS vs Other(s) 0.99801 0.001559523
The AUC values for our PLS-DA model with 10 components are very close to 1, indicating that the model can predict tumour classes from gene expression very well.
auroc()
You can plot Receiver Operating Characteristic (ROC) curves using the auroc() function. Here we plot ROC curves for the previously built PLS-DA model, which predicts tumour classes based on gene expression.
# Plot ROC curve for the PLS-DA model with 5 components
auc.plsda <- auroc(srbct.plsda, roc.comp = 5, print = FALSE)
FIGURE 1: ROC curve for the PLS-DA model on components 1 to 5. The diagonal line represents random guessing (no discriminative power). Each coloured line shows values for one of the four tumour classes in the PLS-DA model. As the lines are superimposed, only one can be seen.
Curves which are closer to the top-left corner suggest that the model performs well. As there is little difference between the performance of the model in predicting each of the four tumour classes, the lines are superimposed. If we create the same plot for the model with fewer components, we can see that the model performs worse and there is now a difference in discriminative ability between the different tumour classes.
# Plot ROC curve for the PLS-DA model with 1 component
auc.plsda <- auroc(srbct.plsda, roc.comp = 1, print = FALSE)
FIGURE 2: ROC curve for the PLS-DA model on component 1. The diagonal line represents random guessing (no discriminative power). Each coloured line shows values for one of the four tumour classes in the PLS-DA model.
predict()
When we have access to an external data set, prediction is the ultimate way to assess, in silico, the generalisability of the model.
Currently, the predict()
function can be applied to the
following models:
Prediction is achieved by applying the predict() function to the model, supplying a test X dataset from which to predict the output Y. The predict() function also takes a dist argument; see the ‘Distance Metrics’ page for more information on choosing this.
Here we use the same data we used previously to construct a PLS-DA model, in which gene expression X is used to predict tumour class Y. As we don’t have real test data, we split the data before building the model: the model is trained on 50 samples, and the remaining 13 samples are held out so they can be used for prediction.
## Load data
data(srbct)
X <- srbct$gene # gene expression data as the X matrix
Y <- srbct$class # tumour class data (categorical) as the Y matrix
# split data into training and test data
set.seed(12) # to reproducibly subsample the data into training and test
train <- sample(1:nrow(X), 50) # randomly select 50 samples in training
test <- setdiff(1:nrow(X), train) # rest is part of the test set
# store matrices into training and test set:
X.train <- X[train, ]
X.test <- X[test,]
Y.train <- Y[train]
Y.test <- Y[test]
# check the dimensions of the training and test data
dim(X.train) # 50 samples gene expression
## [1] 50 2308
summary(Y.train) # 50 samples tumour classification
## EWS BL NB RMS
## 15 7 10 18
dim(X.test) # 13 samples gene expression
## [1] 13 2308
summary(Y.test) # 13 samples tumour classification
## EWS BL NB RMS
## 8 1 2 2
# train the model on the training data
train.plsda.srbct <- plsda(X.train, Y.train, ncomp = 10)
# use the model to predict Ytest from the Xtest data
# ie the tumour classes of the 13 test samples
predict.plsda.srbct <- predict(train.plsda.srbct, X.test)
# evaluate the prediction accuracy of model with 10 components
predict.comp <- predict.plsda.srbct$class$max.dist[,10]
table(factor(predict.comp, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 7 0 0 0
## BL 0 1 0 0
## NB 0 0 2 0
## RMS 1 0 0 2
Here we used predict() to predict the tumour classification of the 13 test samples based on their gene expression. The results of this prediction for the model with 10 components, using max.dist, were extracted using predict.plsda.srbct$class$max.dist[,10]. The real tumour classifications for these 13 samples are known (Y.test), so we can assess how accurate the predictions are by building a classification matrix. In the classification matrix, the real Y.test values are along the columns and the predicted values are along the rows.
This matrix shows that only 1 of the 13 test samples was incorrectly classified by the model: an EWS sample was predicted to be RMS.
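The number of correct predictions can also be counted directly from the classification matrix; a small sketch reusing the table built above:
# count correctly classified test samples from the classification matrix
conf.mat <- table(factor(predict.comp, levels = levels(Y)), Y.test)
sum(diag(conf.mat))  # correctly classified test samples (12 here)
sum(conf.mat)        # total number of test samples (13)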
To test how prediction accuracy changes when we reduce the number of components, we can extract the tumour class predictions from the PLS-DA model with 1 component:
# evaluate the prediction accuracy of the model with 1 component
predict.comp <- predict.plsda.srbct$class$max.dist[,1]
table(factor(predict.comp, levels = levels(Y)), Y.test)
## Y.test
## EWS BL NB RMS
## EWS 1 0 1 0
## BL 0 1 1 0
## NB 0 0 0 0
## RMS 7 0 0 2
With the 1-component model we can see that the model performs poorly. Only 4 of the 13 test samples were correctly classified, compared to 12 correctly classified by the model with 10 components. The problem mainly comes from EWS samples being incorrectly classified as RMS, which occurred for 7 of the test samples.
Here we use a different dataset, where numerical values for exercise
metrics (X
) are used to predict physiological metrics
(Y
) using a PLS model. Similar to the classification
example, we don’t have test data. Instead of splitting the data into
training and test like we did before, here we artificially create two
individuals to be the test data.
# Load data
data(linnerud)
X <- linnerud$exercise # numerical values for chins, situps and jumps
Y <- linnerud$physiological # numerical values for weight, waist and pulse
# Build PLS model
linn.pls <- pls(X, Y, ncomp = 2, mode = "classic")
# Create test data artificially
indiv1 <- c(200, 40, 60)
indiv2 <- c(190, 45, 45)
X.test <- rbind(indiv1, indiv2)
colnames(X.test) <- colnames(X)
# Predict the Y outcome for new X test data
predict.linn.pls <- predict(linn.pls, X.test)
# Print predicted Y values for two new test individuals
print(predict.linn.pls$predict)
## , , dim1
##
## Weight Waist Pulse
## indiv1 11.95177 7.140617 75.80745
## indiv2 21.37241 8.738121 74.69339
##
## , , dim2
##
## Weight Waist Pulse
## indiv1 -28.98181 -3.864518 86.34437
## indiv2 -19.33882 -2.207234 85.17307
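Because the predictions are returned as a three-dimensional array (samples × Y variables × components), the predictions made using the full two-component model can be extracted by indexing the third dimension, for example:
# extract the predictions made using both components (the third dimension indexes components)
predict.linn.pls$predict[, , 2]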
Here, based on our artificially created individuals with specific
X
values for ‘Chins’, ‘Situps’ and ‘Jumps’, our model has
predicted Y
outcomes ‘Weight’, ‘Waist’ and ‘Pulse’. This
example demonstrates how predict()
can be used to output
Y
values for samples where we only know X
. As
we don’t have the ‘real’ output values, we cannot check the accuracy of
this prediction.