SRBCT sPLS-DA Case Study, but with permuted class labels
This case study will use the exact same methodology as the sPLS-DA SRBCT Case Study. However, prior to model construction, the class labels (srbct$class) will be randomly permuted. The proportions of each class will be maintained, but the instances they are associated with will be different. This is done to exemplify a case where the distinction between the provided classes is minimal and/or the data are quite noisy.
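As a minimal sketch of what the permutation does, the base R snippet below shuffles a small made-up label vector (`toy.class` is a hypothetical stand-in, not the SRBCT data). It checks that `sample()` leaves the class counts unchanged while altering which sample carries which label. The seed is set only so that the sketch is repeatable.

```r
# Hypothetical toy labels standing in for srbct$class (assumption, not the real data)
toy.class <- factor(rep(c("EWS", "BL", "NB", "RMS"), times = c(5, 3, 4, 6)))

set.seed(42)                        # only to make this sketch repeatable
toy.permuted <- sample(toy.class)   # shuffle the labels without replacement

# Class counts are unchanged by the shuffle
counts.same <- all(table(toy.class) == table(toy.permuted))

# Number of samples whose label differs after the shuffle
n.changed <- sum(toy.class != toy.permuted)

print(counts.same)
print(n.changed)
```

Because the shuffle is a permutation of the existing labels, the class proportions are preserved by construction; only the pairing between samples and labels changes.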
sPLS-DA is fairly ineffective when looking at classes defined by linear or non-linear relationships. It is effective when the classes cluster according to a set of “signal” features, even in the presence of large quantities of noise attributes [1].
Rscript
The R script used for all the analysis in this case study is available here.
Note that seed is not set in this script, so re-running the code will result in slightly different outputs (i.e. values and plots) from those shown here.
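If repeatable output is wanted, one option is to fix the random seed before any call that draws random numbers. The sketch below is an illustration in base R, not part of the original script: the labels are made up and the seed value 123 is arbitrary.

```r
# Hypothetical labels; the seed value 123 is arbitrary (assumption)
labels <- factor(rep(c("A", "B", "C"), each = 4))

set.seed(123)
perm.first <- sample(labels)

set.seed(123)                  # resetting the seed replays the same shuffle
perm.second <- sample(labels)

# TRUE: both shuffles are the same because the seed was reset in between
print(identical(perm.first, perm.second))
```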
Set up models
library(mixOmics) # import the mixOmics library
data(srbct) # load the small round blue cell tumour (SRBCT) data
X <- srbct$gene # use the gene expression data as the X matrix
Y <- srbct$class # use the class labels as the outcome Y - NOT permuted
Yp <- sample(srbct$class) # use the class labels as the outcome Y - permuted
optimal.ncomp <- 3
optimal.keepX <- c(9, 260, 30)
Rather than repeating the entire tuning process, this case study reuses the optimal values obtained in the sPLS-DA SRBCT Case Study, which can be seen directly below. Using these values, the sPLS-DA models can be constructed.
optimal.ncomp
## [1] 3
optimal.keepX
## [1] 9 260 30
final.splsda.normal <- splsda(X, Y,
                              ncomp = optimal.ncomp,
                              keepX = optimal.keepX)
final.splsda.permuted <- splsda(X, Yp,
                                ncomp = optimal.ncomp,
                                keepX = optimal.keepX)
Variable Stability
sPLS-DA is ultimately a feature selection tool: it extracts the variables that best separate the classes. Hence, assessing the stability of each feature (the frequency with which it is selected across cross-validated folds) gives an indication of how well that feature selection performs.
Below, the number of features with a stability above 0.4 and the mean stability of all selected features can be seen. For all three components, both values were lower in the permuted case than in the normal case. As this was measured over 5 folds and 10 repeats, the difference is unlikely to be an artefact of a single cross-validation split. By permuting the labels, the ability of the model to locate signal features is substantially hindered. More noisy features are interpreted as useful by the model, resulting in lower average stabilities of the selected features.
## [1] "Normal component 1"
## - Features with high stability: 5
## - Mean stability: 0.1232877
## [1] "Normal component 2"
## - Features with high stability: 206
## - Mean stability: 0.2042419
## [1] "Normal component 3"
## - Features with high stability: 24
## - Mean stability: 0.15625
## [1] "Permutted component 1"
## - Features with high stability: 2
## - Mean stability: 0.04891304
## [1] "Permutted component 2"
## - Features with high stability: 49
## - Mean stability: 0.12482
## [1] "Permutted component 3"
## - Features with high stability: 0
## - Mean stability: 0.04601227
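To make the two summary values above concrete, the following base R sketch computes them from a made-up set of per-fold selections. The feature names and selections are hypothetical, and the code is only an illustration of the definition of stability given earlier (selection frequency across cross-validation runs); it is not the code that produced the output above.

```r
# Hypothetical selections: each element lists the features chosen in one
# cross-validation fold/repeat (made-up names, not SRBCT genes)
selected.per.run <- list(
  c("g1", "g2", "g3"),
  c("g1", "g2", "g4"),
  c("g1", "g2", "g5"),
  c("g1", "g2", "g6"),
  c("g1", "g7", "g8")
)

n.runs <- length(selected.per.run)

# Stability = proportion of runs in which a feature was selected
stability <- table(unlist(selected.per.run)) / n.runs

n.high <- sum(stability > 0.4)      # features with stability above 0.4
mean.stability <- mean(stability)   # mean over all features selected at least once

print(stability)
print(n.high)
print(mean.stability)
```

In this toy example only `g1` and `g2` are selected consistently; the remaining features each appear in a single run, which pulls the mean stability down in the same way that noisy features do in the permuted model.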
Plots
Sample Plots
First, observe the sample plots from the sPLS-DA SRBCT Case Study, shown in Figure 1. The components are useful discriminators between classes; for instance, the first component is best suited to separating the BL class from the others. Overall, the classes separate quite well (save for the NB and RMS classes on the second component).
FIGURE 1: Sample plots from sPLS-DA performed on the SRBCT gene expression data including 95% confidence ellipses. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3. Samples are coloured by their tumour subtypes.
The equivalent plots using the permuted labels can be seen in Figure 2. There is a stark decrease in the quality of separation of classes. Large degrees of overlap are present across all three components. The lack of distinction between the classes is a key indicator of a failure of the sPLS-DA methodology to produce useful components.
FIGURE 2: Sample plots from sPLS-DA performed on the SRBCT gene expression data after class label permutation. Samples are projected into the space spanned by the first three components. (a) Components 1 and 2 and (b) Components 1 and 3.
Variable Plots
Next, the correlation circle plots will be evaluated. The features of the permuted data seemingly have a lower average absolute correlation with the sPLS-DA components. The clusters of features are also more dispersed. Separation of these features along the first component is severely reduced.
In combination, Figures 1 and 3 provide useful insights into which features are associated with specific class labels (i.e. features positively correlated with the first component are likely key features in defining the BL class). The same inferences cannot be made using the permuted form of this data (Figures 2 and 4).
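The coordinates shown in a correlation circle plot are the correlations between each feature and each component. The sketch below illustrates how an average absolute correlation could be computed from such values. It uses simulated data, and PCA scores from `prcomp()` stand in for sPLS-DA component scores; both are assumptions made so that the example runs in base R, and none of the numbers relate to the SRBCT results.

```r
set.seed(1)                       # repeatable toy data
n <- 50
latent <- rnorm(n)                # hidden signal shared by some features

# Toy matrix: 3 features driven by the signal, 3 pure-noise features
X.toy <- cbind(
  s1 = latent + rnorm(n, sd = 0.3),
  s2 = latent + rnorm(n, sd = 0.3),
  s3 = -latent + rnorm(n, sd = 0.3),
  n1 = rnorm(n), n2 = rnorm(n), n3 = rnorm(n)
)

# PCA scores stand in for sPLS-DA component scores here (assumption)
scores <- prcomp(X.toy, scale. = TRUE)$x[, 1:2]

# Correlation of every feature with each component: the coordinates
# that a correlation circle plot would display
feature.cor <- cor(X.toy, scores)

# Mean absolute correlation with component 1, signal vs noise features
mean.abs.signal <- mean(abs(feature.cor[c("s1", "s2", "s3"), 1]))
mean.abs.noise  <- mean(abs(feature.cor[c("n1", "n2", "n3"), 1]))

print(round(feature.cor, 2))
print(c(signal = mean.abs.signal, noise = mean.abs.noise))
```

Features that carry signal sit close to the edge of the circle, while noise features fall near the centre, which is the pattern the permuted data tends towards.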
FIGURE 3: Correlation circle plot representing the genes selected by sPLS-DA performed on the SRBCT gene expression data. Gene names are truncated to the first 10 characters. Only the genes selected by sPLS-DA are shown in components 1 and 2.
FIGURE 4: Correlation circle plot representing the genes selected by sPLS-DA performed on the SRBCT gene expression data after class label permutation. Gene names are truncated to the first 10 characters. Only the genes selected by sPLS-DA are shown in components 1 and 2.
Prediction Performance
Cross-validated error rate is usually the best way to evaluate the performance of an sPLS-DA model [1]. It is the key indicator of when it is appropriate to use this type of model. Other metrics such as precision, recall and the F1 score are also useful indicators, but are secondary to the error rate.
Across 5 folds and 100 repeats, these four performance metrics are shown for each of the classes.
## [1] "Normal model performance metrics"
##                  EWS         BL         NB       RMS
## error     0.03716667 0.01316667 0.01183333 0.0265000
## precision 0.96208095 0.81416667 0.91503333 0.9543905
## recall    0.93469048 0.83733333 0.93926667 0.9567524
## f1        0.94039082 0.81886667 0.92013333 0.9487594
## [1] "Permuted model performance metrics"
##                 RMS       EWS         NB         BL
## error     0.4238333 0.5235000 0.38550000 0.30416667
## precision 0.2647262 0.2745008 0.05478333 0.02640476
## recall    0.2320190 0.2574071 0.06326667 0.03700000
## f1        0.2216599 0.2381645 0.05078759 0.02645714
It is clear that in the permuted case, all these metrics worsen (i.e. error increases; precision, recall and F1 decrease). In the normal case, there is a set of signal features, each of which is involved in distinguishing one or more classes from the remaining classes. By permuting the labels, the samples which allowed the model to determine which features were associated with a specific class are no longer part of that class. This mimics data that has few signal features, representing the case where sPLS-DA becomes an ineffective classifier and feature selection tool.
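For readers unfamiliar with the per-class metrics reported above, the sketch below shows one common way to compute precision, recall and F1 from true and predicted labels in base R. The helper `class.metrics()` and the example labels are hypothetical and are not taken from the original script. The per-class error is left out because the source does not state how it was defined.

```r
# Per-class precision, recall and F1 from true and predicted labels.
# class.metrics() is a hypothetical helper written for this sketch.
class.metrics <- function(truth, predicted) {
  lv <- levels(truth)
  predicted <- factor(predicted, levels = lv)
  cm <- table(truth = truth, predicted = predicted)  # confusion matrix
  tp <- diag(cm)                                     # correct predictions per class
  precision <- tp / colSums(cm)                      # TP / (TP + FP)
  recall    <- tp / rowSums(cm)                      # TP / (TP + FN)
  f1 <- 2 * precision * recall / (precision + recall)
  # Note: a class that is never predicted gives NaN precision (0 / 0)
  rbind(precision = precision, recall = recall, f1 = f1)
}

# Made-up labels, not taken from the SRBCT results
truth     <- factor(c("BL", "BL", "EWS", "EWS", "EWS",
                      "NB", "NB", "RMS", "RMS", "RMS"))
predicted <- c("BL", "EWS", "EWS", "EWS", "EWS",
               "NB", "NB", "RMS", "RMS", "NB")

metrics <- class.metrics(truth, predicted)
print(round(metrics, 3))
```

In a repeated cross-validation setting, such a table would be computed for each run on the held-out samples and then averaged, which is presumably how the values above were summarised.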