Within mixOmics, the predict(), tune() and perf()/perf.assess()
functions all assign new observations a predicted class. In these cases,
a categorical, supervised algorithm ((s)PLS-DA, DIABLO, MINT (s)PLS-DA)
is being used. There is no one set way of determining which class is the
most appropriate to assign to a new sample. This package implements
three different distance metrics for this:
max.dist, centroids.dist, and mahalanobis.dist.
Before deciding which of these metrics is appropriate in a given
context, it helps to describe how each one works within the package.
In any classification task, there is an
outcome vector (\(y\)) of length \(N\) with \(K\) levels. Internally, this is “dummy”
encoded into a matrix \(Y\) of size \(N\) x \(K\) where each column represents one
outcome level. Each row of this matrix contains all 0’s
except in the column that corresponds to the outcome level of that
sample, which contains a 1.
In a three class problem (classes: A, B and C), \(y\)
will look something like the vector below:
y
## [1] "A" "B" "A" "C" "C" "A" "C" "B" "A" "A"
When this outcome is dummy encoded, the matrix representation looks like:
dummy.y
##    A B C
## 1  1 0 0
## 2  0 1 0
## 3  1 0 0
## 4  0 0 1
## 5  0 0 1
## 6  1 0 0
## 7  0 0 1
## 8  0 1 0
## 9  1 0 0
## 10 1 0 0
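The dummy encoding described above can be reproduced with a few lines of base R. This is a minimal sketch for illustration only, not the package's internal code; the object names y.factor and dummy.y are chosen here for the example.

```r
# Outcome vector from the three-class example
y <- c("A", "B", "A", "C", "C", "A", "C", "B", "A", "A")
y.factor <- factor(y)

# One indicator column per outcome level:
# 1 where the sample belongs to that level, 0 otherwise
dummy.y <- sapply(levels(y.factor), function(k) as.integer(y.factor == k))
rownames(dummy.y) <- as.character(seq_along(y))

dummy.y
```

Each row sums to 1, since every sample belongs to exactly one class.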
When a supervised model (e.g. PLS-DA) is given a set of novel
samples, it generates \(Y_{new}\). This new matrix mimics
\(Y\) in its structure but, instead of 1’s and 0’s, each cell contains
a score reflecting how strongly that sample is predicted to belong to
each class. These scores are not probabilities: the values in a row
need not sum to 1.
In an example where there are 5 novel samples, the internal \(Y_{new}\) would look something like:
Y.new
##        A      B      C
## 1 0.4161 0.6158 0.9278
## 2 0.6580 0.1450 0.9388
## 3 0.5796 0.3854 0.1769
## 4 0.3553 0.0631 0.1110
## 5 0.2315 0.1139 0.1196
Classification also yields the \(T_{pred}\) matrix. This matrix holds the predicted component scores of each new sample and is of size \(N_{new}\) x \(H\), where \(N_{new}\) is the number of novel samples and \(H\) is the number of components in the model.
The max.dist metric is the simplest and most intuitive approach for predicting the class of a test sample. For each new sample, the class with the largest predicted score (‘dummy score’) in \(Y_{new}\) is the assigned class. This metric performs quite well when analysing a single dataset with multiple classes but is less effective in other settings.
The predictions of the samples in the example \(Y_{new}\) above would be:
max.dist.pred
## [1] "C" "C" "A" "A" "A"
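The max.dist rule amounts to a row-wise arg-max over \(Y_{new}\). The sketch below applies it to the example scores shown above, using base R only; it illustrates the rule and is not the package's own implementation.

```r
# Example predicted scores for 5 novel samples (values from the Y.new example)
Y.new <- matrix(c(0.4161, 0.6158, 0.9278,
                  0.6580, 0.1450, 0.9388,
                  0.5796, 0.3854, 0.1769,
                  0.3553, 0.0631, 0.1110,
                  0.2315, 0.1139, 0.1196),
                ncol = 3, byrow = TRUE,
                dimnames = list(as.character(1:5), c("A", "B", "C")))

# For each sample (row), pick the class (column) with the largest score
max.dist.pred <- colnames(Y.new)[apply(Y.new, 1, which.max)]

max.dist.pred
```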
The centroids.dist metric is less obvious in how it is calculated but is more
robust than max.dist. First, for each of the \(K\) classes, the centroid
(\(G_{k}\)) is calculated using all the training samples associated with
that class: \(G_{k}\) is the mean of the values of these samples on the
\(H\) components.
Using the values in \(T_{pred}\), each test sample has the Euclidean distance to each \(G_{k}\) calculated. The centroid that minimises this distance (within the \(H\) component space) is the class that is assigned to that sample.
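The two steps above (class centroids, then nearest centroid by Euclidean distance) can be sketched in base R. All data below are hypothetical values invented for illustration: T.train stands in for the training samples' component scores, y.train for their classes, and T.pred for \(T_{pred}\). This is a sketch of the idea, not the package's internal code.

```r
# Hypothetical training component scores (6 samples, H = 2) and their classes
T.train <- matrix(c(-2.0,  1.0,
                    -1.5,  0.5,
                     2.0,  1.0,
                     2.5,  1.5,
                     0.0, -2.0,
                     0.5, -2.5),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("comp1", "comp2")))
y.train <- factor(c("A", "A", "B", "B", "C", "C"))

# Hypothetical predicted component scores for two new samples (T_pred)
T.pred <- matrix(c(-1.8,  0.6,
                    0.4, -1.9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("comp1", "comp2")))

# Centroid G_k: mean of the class-k training samples on each component
G <- t(sapply(levels(y.train), function(k) {
  colMeans(T.train[y.train == k, , drop = FALSE])
}))

# Euclidean distance from each new sample (rows) to each centroid (columns)
eucl <- apply(G, 1, function(g) sqrt(rowSums(sweep(T.pred, 2, g)^2)))

# Assign each new sample to the class of its nearest centroid
centroids.dist.pred <- rownames(G)[apply(eucl, 1, which.min)]

centroids.dist.pred
```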
Classifications made using this metric are less susceptible to
outliers within the training set. This metric is best used when the
classes cluster moderately well, which can be checked by plotting
the samples via the plotIndiv() function.
The last metric, mahalanobis.dist, is very similar to the centroids.dist
metric. The centroids are calculated in the same way, from the position of each
training sample in the \(H\) component
space. Then, the distance from each test sample’s projection onto the
components to each centroid is calculated. However, this metric uses the
Mahalanobis distance rather than the Euclidean distance. This distance
takes into account the correlation between the components,
giving more weight to less correlated components [1].
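The same nearest-centroid rule can be sketched with the Mahalanobis distance using the base R function stats::mahalanobis(), which returns squared distances. The data are hypothetical values invented for illustration, and the covariance matrix used here (a pooled within-class covariance of the training scores) is an assumption of this sketch; the text above does not specify which covariance estimate the package uses.

```r
# Hypothetical training component scores (6 samples, H = 2) and their classes
T.train <- matrix(c(-2.0,  1.0,
                    -1.5,  0.5,
                     2.0,  1.0,
                     2.5,  1.5,
                     0.0, -2.0,
                     0.5, -2.5),
                  ncol = 2, byrow = TRUE,
                  dimnames = list(NULL, c("comp1", "comp2")))
y.train <- factor(c("A", "A", "B", "B", "C", "C"))

# Hypothetical predicted component scores for two new samples (T_pred)
T.pred <- matrix(c(-1.8,  0.6,
                    0.4, -1.9),
                 ncol = 2, byrow = TRUE,
                 dimnames = list(NULL, c("comp1", "comp2")))

# Class centroids, as for centroids.dist
G <- t(sapply(levels(y.train), function(k) {
  colMeans(T.train[y.train == k, , drop = FALSE])
}))

# ASSUMPTION: pooled within-class covariance of the training scores
resid <- T.train - G[as.character(y.train), , drop = FALSE]
S <- crossprod(resid) / nrow(T.train)

# mahalanobis() returns squared distances, so take the square root
maha <- apply(G, 1, function(g) sqrt(mahalanobis(T.pred, center = g, cov = S)))

# Assign each new sample to the class of its nearest centroid
mahalanobis.dist.pred <- rownames(G)[apply(maha, 1, which.min)]

mahalanobis.dist.pred
```

Because the off-diagonal entries of S are non-zero in this example, the distance is no longer measured along circles around each centroid but along ellipses aligned with the correlation between components.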
The figures below use the background.predict()
function to produce the coloured sample plots. For a breakdown
of the usage of this function (and the source of the figures), refer to
the sPLS-DA SRBCT Case Study.
Figure 1 shows how each metric partitions the space spanned by
the two components into a region for each class.
Depending on how well each class clusters, the different
distance metrics perform better or worse. For instance, the
EWS class (blue) samples are all found in the correct
region when using maximum distance. The non-linear boundary produced by
the Mahalanobis distance leaves one EWS sample in the wrong
region. However, the Mahalanobis distance yields much more specific
boundaries and misclassifies only this single sample across all
classes.