Principal Component Analysis (PCA) is primarily used to explore one single type of ’omics data (e.g. transcriptomics, proteomics, metabolomics, etc) and identify the largest sources of variation. We often use PCA as a preliminary step to better understand the data. In high dimensionality contexts, sPCA is desirable over PCA.
Load the latest version of mixOmics:
library(mixOmics)
The multidrug dataset is used by the National Cancer Institute to screen for anticancer activity. The data come from a pharmacogenomic study (Szakács et al., 2004).
The mixOmics
multidrug data set is accessed via
multidrug
, and contains the following:
multidrug$ABC.trans
(continuous matrix): 60 rows and 48
columns. The expression of the 48 human ABC transporters for the 60 cell
lines.multidrug$compound
(continuous matrix): 60 rows and
1429 columns. The activity of 1429 drugs for the 60 cell lines.multidrug$comp.name
(character vector): The names or
the NSC No. of the 1429 compounds.multidrug$cell.line
(list of two character vectors):
First vector (Sample
) are the sample names of the 60 cell
line which were analysed. Second vector (Class
) is the
phenotypes of the 60 cell lines. The NCI-60 panel includes cell lines
derived from cancers of colorectal (7 cell lines), renal (8), ovarian
(6), breast (8), prostate (2), lung (9) and central nervous system
origin (6), as well as leukemias (6) and melanomas (8).First, the ABC transported data is examined by calling
multidrug$ABC.trans
. The dimensions of this dataset are
confirmed:
data(multidrug) # call multidrug dataset
X <- multidrug$ABC.trans # extract ABC transporter data
dim(X) # confirm the dimension of data
## [1] 60 48
To begin analysis, the data can be explored using unoptimised parameters. This will provide preliminary information about the structure of the dataset and aid in tuning sPCA. First we construct a simple model using an arbitrary number of components and all variables. Then, it is plotted to assess the explained variance:
# run preliminary model
trans.spca <- spca(X, ncomp = 10, center = TRUE, scale = TRUE)
plot(trans.spca) # plot the explained variance per component
FIGURE 1: Explained variance of Principal Components on the Multidrug ABC Transporter data
Note that Figure 1 can also be achieved using the
tune.pca()
function, as described further below. The
corresponding code would be:
# extract the percentages of explained variance
explainedVariance <- tune.pca(X, ncomp = 10, center = TRUE, scale = FALSE)
plot(explainedVariance) # plot these values
Differing ranges and scales of variables introduces bias into the
output of the PCA function. Therefore, data can be centered and/or
scaled when feature variance is not homogeneous. The relevant parameters
default to center = TRUE
and scale = FALSE
.
These can be manually adjusted by the user in cases where this is not
desired. If the variance across the variables being used are similar,
then only centering would be required. However, if the variance of each
variable is vastly different (which can be induced by differing scales),
scaling will be required.
PCA is meant to reduce the complexity of a dataset and summarise it in the minimal number of dimensions possible. The number of PCs to be retained is a crucial decision in PCA. This choice can be informed by the proportion of variance that is captured by each of the PCs. Figure 1 assists in deciding the number of PCs to retain.
A common method for selection is finding the ‘elbow’ in the barplot. Beyond the ‘elbow’, the decrease in explained variance will stabilise for remaining PCs. PCs beyond this point usually provide negligible returns in terms of the information they explain.
Given Figure 1, two or three PCs would likely be the most suitable
selection. The value selected from tune.pca
can be inputted
into the ncomp
parameter within the pca
function.
When using sPCA, the parameter keepX
controls the number
of variables to use for the construction of each PC. Selecting a value
for this parameter using explained variance becomes difficult as when
less original variables are used. The total explained variance becomes
deflated. Hence, the selected value for keepX
is case
dependent and should be optimised by the user. Cross-validation will
improve the accuracy of this selection. Within the package, there is a
tune.spca
function which has a very similar role to
tune.pca
, but has the added functionality of aiding in
optimising the number of selected variables.
set.seed(8589) # for reproducibility with this case study, remove otherwise
test.keepX <- c(seq(5, 25, 5)) # set the number of variable values to be tested
tune.spca.res <- tune.spca(X, ncomp = 3, # generate the first 3 components
nrepeat = 5, # repeat the cross-validation 5 times
folds = 3, # use 3 folds for the cross-validation
test.keepX = test.keepX)
plot(tune.spca.res) # plot the optimisation output
FIGURE 2: Tuning the number of variables to select with sPCA on the ABC Transporter data.
The optimisation output can be observed in Figure 2. For each
component (coloured according to the legend), the optimal number of
features to select is shown by the large diamonds. This is selected per
component using a one-sided t-test. The y-axis depicts the average
correlation between the predicted and actual components. This is
cross-validated over folds
folds and repeated
nrepeat
times.
The optimal number of features can be extracted from this object by the below call:
tune.spca.res$choice.keepX # how many variables per component is optimal
## comp1 comp2 comp3
## 25 25 5
Using these tuned parameters, the final model can now be run to yield an optimised sPCA visualisation.
final.spca <- spca(X, ncomp = 3, # based off figure 1, three components is best
keepX = tune.spca.res$choice.keepX)
# plot final sPCA samples for first two components
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE,
group = multidrug$cell.line$Class, # use class to colour each sample
legend = TRUE, title = 'Multidrug transporter, sPCA comp 1 - 2')
FIGURE 3: Sample plot from the sPCA performed on the ABC Transporter data. Samples are coloured by cell line type and numbers indicate the sample IDs
The projection of each sample onto the first two Principal Components can be seen in Figure 3. These are coloured according to their cancer type, such that clustering can be observed. The second component seems to separate most of the Melanoma cell lines from the other cancer types interestingly.
# plot variables against the sPCA components
plotVar(final.spca, comp = c(1, 2), var.names = TRUE,
title = 'Multidrug transporter, sPCA comp 1 - 2')
FIGURE 4: Correlation circle plot from the sPCA performed on the ABC Transporter data. Only the transporters selected by the sPCA are shown on this plot.
Here, Figure 4 shows a correlation circle plot which highlights
clusters of ABC transporters and shows their contribution to each
principal component. For instance, features ABCD2
and
ABCC12
are positively correlated with one another as well
as contributed to a decent degree to the primary component.
# plot samples and variables against the sPCA components
biplot(final.spca, cex = 0.7,
xlabs = paste(multidrug$cell.line$Class, 1:nrow(X)), #simplify names
group = multidrug$cell.line$Class, # colour by sample class
title = 'Multidrug transporter, sPCA comp 1 - 2')
FIGURE 5: Biplot from the sPCA performed on the ABS.trans data after variable selection. The plot highlights which transporter expression levels may be related to specific cell lines, such as melanoma.
The biplot seen in Figure 5 depicts how each feature may explain the
positioning of each sample via the use of a biplot. For this, it can be
seen that features ABCD1
, ABCC2
,
ABCA9
and ABCB5
may be responsible for the
clustering of the Melanoma samples.