Case Study of sPCA with Multidrug dataset

Principal Component Analysis (PCA) is primarily used to explore one single type of ’omics data (e.g. transcriptomics, proteomics, metabolomics, etc) and identify the largest sources of variation. We often use PCA as a preliminary step to better understand the data. In high dimensionality contexts, sPCA is desirable over PCA.

To begin

Load the latest version of mixOmics:

library(mixOmics)

The data

The multidrug dataset is used by the National Cancer Institute to screen for anticancer activity. The data come from a pharmacogenomic study (Szakács et al., 2004).

The mixOmics multidrug data set is accessed via multidrug, and contains the following:

multidrug$ABC.trans (continuous matrix): 60 rows and 48 columns. The expression of the 48 human ABC transporters for the 60 cell lines.
multidrug$compound (continuous matrix): 60 rows and 1429 columns. The activity of 1429 drugs for the 60 cell lines.
multidrug$comp.name (character vector): The names or the NSC No. of the 1429 compounds.
multidrug$cell.line (list of two character vectors): First vector (Sample) are the sample names of the 60 cell line which were analysed. Second vector (Class) is the phenotypes of the 60 cell lines. The NCI-60 panel includes cell lines derived from cancers of colorectal (7 cell lines), renal (8), ovarian (6), breast (8), prostate (2), lung (9) and central nervous system origin (6), as well as leukemias (6) and melanomas (8).

First, the ABC transported data is examined by calling multidrug$ABC.trans. The dimensions of this dataset are confirmed:

data(multidrug) # call multidrug dataset
X <- multidrug$ABC.trans # extract ABC transporter data
dim(X) # confirm the dimension of data

## [1] 60 48

Initial analysis

To begin analysis, the data can be explored using unoptimised parameters. This will provide preliminary information about the structure of the dataset and aid in tuning sPCA. First we construct a simple model using an arbitrary number of components and all variables. Then, it is plotted to assess the explained variance:

# run preliminary model
trans.spca <- spca(X, ncomp = 10, center = TRUE, scale = TRUE) 
plot(trans.spca) # plot the explained variance per component

FIGURE 1: Explained variance of Principal Components on the Multidrug ABC Transporter data

Note that Figure 1 can also be achieved using the tune.pca() function, as described further below. The corresponding code would be:

# extract the percentages of explained variance
explainedVariance <- tune.pca(X, ncomp = 10, center = TRUE, scale = FALSE) 
plot(explainedVariance) # plot these values

Tuning sPCA

Transforming the data

The scale and centre Parameters

Differing ranges and scales of variables introduces bias into the output of the PCA function. Therefore, data can be centered and/or scaled when feature variance is not homogeneous. The relevant parameters default to center = TRUE and scale = FALSE. These can be manually adjusted by the user in cases where this is not desired. If the variance across the variables being used are similar, then only centering would be required. However, if the variance of each variable is vastly different (which can be induced by differing scales), scaling will be required.

Selecting the number of components

The ncomp Parameter

PCA is meant to reduce the complexity of a dataset and summarise it in the minimal number of dimensions possible. The number of PCs to be retained is a crucial decision in PCA. This choice can be informed by the proportion of variance that is captured by each of the PCs. Figure 1 assists in deciding the number of PCs to retain.

A common method for selection is finding the ‘elbow’ in the barplot. Beyond the ‘elbow’, the decrease in explained variance will stabilise for remaining PCs. PCs beyond this point usually provide negligible returns in terms of the information they explain.

Given Figure 1, two or three PCs would likely be the most suitable selection. The value selected from tune.pca can be inputted into the ncomp parameter within the pca function.

Selecting the number of variables

The keepX Parameter

When using sPCA, the parameter keepX controls the number of variables to use for the construction of each PC. Selecting a value for this parameter using explained variance becomes difficult as when less original variables are used. The total explained variance becomes deflated. Hence, the selected value for keepX is case dependent and should be optimised by the user. Cross-validation will improve the accuracy of this selection. Within the package, there is a tune.spca function which has a very similar role to tune.pca, but has the added functionality of aiding in optimising the number of selected variables.

set.seed(8589) # for reproducibility with this case study, remove otherwise
test.keepX <- c(seq(5, 25, 5)) # set the number of variable values to be tested

tune.spca.res <- tune.spca(X, ncomp = 3, # generate the first 3 components
                           nrepeat = 5, # repeat the cross-validation 5 times
                           folds = 3, # use 3 folds for the cross-validation
                           test.keepX = test.keepX)
plot(tune.spca.res) # plot the optimisation output

FIGURE 2: Tuning the number of variables to select with sPCA on the ABC Transporter data.

The optimisation output can be observed in Figure 2. For each component (coloured according to the legend), the optimal number of features to select is shown by the large diamonds. This is selected per component using a one-sided t-test. The y-axis depicts the average correlation between the predicted and actual components. This is cross-validated over folds folds and repeated nrepeat times.

The optimal number of features can be extracted from this object by the below call:

tune.spca.res$choice.keepX # how many variables per component is optimal

## comp1 comp2 comp3 
##    25    25     5

Final Model

Using these tuned parameters, the final model can now be run to yield an optimised sPCA visualisation.

final.spca <- spca(X, ncomp = 3, # based off figure 1, three components is best
                   keepX = tune.spca.res$choice.keepX)

Plots

Sample Plots

# plot final sPCA samples for first two components
plotIndiv(final.spca, comp = c(1, 2), ind.names = TRUE, 
          group = multidrug$cell.line$Class,  # use class to colour each sample
          legend = TRUE, title = 'Multidrug transporter, sPCA comp 1 - 2')

FIGURE 3: Sample plot from the sPCA performed on the ABC Transporter data. Samples are coloured by cell line type and numbers indicate the sample IDs

The projection of each sample onto the first two Principal Components can be seen in Figure 3. These are coloured according to their cancer type, such that clustering can be observed. The second component seems to separate most of the Melanoma cell lines from the other cancer types interestingly.

Variable Plots

# plot variables against the sPCA components
plotVar(final.spca, comp = c(1, 2), var.names = TRUE,  
        title = 'Multidrug transporter, sPCA comp 1 - 2')

FIGURE 4: Correlation circle plot from the sPCA performed on the ABC Transporter data. Only the transporters selected by the sPCA are shown on this plot.

Here, Figure 4 shows a correlation circle plot which highlights clusters of ABC transporters and shows their contribution to each principal component. For instance, features ABCD2 and ABCC12 are positively correlated with one another as well as contributed to a decent degree to the primary component.

# plot samples and variables against the sPCA components
biplot(final.spca, cex = 0.7, 
       xlabs = paste(multidrug$cell.line$Class, 1:nrow(X)), #simplify names
       group = multidrug$cell.line$Class,  # colour by sample class
       title = 'Multidrug transporter, sPCA comp 1 - 2')

FIGURE 5: Biplot from the sPCA performed on the ABS.trans data after variable selection. The plot highlights which transporter expression levels may be related to specific cell lines, such as melanoma.

The biplot seen in Figure 5 depicts how each feature may explain the positioning of each sample via the use of a biplot. For this, it can be seen that features ABCD1, ABCC2, ABCA9 and ABCB5 may be responsible for the clustering of the Melanoma samples.