library(mixOmics) # import the mixOmics library
data(breast.TCGA) # extract the TCGA data
X1 <- breast.TCGA$data.train$mirna # use the mirna and mrna expression levels as
X2 <- breast.TCGA$data.train$mrna # the X datasets
X <- list(mirna = X1, mrna = X2)
Y <- breast.TCGA$data.train$protein # set the protein levels as the Y dataset
block.pls.result <- block.pls(X, Y, design = "full") # run the method
plotIndiv(block.pls.result) # plot the samples
plotVar(block.pls.result, legend = TRUE) # plot the variables
?block.pls
can be run to determine all default arguments
of this function:
ncomp = 2
): The first two PLS
components are calculated,design = "full"
): The strength of all
relationships between dataframes is maximised (= 1) - a “fully
connected” design,mode = regression
): A PLS regression mode is
performed,scale = TRUE
): Each block is
standardised to zero means and unit variances.# set the number of features to use for the X datasets
list.keepX = list(mrna = rep(5, 2), mirna = rep(5,2))
# set the number of features to use for the Y dataset
list.keepY = c(rep(10, 2))
block.spls.result <- block.spls(X, Y, design = "full", # run the method
keepX = list.keepX, keepY = list.keepY)
# plot the contributions of each feature to each dimension
plotLoadings(block.pls.result, ncomp = 1)
plotIndiv(block.pls.result) # plot the samples
plotVar(block.pls.result, legend = TRUE) # plot the variables
?block.spls
can be run to determine all default
arguments of this function:
block.pls
,keepX
, keepY
): If
unspecified, these values will default to using all features of the
original dataframes.Note, “multiblock” will be abbreviated to “MB” within this page.
Prior to learning the functionality of the MB-(s)PLS methods, a solid understanding of standard (s)PLS is strongly advised. The MB variants are extensions of the (s)PLS methods for when more than two datasets are being assessed. This draws on the methodology of Generalised CCA [1]. Here, there are multiple predictor datasests (X1, … XQ) and a response vector/matrix of continuous values (y / Y). As with the standard forms of these methods, MB-sPLS is the sparse variant of MB-PLS and uses feature selection when forming latent components. When dealing with high dimensional datasets, MB-sPLS would be the recommended method as the non sparse version suffers from a lack of interpretability in this contexts.
MB-(s)PLS features the same four modes of operation as (s)PLS,
including “regression”, “canonical”, “invariant” and “classic”. These
each function the same way they do in (s)PLS. However, the
block.pls()
and block.spls()
functions do not
have the same tuning and performance assessment methods when compared to
pls()
, spls()
, block.plsda()
and
block.splsda()
.
For a breakdown of how to construct the design
matrix,
refer to the ‘N-Integration Methods page’
There are two ways in which the response dataset can be specified. If
it is included in the list of datasets passed in via the X
parameter, then indY
passes the index of the desired Y
dataframe to the function. In the below examples, the protein expression
data is set to be the response dataframe:
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna
X3 <- breast.TCGA$data.train$protein
X <- list(mirna = X1, mrna = X2, protein = X3)
block.pls.result <- block.pls(X, indY = 3)
The alternative method is to have the desired Y dataframe totally separate to all the X datasets:
X1 <- breast.TCGA$data.train$mirna
X2 <- breast.TCGA$data.train$mrna
X <- list(mirna = X1, mrna = X2)
Y <- breast.TCGA$data.train$protein
block.pls.result <- block.pls(X, Y = Y)