The aim here is to raise awareness of cases where the datasets being considered by the model are inappropriate and/or poorly integrated with one another.
For simplicity's sake, a basic design matrix will be used, such that the relationship coefficient between each pair of datasets is 0.5. Also, an arbitrarily selected keepX value of 50 will be used for each of the 3 components across each dataset.
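The setup described above can be sketched in R as follows. This is a minimal illustration rather than the case study's actual code: the block names (rna, acc_genebody, acc_promoter), the simulated matrices, and the choice of indY = 1 are all assumptions made so the example is self-contained. The model fit only runs if mixOmics is installed.

```r
set.seed(1)
n <- 30      # hypothetical number of samples
ncomp <- 3   # three components, as in the text

# Hypothetical stand-in for one omics block (samples in rows, features in columns).
make_block <- function(p, prefix) {
  m <- matrix(rnorm(n * p), nrow = n)
  dimnames(m) <- list(paste0("cell", seq_len(n)), paste0(prefix, seq_len(p)))
  m
}

blocks <- list(
  rna          = make_block(200, "gene"),
  acc_genebody = make_block(200, "body"),
  acc_promoter = make_block(200, "prom")
)

# Design matrix: 0.5 between every pair of blocks, 0 on the diagonal.
design <- matrix(0.5, nrow = length(blocks), ncol = length(blocks),
                 dimnames = list(names(blocks), names(blocks)))
diag(design) <- 0

# keepX: retain 50 features on each of the 3 components, for every block.
keepX <- lapply(blocks, function(b) rep(50, ncomp))

# Fit the multiblock sPLS model (assumption: the first block plays the role of Y).
if (requireNamespace("mixOmics", quietly = TRUE)) {
  model <- mixOmics::block.spls(X = blocks, indY = 1, ncomp = ncomp,
                                keepX = keepX, design = design)
}
```

With real data, blocks would hold the actual RNA and accessibility matrices, with matching sample order across blocks.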
Plotting the sample projections can provide insight into the quality of the model. Looking at Figure 1 (same as in the Multiblock sPLS Gastrulation Case Study), it seems that there is moderate separation of some of the lineage classes. However, when plotting the samples from the two accessibility dataframes (Figures 2 and 3), it is clear that the multiblock sPLS method has failed to produce useful components for this data.
Observe the projection of the endoderm samples within Figure 1 compared to Figure 3. In the former, the second latent component (and the first to a lesser extent) is able to separate this class from the others. Although this is an unsupervised method, components that separate a known class in this way are evidently useful. In the latter, however, the endoderm samples are almost randomly scattered amongst the other classes. This shows the lack of a correlated signal for this class across the two datasets. The same reasoning extends to the other classes, as similar behaviour is observed for them.
The sPLS (and multiblock variant) method seeks to maximise the covariance between the components of each dataset. It seems to have been unsuccessful in this case.
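One way to put a number on how well the blocks agree is to correlate the sample scores of a given component between every pair of blocks. The sketch below uses base R only. The helper cross_block_cor() and the toy scores are hypothetical. On a fitted block.spls object, the list of per-block score matrices in model$variates would be passed instead, which is worth confirming against the package documentation.

```r
# Correlation of one component's sample scores between every pair of blocks.
# `variates` is a named list of (samples x components) score matrices.
cross_block_cor <- function(variates, comp = 1) {
  scores <- sapply(variates, function(v) v[, comp])
  cor(scores)
}

# Toy scores: two blocks share a signal on component 1, the third is pure noise.
set.seed(42)
n <- 40
shared <- rnorm(n)
toy_variates <- list(
  rna          = cbind(shared + rnorm(n, sd = 0.3), rnorm(n)),
  acc_genebody = cbind(shared + rnorm(n, sd = 0.3), rnorm(n)),
  acc_promoter = cbind(rnorm(n), rnorm(n))
)

res <- cross_block_cor(toy_variates, comp = 1)
print(round(res, 2))
```

Off-diagonal values near zero for a block suggest that its component carries little signal in common with the others. This check complements the sample plots and does not replace them.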
FIGURE 1: Sample plot for sPLS2 performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the RNA dataset.
FIGURE 2: Sample plot for sPLS2 performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the gene body accessibility dataset.
FIGURE 3: Sample plot for sPLS2 performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the promoter accessibility dataset.
These figures show the importance of inspecting each and every plot produced by the methods within the mixOmics package, especially when integrating multiple datasets. Here, the model has attempted to yield components with high covariance across datasets and, in doing so, has produced components of little practical use.
In this scenario, it would be recommended to rerun the analysis with fewer datasets. An even better starting point would be to use the basic spls() method across the data in a pairwise manner rather than the block.spls() method.
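The pairwise approach suggested above could be sketched as below. The block names and simulated matrices are again hypothetical stand-ins, and mode = "canonical" together with keepX and keepY of 50 per component are assumptions rather than settings taken from the case study. The fits only run if mixOmics is installed.

```r
set.seed(1)
n <- 30
ncomp <- 3

# Hypothetical stand-in for one omics block (samples in rows, features in columns).
make_block <- function(p, prefix) {
  m <- matrix(rnorm(n * p), nrow = n)
  dimnames(m) <- list(paste0("cell", seq_len(n)), paste0(prefix, seq_len(p)))
  m
}

blocks <- list(
  rna          = make_block(200, "gene"),
  acc_genebody = make_block(200, "body"),
  acc_promoter = make_block(200, "prom")
)

# Every unordered pair of blocks.
pairs <- combn(names(blocks), 2, simplify = FALSE)

# One two-block sPLS model per pair.
if (requireNamespace("mixOmics", quietly = TRUE)) {
  fits <- lapply(pairs, function(p) {
    mixOmics::spls(X = blocks[[p[1]]], Y = blocks[[p[2]]],
                   ncomp = ncomp, mode = "canonical",
                   keepX = rep(50, ncomp), keepY = rep(50, ncomp))
  })
  names(fits) <- vapply(pairs, paste, character(1), collapse = "_vs_")
}
```

Inspecting the sample plots of each pairwise model in turn makes it easier to see which pairs of datasets share a usable signal before attempting a full multiblock integration.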