Single Cell Gastrulation Multiblock sPLS Case Study, but with inappropriate data

The aim here is to raise awareness of cases where the datasets being considered by the model are inappropriate and/or poorly integrated with one another.

Getting started

For simplicity’s sake, a basic design matrix will be used, such that the relationship coefficient between each pair of datasets is 0.5. An arbitrarily selected keepX value of 50 will be used for each of the 3 components in every dataset.
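The setup described above can be sketched in R as follows. The block names (rna, acc_genebody, acc_promoter) and the wrapper fit_block_spls() are hypothetical, and the zero diagonal follows the usual mixOmics convention rather than anything stated here. The block.spls() call reflects one reading of its interface and should be checked against the package documentation before use.

```r
# Hypothetical block names; substitute the names used in your own data list.
block_names <- c("rna", "acc_genebody", "acc_promoter")

# Design matrix: 0.5 between every pair of blocks, 0 on the diagonal
# (the zero diagonal is an assumption based on common mixOmics usage).
design <- matrix(0.5,
                 nrow = length(block_names),
                 ncol = length(block_names),
                 dimnames = list(block_names, block_names))
diag(design) <- 0

# Keep 50 features on each of the 3 components, for every block.
n_comp <- 3
keep_x <- setNames(rep(list(rep(50, n_comp)), length(block_names)),
                   block_names)

# Hypothetical wrapper, defined but not called here. 'blocks' is assumed to be
# a named list of samples-by-features matrices with identical row order.
fit_block_spls <- function(blocks, response = "rna") {
  mixOmics::block.spls(X = blocks,
                       indY = which(names(blocks) == response),
                       ncomp = n_comp,
                       keepX = keep_x,
                       design = design)
}
```

Printing design and keep_x before fitting is a cheap way to confirm that every block has an entry and that the names match those of the data list.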

Plotting the sample projections can provide insight into the quality of the model. Looking at Figure 1 (same as in the Multiblock sPLS Gastrulation Case Study), it seems that there is moderate separation of some of the lineage classes. However, when plotting the samples from the two accessibility dataframes (Figures 2 and 3), it is clear that the multiblock sPLS method has failed to produce useful components for this data.

Observe the projection of the endoderm samples in Figure 1 compared to Figure 3. In the former, the second latent component (and, to a lesser extent, the first) is able to separate this class from the others. Although this is an unsupervised method, separation of a known class indicates that the components are useful. In the latter, however, the endoderm samples are scattered almost randomly amongst the other classes. This shows the lack of a correlated signal for this class across the two datasets. The same holds for the other classes, as similar behaviour is observed.

The sPLS method (and its multiblock variant) seeks to maximise the covariance between the components of each dataset. Judging by these sample plots, it has been unsuccessful in this case.
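One rough numerical companion to the sample plots is to correlate the component scores of each block with those of every other block. The sketch below uses a hypothetical helper, block_score_cor(), and simulated scores rather than the gastrulation data. It reports correlation instead of covariance because correlation does not depend on the scale of each block; it is offered as a diagnostic, not as the criterion the method optimises.

```r
# Correlation of the scores on one component between every pair of blocks.
# 'variates' is assumed to be a named list of samples-by-components matrices,
# for example the $variates element of a fitted mixOmics model.
block_score_cor <- function(variates, comp = 1) {
  scores <- sapply(variates, function(v) v[, comp])
  cor(scores)
}

# Toy illustration with simulated scores (NOT the gastrulation data):
# two blocks share a signal, the third is pure noise.
set.seed(1)
shared <- rnorm(100)
toy <- list(
  rna          = cbind(shared + rnorm(100, sd = 0.1)),
  acc_genebody = cbind(shared + rnorm(100, sd = 0.1)),
  acc_promoter = cbind(rnorm(100))
)
round(block_score_cor(toy), 2)
```

In the toy example, the two blocks that share a signal correlate strongly while the noise block does not. A block whose scores behave like the noise block here would be a candidate for removal from the integration.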

FIGURE 1: Sample plot for multiblock sPLS performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the RNA dataset.

FIGURE 2: Sample plot for multiblock sPLS performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the gene body accessibility dataset.

FIGURE 3: Sample plot for multiblock sPLS performed on the gastrulation data. Samples are projected into the space spanned by the components yielded from the promoter accessibility dataset.

These figures show the importance of inspecting every plot produced by the methods within the mixOmics package - especially when integrating multiple datasets. Here, the model has attempted to yield components with a high degree of covariance and in doing so has produced components that separate the lineage classes poorly in two of the three datasets.
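A minimal sketch of drawing one sample plot per block is given below. The helper plot_each_block() and its lineage argument (a factor of lineage labels, one per sample) are hypothetical, and the code assumes that the fitted model stores one score matrix per block under model$variates, which should be confirmed for the installed mixOmics version.

```r
# Hypothetical helper: draws one sample plot for each block in the model.
# 'model' is assumed to be the result of mixOmics::block.spls() and
# 'lineage' a factor with one label per sample, used only for colouring.
plot_each_block <- function(model, lineage, comp = c(1, 2)) {
  block_names <- names(model$variates)
  for (b in block_names) {
    mixOmics::plotIndiv(model,
                        blocks = b,        # plot this block only
                        comp = comp,       # components on the x and y axes
                        group = lineage,   # colour samples by lineage
                        ind.names = FALSE,
                        legend = TRUE,
                        title = paste("Block:", b))
  }
  invisible(block_names)
}
```

Looping over the block names means no block is skipped by accident, which is the failure mode this case study warns against.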

In this scenario, it is advisable to rerun the analysis with fewer datasets. An even better starting point would be to apply the basic spls() method to the datasets in a pairwise manner rather than using the block.spls() method.
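The pairwise approach could be sketched as below. The block names and the helper fit_pairwise_spls() are hypothetical, and the keepX/keepY value of 50 over 3 components simply reuses the arbitrary settings from earlier rather than tuned values.

```r
# Hypothetical block names; substitute the names used in your own data list.
block_names <- c("rna", "acc_genebody", "acc_promoter")

# Every unordered pair of blocks.
block_pairs <- combn(block_names, 2, simplify = FALSE)

# Hypothetical helper, defined but not called here. 'blocks' is assumed to be
# a named list of samples-by-features matrices with identical row order.
fit_pairwise_spls <- function(blocks, pairs = block_pairs,
                              n_comp = 3, keep = 50) {
  fits <- lapply(pairs, function(p) {
    mixOmics::spls(X = blocks[[p[1]]],
                   Y = blocks[[p[2]]],
                   ncomp = n_comp,
                   keepX = rep(keep, n_comp),
                   keepY = rep(keep, n_comp))
  })
  # Label each fit by its pair, e.g. "rna_vs_acc_genebody".
  names(fits) <- vapply(pairs, paste, character(1), collapse = "_vs_")
  fits
}
```

Inspecting the sample plots of each pairwise fit shows which pairs of datasets share a signal, and so which are worth carrying into a multiblock model.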