Webinar: Φ-Space for continuous phenotyping of single-cell multi-omics data

We have developed a new PLS method for the continuous cell type annotation of single cells, now available as a preprint!

  • Φ-Space addresses numerous challenges faced by state-of-the-art automated annotation methods:
    • to identify continuous and out-of-reference cell states,
    • to deal with batch effects in reference,
    • to utilise bulk references and multi-omic references.
  • Φ-Space uses soft classification to phenotype cells on a continuum. The continuous annotation, or phenotype space embedding, is then used to reduce the dimensionality of the data for various downstream analyses.

Φ-Space: Continuous phenotyping of single-cell multi-omics data. Jiadong Mao, Yidi Deng, Kim-Anh Lê Cao. bioRxiv 2024. 

View this 52min video of Kim-Anh Lê Cao presenting Φ-Space at the WEHI Bioinformatics seminar:

Abstract

Single-cell multi-omics technologies have empowered increasingly refined characterisation of the heterogeneity of cell populations. Automated cell type annotation methods have been developed to transfer cell type labels from well-annotated reference datasets to emerging query datasets. However, these methods suffer from some common caveats, including the failure to characterise transitional and novel cell states, sensitivity to batch effects and under-utilisation of phenotypic information other than cell types (e.g. sample source and disease conditions).

We developed Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. In Φ-Space we adopt a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes. The phenotype space embedding enables various downstream analyses, including insightful visualisations, clustering and cell type labelling.

We demonstrate through three case studies that Φ-Space (i) characterises developing and out-of-reference cell states; (ii) is robust against batch effects in both reference and query; (iii) adapts to annotation tasks involving multiple omics types; (iv) overcomes technical differences between reference and query.

Our book is out!

We are excited to announce that our book is out, along with several case studies and R scripts available online. Check out this page.

It’s been a very (very) long-term project, and a great collaboration with Zoe Welham, whose dedication and patience helped shape this project into a readable whole! A huge thank you to Al Abadi, who tirelessly helped update the package as we developed the content.

Multi-omics data integration: method and showcase applications

The Lê Cao team and collaborators from the University of British Columbia (Vancouver, Canada) have published their first method to integrate multiple omics data sets measured on the same set of biospecimens or individuals (e.g. transcriptomics, proteomics). The method takes a holistic, systems biology approach by statistically integrating data from multiple biological compartments. Such an approach provides improved biological insights compared with traditional single-omics analyses, as it takes into account interactions between omics layers and extracts multi-omics molecular networks.

DIABLO is a hypothesis-free multivariate dimension reduction method. It constructs combinations of variables (e.g. cytokines, transcripts, proteins, metabolites) that are maximally correlated across data types in order to identify a minimal subset of markers, i.e. a multi-omics signature. This signature can highlight novel findings and is also a starting point for network modelling.

More information about DIABLO, implemented in the mixOmics R package: Amrit Singh, Casey P Shannon, Benoît Gautier, Florian Rohart, Michaël Vacher, Scott J Tebbutt and Kim-Anh Lê Cao (2019). DIABLO: an integrative approach for identifying key molecular drivers from multi-omics assays. Bioinformatics. You can also find some technical information in the mixOmics paper (particularly in the Supp!) and in our tutorials here.

While the computational researchers were busy developing their method, they also analysed the data from the #SmallBig study (small sample, big data) with the EPIC (Expanded Program on Immunization) Consortium. EPIC comprises researchers from the Boston Children’s Hospital, University of British Columbia, Medical Research Council Unit The Gambia, Université libre de Bruxelles, Telethon Kids Institute and University of Western Australia, and the Papua New Guinea Institute for Medical Research. Together they set out to answer the question: What can less than 1mL of blood tell us about a newborn’s health?

Sample processing of the #SmallBig study (adapted from Lee et al. 2019)

In this study, recently published in Nature Communications, the team developed a technique to collect extremely small volumes of blood (< 1mL) to comprehensively characterise how biological molecules evolve in newborns. Using cutting-edge computational and statistical methods including DIABLO, they show that, in contrast to adult biology, which is at a relatively steady state, the first week of human life is highly dynamic and undergoes dramatic changes. Their results were consistently observed in vastly different areas of the world, West Africa (The Gambia) and Australasia (Papua New Guinea), and suggest a purposeful rather than random developmental path.

More information about the SmallBig study: Amy H. Lee, Casey P. Shannon, […] Tobias R. Kollmann (2019). Dynamic molecular changes during the first week of human life follow a robust developmental trajectory. Nature Communications volume 10, Article number: 1092.

If you are interested in the potential of DIABLO to integrate microbiome and omics from the host, here is another study we published. We integrated the microbiome, proteome and meta-proteomics in T1D individuals.

Design of the multi-omics microbiome study

Identification of multi-omics signature from Gavin et al 2018.

More details about the study: Gavin PG, […], and Hamilton-Williams EE (2018). Intestinal metaproteomics reveals host-microbiota interactions in subjects at risk for type 1 diabetes. Diabetes Care 41: 10. We used DIABLO to integrate microbiome, proteomics and meta-proteomics.

New publication with multiple integration

Our paper ‘Novel Multivariate Methods for Integration of Genomics and Proteomics Data: Applications in a Kidney Transplant Rejection Study’ has just been accepted in OMICS: A Journal of Integrative Biology, from a collaboration with scientists from the PRevention Of Organ Failure (PROOF), University of British Columbia.

It provides a nice case study with the application of PCA, IPCA, sPLS-DA and sGCCA (now implemented in mixOmics with the function wrapper.sgcca()).

Contact us for more details if needed.

Abstract

Multi-omics research is a key ingredient of data-intensive life sciences research, permitting measurement of biological molecules at different functional levels in the same individual. For a complete picture at the biological systems level, appropriate statistical techniques must however be developed to integrate different ‘omics’ data sets (e.g., genomics and proteomics). We report here multivariate projection-based analyses approaches to genomics and proteomics data sets, using the case study of and applications to observations in kidney transplant patients who experienced an acute rejection event (n = 20) versus non-rejecting controls (n = 20). In this data sets, we show how these novel methodologies might serve as promising tools for dimension reduction and selection of relevant features for different analytical frameworks. Unsupervised analyses highlighted the importance of post transplant time-of-rejection, while supervised analyses identified gene and protein signatures that together predicted rejection status with little time effect. The selected genes are part of biological pathways that are representative of immune responses. Gene enrichment profiles revealed increases in innate immune responses and neutrophil activities and a depletion of T lymphocyte related processes in rejection samples as compared to controls. In all, this article offers candidate biomarkers for future detection and monitoring of acute kidney transplant rejection, as well as ways forward for methodological advances to better harness multi-omics data sets.

 

Article published explaining correlation circle plots, relevance networks and CIM

Our manuscript ‘Insightful graphical outputs to explore relationships between two “omics” data sets’ has been published. It explains how to interpret Correlation Circle plots and how relevance networks and CIM are generated from rCCA and sPLS.

Check out this very colourful manuscript here!

General presentation about mixOmics

A new general presentation about mixOmics is available in the Presentation Section (it should be updated with each major update of the package).

Lê Cao K.-A. Unravelling ‘omics’ data with the mixOmics R package, Illustration on several studies. General presentation on mixOmics (last updated 05/04/2012) [Presentation]

(s)IPCA

Independent Principal Component Analysis (IPCA)

In some case studies, we have identified some limitations when using PCA:

  • PCA assumes that gene expression follows a multivariate normal distribution, whereas recent studies have shown that microarray gene expression measurements instead follow a super-Gaussian distribution.
  • PCA decomposes the data by maximising its variance. In some cases, the biological question may not be related to the highest variance in the data.

Instead, we propose to apply Independent Principal Component Analysis (IPCA) which combines the advantages of both PCA and Independent Component Analysis (ICA). It uses ICA as a denoising process of the loading vectors produced by PCA to better highlight the important biological entities and reveal insightful patterns in the data.

IPCA offers a better visualization of the data than ICA and with a smaller number of components than PCA.

How to choose the number of components:

The kurtosis of the loading vectors is used to order the Independent Principal Components. We have shown that the kurtosis value is a good post hoc indicator of the number of components to choose, as a sudden drop in the values corresponds to irrelevant dimensions.
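As a rough illustration of this ordering, the sketch below ranks a few loading vectors by their excess kurtosis. It is written in plain Python rather than R and is not the mixOmics implementation: the function names, the toy loading values and the choice of the population (biased) kurtosis estimator are all assumptions made for the example.

```python
from statistics import fmean


def excess_kurtosis(values):
    """Population excess kurtosis: fourth standardised moment minus 3."""
    mean = fmean(values)
    centred = [v - mean for v in values]
    variance = fmean([c ** 2 for c in centred])
    if variance == 0:
        return 0.0  # a constant vector carries no information about the tails
    fourth_moment = fmean([c ** 4 for c in centred])
    return fourth_moment / variance ** 2 - 3.0


def order_by_kurtosis(loadings):
    """Return (component index, kurtosis) pairs, highest kurtosis first."""
    scored = [(i, excess_kurtosis(vec)) for i, vec in enumerate(loadings)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical loading vectors, one per component (toy values only).
loadings = [
    [0.1, -0.1, 0.1, -0.1, 0.1, -0.1],  # all variables contribute equally
    [0.0, 0.0, 0.0, 0.0, 0.0, 1.0],     # one dominant variable
    [0.3, -0.2, 0.1, 0.0, -0.1, 0.2],
]
ranking = order_by_kurtosis(loadings)
for index, kurt in ranking:
    print(f"component {index}: excess kurtosis {kurt:.2f}")
```

With these toy values the vector dominated by a single variable is ranked first. In practice one would inspect the sorted kurtosis values and keep the components that come before a sudden drop.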

Sparse Independent Principal Component Analysis (sIPCA)

Similar to the sparse PCA version implemented in mixOmics, soft-thresholding is applied to the independent loading vectors in IPCA to perform internal variable selection.
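Soft-thresholding can be sketched in a few lines. The example below is a generic plain-Python illustration, not the mixOmics code: the function names, the toy loading vector and the rule used to pick the penalty (the largest magnitude among the variables to be dropped) are assumptions.

```python
def soft_threshold(loading, penalty):
    """Shrink each entry towards zero by `penalty`; small entries become 0."""
    shrunk = []
    for value in loading:
        magnitude = abs(value) - penalty
        if magnitude <= 0:
            shrunk.append(0.0)  # the variable is dropped from the component
        else:
            shrunk.append(magnitude if value > 0 else -magnitude)
    return shrunk


def keep_top_variables(loading, n_keep):
    """Pick the penalty so that n_keep entries stay non-zero (assuming no ties)."""
    magnitudes = sorted((abs(v) for v in loading), reverse=True)
    penalty = magnitudes[n_keep] if n_keep < len(magnitudes) else 0.0
    return soft_threshold(loading, penalty)


# Hypothetical loading vector over five variables.
loading = [0.8, -0.05, 0.3, -0.6, 0.1]
sparse = keep_top_variables(loading, 2)
print(sparse)
```

Only the two variables with the largest absolute loadings keep a non-zero weight, and their weights are shrunk by the penalty while keeping their sign.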

How to choose the number of variables to select:

How many variables to select is still an open issue. In our paper we proposed to use the Davies-Bouldin measure, which is an index of crisp cluster validity. This index compares the within-cluster scatter with the between-cluster separation.
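For readers unfamiliar with the index, here is a minimal plain-Python sketch of the standard Davies-Bouldin definition (mean distance to the centroid as scatter, Euclidean distance between centroids as separation). The function names and the toy clusters are assumptions, and this is not the code used in the paper.

```python
from math import dist


def centroid(points):
    """Coordinate-wise mean of a list of points."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of points).

    Lower values mean compact clusters that are well separated.
    """
    centres = [centroid(c) for c in clusters]
    # Within-cluster scatter: average distance of the points to their centroid.
    scatter = [
        sum(dist(p, centre) for p in c) / len(c)
        for c, centre in zip(clusters, centres)
    ]
    worst_ratios = []
    for i in range(len(clusters)):
        ratios = [
            (scatter[i] + scatter[j]) / dist(centres[i], centres[j])
            for j in range(len(clusters))
            if j != i
        ]
        worst_ratios.append(max(ratios))
    return sum(worst_ratios) / len(clusters)


# Toy sample coordinates on two components, grouped by known class.
well_separated = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 0.0), (10.0, 2.0)]]
overlapping = [[(0.0, 0.0), (0.0, 2.0)], [(1.0, 0.0), (1.0, 2.0)]]
db_separated = davies_bouldin(well_separated)
db_overlapping = davies_bouldin(overlapping)
print(db_separated, db_overlapping)
```

One way to use it, under these assumptions, is to compute the index on the sample projections obtained for each candidate number of selected variables and to prefer the number that gives the lowest value.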

More details about how to use the ipca.R function in the case study.


New methods: multilevel analyses

A multilevel approach has been added for cross-over design experiments (up to two cross factors), in collaboration with A/Prof B. Liquet (Université de Bordeaux, France). This approach takes into account the complex structure of repeated measurements from different assays, where different treatments are applied to the same subjects, in order to highlight the treatment effects within subjects separately from the biological variation between subjects.

Two different frameworks are proposed:

  • discriminant analysis (method = ‘splsda’) enables the selection of features separating the different treatments.
  • integrative analysis (method = ‘spls’) enables the integration of two matched data sets and the selection of subsets of variables that are correlated (positively or negatively) across the samples. The approach is unsupervised: no prior knowledge about the sample groups is included.

The multilevel function first decomposes the variance in the data sets X (and Y) and then applies either sPLS-DA or sPLS to the within-subject deviation. One- or two-factor analyses are available for sPLS-DA.
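The decomposition step can be illustrated for the simplest one-factor case, where the within-subject deviation is each sample minus the mean profile of its subject. The plain-Python sketch below is an assumption-laden illustration (function name, toy data and subject labels are hypothetical); it does not reproduce the multilevel function of the package.

```python
from collections import defaultdict


def within_subject_deviation(rows, subjects):
    """Subtract each subject's mean profile from that subject's samples.

    rows: list of samples (lists of measurements); subjects: one label per row.
    """
    by_subject = defaultdict(list)
    for row, subject in zip(rows, subjects):
        by_subject[subject].append(row)
    # Between-subject part: the mean profile of each subject.
    means = {
        subject: [sum(col) / len(col) for col in zip(*subject_rows)]
        for subject, subject_rows in by_subject.items()
    }
    # Within-subject part: what remains once the subject mean is removed.
    return [
        [x - m for x, m in zip(row, means[subject])]
        for row, subject in zip(rows, subjects)
    ]


# Two subjects, each measured under two treatments (toy values).
X = [[10.0, 1.0], [12.0, 3.0], [20.0, 5.0], [24.0, 5.0]]
subjects = ["s1", "s1", "s2", "s2"]
X_within = within_subject_deviation(X, subjects)
print(X_within)
```

The large baseline difference between the two subjects disappears from `X_within`, so a subsequent sPLS-DA or sPLS fit would focus on the treatment effect within each subject.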

Associated functions include: multilevel.R, tune.multilevel.R, pheatmap.multilevel.R (see examples in methods, graphics and case studies).

This is our first step towards repeated measurements designs.

The package has been updated to version 4.0-1 to implement these methodologies. It now requires the library ‘pheatmap’.

Web-interface

  • R package and Methods: IPCA and sparse IPCA functions have been implemented (as well as their associated S3 functions). IPCA stands for Principal Component Analysis with Independent Loadings. It combines the advantages of both PCA and Independent Component Analysis (ICA). PCA is a powerful exploratory tool if the biological question is related to the highest variance. ICA was recently proposed in the literature as an alternative to PCA, as it optimizes an independence condition that can give more meaningful components. A preprint is available upon request.
  • R package and Data: The Liver Toxicity study data has been updated to provide GenBank IDs and gene titles.
  • R package and Data: Two other data sets have been added: Prostate Tumor study (gene expression) and Metabolomic study of Yeast (metabolomics).
  • Web interface: We are making good progress on our associated web-interface (now deployed on http://mixomics.qfab.org). A few illustrative examples are also available: you can download them and run any type of analysis through the interface. We are currently developing a ‘next level analysis’ to provide pathway enrichment analyses and give the functional annotation of the selected genes using the iHOP database. Do not hesitate to give us some feedback!
  • Newsletter: we now have a newsletter. To subscribe, send an email to mixomics[at]math.univ-toulouse.fr with no subject in the body.