Webinar: PLS methods

This webinar was presented as a seminar to a group of quantitative researchers (mostly statisticians) at the University of Melbourne. The abstract is below.

Topics covered: context of data integration, PCA solved with NIPALS algorithm and SVD, sparse PCA, correlation circle plot interpretation, PLS algorithms and deflation modes, sparse PLS.

Technological improvements have allowed for the collection of data from different types of molecules (e.g. genes, proteins, metabolites, microorganisms), resulting in multiple ‘omics data (e.g. transcriptomics, proteomics, metabolomics, microbiome) measured on the same set of N biospecimens or individuals. In this talk I will introduce the statistical integration of these multi-omics data to shed more light on a biological system.

Integrating data involves numerous challenges: data sets are complex and large, each with few samples (N < 50) and many molecules (P > 10,000), and are generated using different technologies. I will present PLS (Partial Least Squares / Projection to Latent Structures, developed by Wold in the 1980s) as an algorithm of choice for data integration in small N, large P problems. PLS and its variants form the basis of our comprehensive mixOmics R package for feature selection, dimension reduction and integration of omics data sets. This talk is targeted at a general audience with background knowledge in statistics and an interest in large data.

The webinar was re-recorded for the PLS section.
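
To make the PLS idea in the abstract above more concrete, here is a minimal sketch of how a first pair of PLS components can be obtained with a NIPALS-style iteration. It is written in plain Python for illustration only: it is not the mixOmics implementation, it omits deflation and sparsity, and the function name `pls_first_component` and the toy matrices are assumptions made for this example.

```python
import math

def center_columns(M):
    """Subtract the column mean from every column of a row-major matrix."""
    n, p = len(M), len(M[0])
    means = [sum(row[j] for row in M) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in M]

def mat_vec(M, v):
    """Return M v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def mat_t_vec(M, v):
    """Return M^T v."""
    return [sum(M[i][j] * v[i] for i in range(len(M))) for j in range(len(M[0]))]

def normalise(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def pls_first_component(X, Y, max_iter=500, tol=1e-12):
    """First PLS loading vectors (a, b) and components (t, u) of X and Y.

    Hypothetical helper: alternates between the two data sets until the
    X loading vector stabilises.
    """
    X, Y = center_columns(X), center_columns(Y)
    u = [row[0] for row in Y]              # start from the first Y column
    a_old = None
    for _ in range(max_iter):
        a = normalise(mat_t_vec(X, u))     # X loading vector, unit norm
        t = mat_vec(X, a)                  # X component (scores)
        b = normalise(mat_t_vec(Y, t))     # Y loading vector, unit norm
        u = mat_vec(Y, b)                  # Y component (scores)
        if a_old is not None and sum((x - y) ** 2 for x, y in zip(a, a_old)) < tol:
            break
        a_old = a
    return a, b, t, u

# Toy data: 5 samples, 3 variables in X and 2 variables in Y (made up).
X = [[2.0, 1.0, 0.5], [1.0, 3.0, 1.5], [4.0, 2.0, 0.0], [3.0, 5.0, 2.5], [5.0, 4.0, 1.0]]
Y = [[1.0, 0.2], [2.0, 0.9], [2.5, 0.1], [4.0, 1.4], [4.5, 0.8]]

a, b, t, u = pls_first_component(X, Y)
covariance = sum(ti * ui for ti, ui in zip(t, u)) / (len(t) - 1)
print("X loadings:", a)
print("cov(t, u):", covariance)
```

Each component is a linear combination of the original variables (`t = X a`, `u = Y b`), and the iteration looks for the pair of loading vectors whose components have the largest covariance. Further components would require a deflation step, which this sketch leaves out.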

Webinar: Time-course multi-omics integration

I presented this talk for a group of statisticians at the Australian National University in Canberra. The abstract is below.

Topics covered: linear mixed model splines, multi-omics integration (PLS multiblock), correlation circle plot interpretation, timeOmics.

Longitudinal experiments are becoming increasingly popular in omics studies to monitor molecular changes following treatment or during disease progression. Integrating these data sets can give us some mechanistic insights into the different types of omics layers.

However, longitudinal omics data present numerous challenges, including a small number of time points that may be unevenly spaced and unmatched between data types, a small number of individuals, and high individual variability. While current approaches have focused on differential expression across time or time profile clustering, the multivariate modelling of omics time profiles, which is needed to understand longitudinal biological interactions, is critically lacking.

I will present a statistical framework, timeOmics, to identify correlated profiles over time and between omics (transcriptomics, metabolomics, microbiome) to give insights into the molecular dynamics of biological systems and discuss future avenues of research in this expanding area.

Some key references

The timeOmics package

timeOmics is not currently available from within the mixOmics package; instead, it is a separate R package hosted on Bioconductor. See the Bioconductor page for installation instructions.

Webinar: Φ-Space for continuous phenotyping of single-cell multi-omics data

We have developed a new PLS method for the continuous cell type annotation of single cells, now in preprint!

  • Φ-Space addresses numerous challenges faced by state-of-the-art automated annotation methods:
    • to identify continuous and out-of-reference cell states,
    • to deal with batch effects in reference,
    • to utilise bulk references and multi-omic references.
  • Φ-Space uses soft classification to phenotype cells on a continuum. The continuous annotation, or phenotype space embedding, is then used to reduce the dimensionality of the data for various downstream analyses.

Φ-Space: Continuous phenotyping of single-cell multi-omics data. Jiadong Mao, Yidi Deng, Kim-Anh Lê Cao. bioRxiv 2024. 

View this 52 min video of Kim-Anh Lê Cao presenting Φ-Space at the WEHI Bioinformatics seminar:

Abstract

Single-cell multi-omics technologies have empowered increasingly refined characterisation of the heterogeneity of cell populations. Automated cell type annotation methods have been developed to transfer cell type labels from well-annotated reference datasets to emerging query datasets. However, these methods suffer from some common caveats, including the failure to characterise transitional and novel cell states, sensitivity to batch effects and under-utilisation of phenotypic information other than cell types (e.g. sample source and disease conditions).

We developed Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. In Φ-Space we adopt a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes. The phenotype space embedding enables various downstream analyses, including insightful visualisations, clustering and cell type labelling.

We demonstrate through three case studies that Φ-Space (i) characterises developing and out-of-reference cell states; (ii) is robust against batch effects in both reference and query; (iii) adapts to annotation tasks involving multiple omics types; (iv) overcomes technical differences between reference and query.

The Φ-Space package

Φ-Space is not currently available from within the mixOmics package; instead, it is a separate R package that can be installed from GitHub.

Webinar: PCA and PLS-DA

These two recordings were part of a presentation at WEHI for their postgraduate lecture series, aimed at a diverse audience.

In the PCA presentation (18 min), we explain the concept of linear combinations of variables (components) and useful graphical outputs such as correlation circle plots and biplots.

In the PLS-DA presentation (7 min), we talk about the concept of a multivariate signature.

If you want to know more about the actual algorithm under the hood, you can watch this webinar on PLS.

Webinar: Microbial network inference for longitudinal microbiome studies with LUPINE

Our latest method based on PLS to infer microbial networks across time is now in preprint!

  • LUPINE is a PLS-based method that combines dimension reduction and partial correlations to infer associations between taxa.
  • LUPINE takes into account information across time points.
  • LUPINE has been designed for relatively small sample sizes and a small number of time points.

Microbial network inference for longitudinal microbiome studies with LUPINE. Saritha Kodikara, Kim-Anh Lê Cao. bioRxiv 2024.05.08.593086.

View this 50 min video of Dr Saritha Kodikara presenting her method LUPINE:

We also have a second video presented by Prof Kim-Anh Lê Cao, who sets LUPINE in the context of microbiome longitudinal data analysis, elaborating more on the types of analytical objects covered in Kodikara et al. (2022), Statistical challenges in longitudinal microbiome data analysis, Briefings in Bioinformatics.

Below you will also find the most common questions related to LUPINE.

FAQ:

Q: Do you build up the network from the covariance matrix or from the inverse covariance matrix? And what are you doing linear regression on?

A: The network is built on partial correlations, so it is similar to using the inverse covariance matrix. But instead of estimating the inverse covariance matrix, we calculate partial correlations through linear regression. To estimate the partial correlation between taxon a and taxon b, we regress their counts on the low-dimensional representation of the other taxa (excluding taxa a and b). This is then repeated for all pairs (we have an efficient way to do this computationally).
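
As a rough illustration of the answer above, the sketch below computes a partial correlation through linear regression: each of the two taxa is regressed on a one-dimensional summary of the remaining taxa, and the two sets of residuals are then correlated. This is a simplified stand-in, not the LUPINE code: it uses ordinary least squares on continuous values rather than a count model, the summary vector `z` is supplied directly instead of being computed by PCA or PLS, and all names and numbers are made up for the example.

```python
import math

def residuals(y, z):
    """Residuals of the least-squares fit y ~ intercept + slope * z."""
    n = len(y)
    mean_y, mean_z = sum(y) / n, sum(z) / n
    szz = sum((zi - mean_z) ** 2 for zi in z)
    slope = sum((zi - mean_z) * (yi - mean_y) for zi, yi in zip(z, y)) / szz
    return [(yi - mean_y) - slope * (zi - mean_z) for zi, yi in zip(z, y)]

def pearson(u, v):
    """Pearson correlation between two equal-length vectors."""
    n = len(u)
    mean_u, mean_v = sum(u) / n, sum(v) / n
    num = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    den = math.sqrt(sum((a - mean_u) ** 2 for a in u) * sum((b - mean_v) ** 2 for b in v))
    return num / den

def partial_correlation(a, b, z):
    """Correlation between a and b once the effect of z is regressed out."""
    return pearson(residuals(a, z), residuals(b, z))

# Toy example: z stands for the first component of all the other taxa.
z = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
shared = [1.0, -2.0, 1.0, 1.0, -2.0, 1.0]        # signal shared by a and b only
taxon_a = [2 * zi + s for zi, s in zip(z, shared)]
taxon_b = [-zi + s for zi, s in zip(z, shared)]

print("marginal correlation:", pearson(taxon_a, taxon_b))
print("partial correlation: ", partial_correlation(taxon_a, taxon_b, z))
```

In this toy example the marginal correlation between the two taxa is negative because they respond to `z` in opposite directions, whereas the partial correlation reveals the signal they share once `z` is accounted for.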

Q: You reduce the dimension of the data into one dimension. How much variance can be explained by the 1st component in your computation?

A: It depends on the data, but in the data we analysed, and if we consider only the single time point scheme with PCA, the first component explained about 25% of the total variance. We could add more components into the regression, but that may overfit the regression model. This is why we only select the first component, which explains much of the variance (for PCA, single time point) or covariance (for PLS, multiple time points).
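
For readers who want to see how such a proportion is obtained, here is a minimal sketch that extracts the first principal component with a NIPALS-style iteration and reports the share of total variance it explains. It is plain Python for illustration, not the LUPINE or mixOmics code; the function name `first_pc_explained_variance` and the toy matrix are assumptions.

```python
import math

def first_pc_explained_variance(X, max_iter=1000, tol=1e-12):
    """Return (loadings, scores, proportion of variance) for the first PC."""
    n, p = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(p)]
    Xc = [[row[j] - means[j] for j in range(p)] for row in X]   # centre columns
    total_ss = sum(v * v for row in Xc for v in row)

    # Start the scores from the column with the largest sum of squares.
    col_ss = [sum(row[j] ** 2 for row in Xc) for j in range(p)]
    start = col_ss.index(max(col_ss))
    scores = [row[start] for row in Xc]

    for _ in range(max_iter):
        tt = sum(v * v for v in scores)
        loadings = [sum(Xc[i][j] * scores[i] for i in range(n)) / tt for j in range(p)]
        norm = math.sqrt(sum(v * v for v in loadings))
        loadings = [v / norm for v in loadings]                  # unit-norm loadings
        new_scores = [sum(Xc[i][j] * loadings[j] for j in range(p)) for i in range(n)]
        change = sum((a - b) ** 2 for a, b in zip(new_scores, scores))
        scores = new_scores
        if change < tol:
            break

    return loadings, scores, sum(v * v for v in scores) / total_ss

# Toy data in which the second variable is exactly twice the first,
# so all the variance lies along a single direction.
X_toy = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
loadings, scores, proportion = first_pc_explained_variance(X_toy)
print("proportion of variance explained by PC1:", proportion)
```

On real microbiome data the proportion is usually far lower than in this degenerate example, which is why the figure quoted in the answer depends on the data set.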

Q: Do you think this approach would work on single-cell data, to look at gene co-expression in longitudinal data across time points?

A: It will not work with the present single cell technologies, because in LUPINE we need the same individuals/samples/cells across time to infer the association.

Q: When you do the linear regression, do you regress directly on the counts with all the zeros and the sparsity that you mentioned?

A: Yes, the method was originally developed for count data. We regress on the count data, but we also include library size as an offset to account for different library sizes. The method also works with centred log-ratio (CLR) values, which I used to analyse the third case study.
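
For reference, the centred log-ratio transformation mentioned above can be sketched as follows. This is a generic illustration rather than the code used in LUPINE; in particular, the pseudocount added to handle zeros is an assumption of this example, and the sample counts are made up.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one sample's counts.

    The pseudocount (an assumption here) avoids taking the log of zero.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

sample = [0, 12, 150, 3, 0, 47]           # toy counts for six taxa
library_size = sum(sample)                # total count for this sample
clr_values = clr(sample)

print("library size:", library_size)
print("CLR values:  ", clr_values)
```

CLR values sum to zero within each sample, and without a pseudocount they are unchanged when all counts are multiplied by a constant, which is one way to see that the transformation removes the effect of library size.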

Q: Do you apply your method for the two groups combined or separately?

A: I model each group separately as we assume that each group has a unique network.

Q: You’re building the networks based on the partial correlations. What about the actual network representation, do you binarize it?

A: Yes, I binarize the network based on a correlation test.

The LUPINE package

LUPINE is not currently available from within the mixOmics package; instead, it is a separate R package that can be installed from GitHub.

Webinar: mixOmics in 50 minutes

This latest seminar was hosted by Australian BioCommons / EMBL-ABR / ARDC in March 2024.

The latest version includes some recent updates (also covered in more detail in the other webinars; check them out!).

The slides are open to the community, but don’t forget to acknowledge the presenter if you are re-using them.

Collecting multi-omics data (e.g. transcriptomics, proteomics) from the same set of biospecimens or individuals is a powerful way to understand the underlying molecular mechanisms of a biological system.

mixOmics, a popular R package, integrates omics data from a wide range of sources into a single, unified view making it easier to explore and reveal interactions between omics layers. It overcomes many of the challenges of multi-omic data integration arising from data that are complex and large, with few samples (<50) and many molecules (>10,000), and generated using different technologies. 

Prof Kim-Anh Lê Cao, head of the mixOmics team, delivers this webinar to outline the different methods implemented in mixOmics and how statistical data integration is defined in this context. She demonstrates how these approaches are applied to the analysis of different multi-omics studies and outlines the latest methodological developments in this area. From a study of human newborns, to multi-omics microbiomes, and multi-omics in single cells, these examples illustrate how mixOmics is used to perform variable selection and identify a signature of omics markers that characterise a specific phenotype or disease status.

Who the webinar is for: This webinar is for life scientists, bioinformaticians and anyone with an interest in the exploration and integration of multi-omics biological datasets.

Topics covered: omics data statistical integration, introduction to matrix factorisation techniques, applications of DIABLO and MINT frameworks for bulk or single cell assays, extensions.

Any mixOmics-related question can be sent to https://mixomics-users.discourse.group (you will need to log in, but there is no mail traffic associated).