This webinar was presented at a seminar for a group of quantitative researchers (mostly statisticians) at the University of Melbourne. The abstract is below.
Topics covered: context of data integration, PCA solved with NIPALS algorithm and SVD, sparse PCA, correlation circle plot interpretation, PLS algorithms and deflation modes, sparse PLS.
Technological improvements have allowed for the collection of data from different types of molecules (e.g. genes, proteins, metabolites, microorganisms) resulting in multiple ‘omics data (e.g. transcriptomics, proteomics, metabolomics, microbiome) measured from the same set of N biospecimens or individuals. In this talk I will introduce the statistical integration of these multi-omics data to shed more light on a biological system.
Integrating data involves numerous challenges: the data sets are complex and large, each with few samples (N < 50) and many molecules (P > 10,000), and generated using different technologies. I will present PLS (Partial Least Squares / Projection to Latent Structures, developed by Wold in the 1980s) as an algorithm of choice for data integration of small N large P problems. PLS and its variants form the basis of our comprehensive mixOmics R package for feature selection, dimension reduction and integration of omics data sets.
This talk is targeted at a general audience with background knowledge in statistics and an interest in large data.
I presented this talk for a group of statisticians at the Australian National University in Canberra. The abstract is below.
Topics covered: linear mixed model splines, multi-omics integration (PLS multiblock), correlation circle plot interpretation, timeOmics.
Longitudinal experiments are becoming increasingly popular in omics studies to monitor molecular changes following treatment or during disease progression. Integrating these data sets can give us some mechanistic insights into the different types of omics layers.
However, longitudinal omics data present numerous challenges, including a small number of time points that may be unevenly spaced and unmatched between different data types, a small number of individuals, and high individual variability. While current approaches have focused on differential expression across time or time profile clustering, the modelling of omics time profiles in a multivariate manner is critically lacking to understand longitudinal biological interactions.
I will present a statistical framework, timeOmics, to identify correlated profiles over time and between omics (transcriptomics, metabolomics, microbiome) to give insights into the molecular dynamics of biological systems and discuss future avenues of research in this expanding area.
timeOmics is currently not directly available from the mixOmics package; instead, it is a separate R package hosted on Bioconductor. See the Bioconductor page for installation instructions.
We have developed a new PLS method for continuous cell type annotation of single cells, now in preprint!
Φ-Space addresses numerous challenges faced by state-of-the-art automated annotation methods:
to identify continuous and out-of-reference cell states,
to deal with batch effects in reference,
to utilise bulk references and multi-omic references.
Φ-Space uses soft classification to phenotype cells on a continuum. The continuous annotation, or phenotype space embedding is then used to reduce the dimensionality of the data for various downstream analyses.
View this 52min video of Kim-Anh Lê Cao presenting Φ-Space at the WEHI Bioinformatics seminar:
Abstract.
Single-cell multi-omics technologies have empowered increasingly refined characterisation of the heterogeneity of cell populations. Automated cell type annotation methods have been developed to transfer cell type labels from well-annotated reference datasets to emerging query datasets. However, these methods suffer from some common caveats, including the failure to characterise transitional and novel cell states, sensitivity to batch effects and under-utilisation of phenotypic information other than cell types (e.g. sample source and disease conditions).
We developed Φ-Space, a computational framework for the continuous phenotyping of single-cell multi-omics data. In Φ-Space we adopt a highly versatile modelling strategy to continuously characterise query cell identity in a low-dimensional phenotype space, defined by reference phenotypes. The phenotype space embedding enables various downstream analyses, including insightful visualisations, clustering and cell type labelling.
We demonstrate through three case studies that Φ-Space (i) characterises developing and out-of-reference cell states; (ii) is robust against batch effects in both reference and query; (iii) adapts to annotation tasks involving multiple omics types; (iv) overcomes technical differences between reference and query.
The versatility of Φ-Space makes it applicable to a wide range of analytical tasks beyond cell type transfer, and its ability to model complex phenotypic variation will facilitate biological discoveries from different omics types.
The Φ-Space package
Φ-Space is currently not directly available from the mixOmics package; instead, it is a separate R package that can be installed from GitHub.
These two recordings were part of a presentation to WEHI for their postgraduate lecture series for a diverse audience.
In the PCA presentation (18 min), we explain the concept of linear combination of variables (components) and useful graphical outputs such as correlation circle plots and biplots.
In the PLS-DA presentation (7 min), we talk about the concept of multivariate signature.
If you want to know more about the actual algorithm under the hood, you can watch this webinar on PLS.
View this 50min video of Dr Saritha Kodikara presenting her method LUPINE:
We also have a second video presented by Prof Kim-Anh Lê Cao who sets LUPINE in the context of microbiome longitudinal data analysis, elaborating more on the types of analytical objects covered in Kodikara et al. (2022) Statistical challenges in longitudinal microbiome data analysis, Briefings in Bioinformatics.
Below you will also find the most common questions related to LUPINE.
FAQ:
Q: Do you build up the network from the covariance matrix or from the inverse covariance matrix? And what are you doing linear regression on?
A: The network is built on the partial correlation, so it would be similar to the inverse covariance matrix. But instead of estimating the inverse covariance matrix, we calculate partial correlations through linear regression. To estimate the partial correlation between taxon a and taxon b, we regress their counts on the low-dimensional representation of the other taxa (excluding taxa a and b). This is then repeated for all pairs (we have an efficient way to do this computationally).
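As a rough illustration of this answer, the Python sketch below estimates one partial correlation by regressing two taxa on a one-dimensional summary of the remaining taxa and then correlating the residuals. It is not the LUPINE implementation: the function names, the use of ordinary least squares on already-transformed values (LUPINE regresses counts with a library size offset) and the toy data are all assumptions made for this example.

```python
import math

def centre(values):
    """Subtract the mean from a list of numbers."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]

def first_component_scores(rows, n_iter=200):
    """Scores on the first principal component of column-centred rows,
    found by power iteration (the idea behind NIPALS for one component)."""
    n, p = len(rows), len(rows[0])
    weights = [1.0 / math.sqrt(p)] * p  # arbitrary unit-norm starting vector
    for _ in range(n_iter):
        scores = [sum(row[j] * weights[j] for j in range(p)) for row in rows]
        weights = [sum(scores[i] * rows[i][j] for i in range(n)) for j in range(p)]
        norm = math.sqrt(sum(w * w for w in weights))
        weights = [w / norm for w in weights]
    return [sum(row[j] * weights[j] for j in range(p)) for row in rows]

def residuals(y, t):
    """Residuals of a least-squares regression of y on t (with intercept)."""
    yc, tc = centre(y), centre(t)
    slope = sum(a * b for a, b in zip(yc, tc)) / sum(b * b for b in tc)
    return [a - slope * b for a, b in zip(yc, tc)]

def pearson(u, v):
    """Pearson correlation between two equally long lists."""
    uc, vc = centre(u), centre(v)
    num = sum(a * b for a, b in zip(uc, vc))
    return num / math.sqrt(sum(a * a for a in uc) * sum(b * b for b in vc))

def partial_correlation(data, a, b):
    """data: list of samples (rows), each a list of taxon values (columns)."""
    # Summarise all taxa except a and b by a single component.
    other_cols = [centre([row[j] for row in data])
                  for j in range(len(data[0])) if j not in (a, b)]
    rows = [list(r) for r in zip(*other_cols)]
    t = first_component_scores(rows)
    # Remove what that component explains from each taxon, then correlate.
    res_a = residuals([row[a] for row in data], t)
    res_b = residuals([row[b] for row in data], t)
    return pearson(res_a, res_b)

# Toy data (made up): 6 samples x 4 taxa, e.g. log-transformed abundances.
toy = [
    [2.1, 1.9, 0.5, 1.0],
    [2.9, 3.2, 1.1, 1.4],
    [4.2, 3.8, 1.4, 2.1],
    [5.1, 5.3, 2.2, 2.4],
    [5.8, 6.1, 2.4, 3.1],
    [7.2, 6.8, 3.1, 3.3],
]
print(partial_correlation(toy, 0, 1))
```

Repeating `partial_correlation` over every pair of taxa would give the full matrix of associations. The sketch does this naively; the efficient computation mentioned in the answer is not reproduced here.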
Q: You reduce the dimension of the data into one dimension. How much variance can be explained by the 1st component in your computation?
A: It depends on the data, but in the data we analysed, if we consider only the single time point scheme with PCA, the first component explained about 25% of the total variance. We could add more components to the regression, but that may overfit the regression model. This is why we only select the first component, which explains much of the variance (for PCA, single time point) or covariance (for PLS, multiple time points).
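To make "variance explained by the first component" concrete, here is a minimal Python sketch for the PCA (single time point) case. It centres the columns, finds the first component by power iteration, and divides the sum of squares of the component scores by the total sum of squares. The function names and the toy matrix are assumptions for illustration only, and the result has nothing to do with the 25% figure quoted for the real data.

```python
import math

def centre_columns(rows):
    """Centre each column of a samples-by-variables table."""
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    return [[row[j] - means[j] for j in range(p)] for row in rows]

def first_component_explained_variance(rows, n_iter=200):
    """Proportion of total variance carried by the first principal component."""
    x = centre_columns(rows)
    n, p = len(x), len(x[0])
    w = [1.0 / math.sqrt(p)] * p  # arbitrary unit-norm starting loading vector
    for _ in range(n_iter):
        t = [sum(r[j] * w[j] for j in range(p)) for r in x]            # scores
        w = [sum(t[i] * x[i][j] for i in range(n)) for j in range(p)]  # loadings
        norm = math.sqrt(sum(v * v for v in w))
        w = [v / norm for v in w]
    t = [sum(r[j] * w[j] for j in range(p)) for r in x]
    component_ss = sum(v * v for v in t)
    total_ss = sum(v * v for r in x for v in r)
    return component_ss / total_ss

# Toy matrix (made up): the first variable carries 8 of the 10 units of variation.
toy = [[2.0, 0.0], [-2.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
print(first_component_explained_variance(toy))  # approximately 0.8
```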
Q: Do you think this approach would work on single-cell data, for example to look at gene co-expression in longitudinal data across time points?
A: It will not work with the present single cell technologies, because in LUPINE we need the same individuals/samples/cells across time to infer the association.
Q: When you do the linear regression, do you regress directly on the counts with all the zeros and the sparsity that you mentioned?
A: Yes, the method was originally developed for count data. We regress on the count data, but we also include library size as an offset to account for different library sizes. The method also works with centred log-ratio (CLR) values, which I used to analyse the third case study.
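For readers unfamiliar with the centred log-ratio transformation mentioned here, a minimal Python sketch follows. Each value in a sample is log-transformed and the sample's mean log value is subtracted. The `pseudocount` argument used to handle zeros is an assumption of this example, not necessarily what was done in the case study.

```python
import math

def clr(counts, pseudocount=1.0):
    """Centred log-ratio transform of one sample's counts.

    pseudocount is added before taking logs so that zero counts are defined
    (an assumption of this sketch; other zero-handling strategies exist).
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

# Toy sample (made up) with a zero count.
sample = [120, 30, 0, 5]
values = clr(sample)
print(values)
print(sum(values))  # CLR values of a sample sum to zero (up to rounding)
```

Without a pseudocount, multiplying all counts in a sample by a constant leaves the CLR values unchanged, which is why the transformation removes differences in library size between samples.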
Q: Do you apply your method for the two groups combined or separately?
A: I model each group separately as we assume that each group has a unique network.
Q: You’re building the networks based on the partial correlations. What about the actual network representation, do you actually binarize it?
A: Yes, I binarize the network based on a correlation test.
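The answer does not say which correlation test is used, so the Python sketch below shows only one common option: a Fisher z-test on each partial correlation, with an edge drawn when the p-value falls below a threshold. The choice of test, the threshold `alpha=0.05`, the number of conditioning variables `k=1` and the toy matrix are all assumptions of this example, and LUPINE may do this differently.

```python
import math

def fisher_p_value(r, n, k=1):
    """Two-sided p-value for a partial correlation r estimated from n samples
    with k conditioning variables, using Fisher's z approximation."""
    r = max(min(r, 0.999999), -0.999999)  # keep atanh finite
    z = math.atanh(r) * math.sqrt(n - k - 3)
    return math.erfc(abs(z) / math.sqrt(2))

def binarise(pcor, n, alpha=0.05, k=1):
    """Turn a symmetric partial correlation matrix into a 0/1 adjacency matrix."""
    p = len(pcor)
    adjacency = [[0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            if fisher_p_value(pcor[i][j], n, k) < alpha:
                adjacency[i][j] = adjacency[j][i] = 1
    return adjacency

# Toy partial correlations (made up) between three taxa, from n = 30 samples.
pcor = [
    [1.0, 0.9, 0.05],
    [0.9, 1.0, -0.1],
    [0.05, -0.1, 1.0],
]
print(binarise(pcor, n=30))  # [[0, 1, 0], [1, 0, 0], [0, 0, 0]]
```

In practice a correction for testing many pairs at once would usually be considered as well; it is left out to keep the sketch short.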
The LUPINE package
LUPINE is currently not directly available from the mixOmics package; instead, it is a separate R package that can be installed from GitHub.
The latest version includes some recent updates (also covered in more detail in the other webinars, check them out!)
The slides are open to the community, but don’t forget to acknowledge the presenter if you are re-using the slides.
Collecting multi-omics data (e.g. transcriptomics, proteomics) from the same set of biospecimens or individuals is a powerful way to understand the underlying molecular mechanisms of a biological system.
mixOmics, a popular R package, integrates omics data from a wide range of sources into a single, unified view making it easier to explore and reveal interactions between omics layers. It overcomes many of the challenges of multi-omic data integration arising from data that are complex and large, with few samples (<50) and many molecules (>10,000), and generated using different technologies.
Prof Kim-Anh Lê Cao, head of the mixOmics team, is delivering this webinar to outline the different methods implemented in mixOmics and how statistical data integration is defined in this context. She will demonstrate how these approaches are applied to analysis of different multi-omics studies and outline the latest methodological developments in this area. From a study of human newborns, to multi-omics microbiomes, and multi-omics in single cells, these examples illustrate how mixOmics is used to perform variable selection and identify a signature of omics markers that characterise a specific phenotype or disease status.
Who the webinar is for: This webinar is for life scientists, bioinformaticians and anyone with an interest in exploration and integration of multiomics biological datasets.
Topics covered: omics data statistical integration, introduction to matrix factorisation techniques, applications of DIABLO and MINT frameworks for bulk or single cell assays, extensions.
The slides are open to the community, but don’t forget to acknowledge the presenter if you are re-using the slides.