Glossary

The terminology used in mixOmics can be confusing, especially for newcomers. This glossary provides key definitions to help understand the methods and their outputs.

Input data basics:

Datasets

One dataset contains measurements from a single modality (e.g. transcriptomics). If you have multiple modalities (e.g. transcriptomics and metabolomics), these should be organised in separate tables. Datasets are made up of samples and features: samples should be in the rows and features in the columns.
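As a minimal sketch of this layout (Python with NumPy here, purely illustrative — mixOmics itself is an R package, and the numbers below are made up), two modalities measured on the same samples sit in two separate tables that share the row dimension:

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 6                                        # N: shared across modalities
transcriptomics = rng.normal(size=(n_samples, 10))   # 10 gene features
metabolomics = rng.normal(size=(n_samples, 4))       # 4 metabolite features

# Both datasets describe the same samples, so their row counts must match
assert transcriptomics.shape[0] == metabolomics.shape[0]
print(transcriptomics.shape, metabolomics.shape)     # (6, 10) (6, 4)
```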

Samples, Individuals, N

Samples, also called Individuals or simply ‘N‘, make up the rows of your dataset. Samples/Individuals are the experimental units on which information is collected (e.g. patients, mice, cell lines, faecal samples, etc). Samples can be grouped into classes.

Features, Variables, P

Features, also called Variables, or simply ‘P‘, make up the columns of your dataset. Features represent the different things that were measured for each sample (e.g. gene expression, protein levels, etc).

Continuous vs Categorical

Continuous variables are numeric (e.g. the expression of a gene), whilst categorical variables cannot be placed on a continuous scale (e.g. the sex of a patient).

Classes

Samples can be grouped into classes e.g. sex, tumour classification group, geographical location. Classes can be included in the data as a single column made up of a categorical variable. In mixOmics, this column will be coded as a ‘dummy matrix’ of numbers.
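A dummy matrix is simply a one-hot encoding of the class column: one column per class level, with a 1 marking each sample's class. A quick sketch (Python/NumPy for illustration, not the mixOmics internals):

```python
import numpy as np

# A categorical class column for five samples (hypothetical labels)
classes = np.array(["tumour", "healthy", "tumour", "healthy", "healthy"])
levels = np.unique(classes)                  # ['healthy', 'tumour']

# Broadcasting compares each sample against each level -> dummy matrix
dummy = (classes[:, None] == levels).astype(int)
print(dummy)

# Each row contains exactly one 1, marking that sample's class
assert (dummy.sum(axis=1) == 1).all()
```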

Multilevel

An experimental design with repeated measurements and/or paired data (e.g. collecting samples from the same patient at two timepoints or from two parts of the body) will result in multilevel data. These datasets require multilevel analysis to account for variation between individuals, which may be greater than the variation between your timepoints/samples/etc.

Dimensionality reduction:

Components, Variates

Components are artificial variables built from a linear combination of original variables. The way they are created depends on the method, e.g. in PCA components are created to maximise variance. Components provide a new way to represent the samples e.g. in sample plots. In methods like (s)PCA, these components will be orthogonal (perpendicular). This is not guaranteed using other methods.
Variates are another name for components, but the term is really only used in (r)CCA. In this context, the components are called canonical variates.
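The orthogonality of PCA components can be checked numerically. A small sketch (Python/NumPy for illustration — mixOmics is an R package — computing PCA via the singular value decomposition of centred data):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(20, 4))
X = X - X.mean(axis=0)            # PCA assumes centred data

# SVD-based PCA: rows of Vt are the loading vectors,
# and projecting X onto them gives the components (scores)
U, s, Vt = np.linalg.svd(X, full_matrices=False)
components = X @ Vt.T

# PCA components are mutually orthogonal: their Gram matrix is diagonal
gram = components.T @ components
off_diag = gram - np.diag(np.diag(gram))
assert np.allclose(off_diag, 0, atol=1e-8)
```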

Loadings

As described above, components are linear combinations of features. Loadings represent the weights (or coefficients) assigned to each of the features to determine their contribution to a given component. They can be visualised with a Loadings Bar Plot.
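The phrase "linear combination" means the component is literally a weighted sum of the feature columns, with the loadings as the weights. A minimal sketch (Python/NumPy, with made-up loading values):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(5, 3))        # 5 samples, 3 features
X = X - X.mean(axis=0)             # centre the data

loadings = np.array([0.8, -0.5, 0.33])   # one weight per feature (illustrative)
component = X @ loadings                 # one score per sample

# The component is the weighted sum of the feature columns
manual = 0.8 * X[:, 0] - 0.5 * X[:, 1] + 0.33 * X[:, 2]
assert np.allclose(component, manual)
```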

Describing data and relationships between datasets:

Sparse (data)

Generally, a set of values which contains many zeroes is referred to as sparse. A sparse dataset is one where a large portion of the measurements are zero.
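The "large portion of zeroes" can be quantified as the fraction of zero entries. A tiny sketch (Python/NumPy, hypothetical counts):

```python
import numpy as np

# A small dataset where most measurements are zero
X = np.array([[0, 3, 0, 0],
              [1, 0, 0, 2],
              [0, 0, 0, 0]])

sparsity = np.mean(X == 0)   # fraction of entries that are zero
print(sparsity)              # 0.75 -> three quarters of the entries are zero
```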

Variance

A measure of the spread of one variable. High variance indicates that data points are spread out from the mean and from one another.
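For example, two variables with the same mean can have very different variances depending on their spread (Python standard library, illustrative numbers):

```python
import statistics

tight = [9.9, 10.0, 10.1]    # values close to the mean -> low variance
spread = [1.0, 10.0, 19.0]   # same mean, values far apart -> high variance

assert statistics.mean(tight) == statistics.mean(spread) == 10.0
assert statistics.variance(tight) < statistics.variance(spread)
```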

Association

This is a very broad umbrella term which can refer to any type of relationship between a given set of variables and/or components.

Covariance and correlation

Covariance is a measure of the strength of the relationship between two features, i.e. whether they covary. A large positive covariance indicates that two variables tend to vary in the same way, e.g. weight and height in individuals (the heaviest tend to be the tallest). A covariance value has no lower or upper bound, and its magnitude depends on the units in which the variables are measured.
Correlation is a standardised version of covariance that is always bounded between -1 and 1.
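The difference between the two can be seen by rescaling one variable: covariance changes with the units, correlation does not. A sketch (Python/NumPy, hypothetical height/weight values):

```python
import numpy as np

height = np.array([150.0, 160.0, 170.0, 180.0, 190.0])   # cm
weight = np.array([55.0, 62.0, 70.0, 80.0, 95.0])        # kg

cov = np.cov(height, weight)[0, 1]        # unbounded, unit-dependent
corr = np.corrcoef(height, weight)[0, 1]  # always in [-1, 1]

# Rescaling height to metres changes the covariance...
cov_m = np.cov(height / 100, weight)[0, 1]
assert not np.isclose(cov, cov_m)

# ...but leaves the correlation untouched, and within its bounds
assert np.isclose(corr, np.corrcoef(height / 100, weight)[0, 1])
assert -1 <= corr <= 1
```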

Canonical correlation

While correlation can be used when talking about the association between any type of variable or component, canonical correlation is used specifically when talking about the novel components that were generated by a given method.

mixOmics methods:

Sparse (method)

The sparse variant of a method (e.g. sPCA is the sparse version of PCA) uses only a subset of optimally selected variables; in other words, the loadings of the majority of variables are shrunk to zero. These methods are useful if you want to identify which variables are the most important in your dataset.
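Concretely, a sparse loading vector has mostly zero weights, so the selected variables are simply the ones with non-zero loadings. A sketch (Python/NumPy, made-up loading values, not mixOmics output):

```python
import numpy as np

# A sparse loading vector: most weights have been shrunk to zero
loadings = np.array([0.0, 0.70, 0.0, 0.0, -0.71, 0.0])

# The selected variables are those with non-zero loadings
selected = np.nonzero(loadings)[0]
print(selected)           # features 1 and 4 are the only contributors
assert len(selected) == 2
```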

Supervised vs Unsupervised

A supervised model predicts a response variable (or multiple) using a set of predictors (e.g. PLS-DA, PLS Regression mode). In an unsupervised model there isn’t any response variable; rather, the model is purely for dimension reduction and exploration of the data (e.g. PCA, PLS Canonical mode).

Classification vs Regression

Supervised models can be further divided into classification and regression models. Classification models predict categorical outcomes (e.g. PLS-DA), whilst regression models predict continuous outcomes (e.g. PLS in regression mode). Classification methods have ‘-DA‘ at the end of their name, which stands for ‘Discriminant Analysis‘.