The terminology used on this site and elsewhere in relation to the various methods available in mixOmics can sometimes be confusing, especially for newcomers. This page provides a few basic definitions and distinctions to aid in understanding the usage and output of our methods.
Variables and Features
Prior to any analysis using mixOmics, you will be in possession of one or more sets of data. These will be made up of samples and variables. The variables should make up the columns of your data and represent the different things that were measured for each sample (e.g. gene expression, protein levels, etc.). Variables and features refer to the same thing: the original measurements that are used as input to the analysis.
Components
The concept of components is one of the most central to the mixOmics package. It refers to the new axes that are generated as part of the dimension reduction process. They are linear combinations of the input features and define a new space into which the samples are projected. In methods like (s)PCA, these components will be orthogonal (perpendicular). This is not guaranteed with other methods.
Variates
Variate is essentially synonymous with component, but the term is really only used in an (r)CCA context. The new components yielded by (r)CCA are called canonical variates.
Loadings
As described above, components are linear combinations of features. Loadings are the weights (or coefficients) assigned to each feature, which determine its contribution to a given component.
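To make the relationship between features, loadings and components concrete, here is a minimal sketch in plain Python. It is not mixOmics code: the data and the loading vector are made up for illustration, and real methods would typically centre (and often scale) the data before this step.

```python
# Toy data: 3 samples (rows) x 2 variables (columns). Values are made up.
X = [
    [1.0, 2.0],
    [3.0, 0.0],
    [5.0, 4.0],
]

# Hypothetical loading vector: one weight per variable (not from a fitted model).
loadings = [0.6, 0.8]


def component_scores(data, weights):
    """Project each sample onto a component: score = sum(weight_j * x_j)."""
    return [sum(w * x for w, x in zip(weights, row)) for row in data]


# One score per sample: the sample's coordinate on the new axis.
scores = component_scores(X, loadings)
```

Each sample ends up with a single coordinate on the new axis, and a variable with a larger absolute loading contributes more to that coordinate.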
Correlation
Correlation is a very broad umbrella term which can refer to any type of correlation or relationship between a given set of variables and/or components.
Canonical Correlation
While correlation can describe the association between any type of variable or component, canonical correlation is used specifically when talking about the new components that were generated by a given method.
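As an illustration, the sketch below computes a Pearson correlation between the scores of the same samples on two components, one derived from each of two datasets. In CCA, the correlation between such a pair of canonical variates is what is called a canonical correlation. The score values and the names `variate_x` and `variate_y` are invented for this example and do not come from a fitted model.

```python
from math import sqrt


def pearson(u, v):
    """Pearson correlation between two equal-length sequences of numbers."""
    n = len(u)
    mean_u = sum(u) / n
    mean_v = sum(v) / n
    cov = sum((a - mean_u) * (b - mean_v) for a, b in zip(u, v))
    var_u = sum((a - mean_u) ** 2 for a in u)
    var_v = sum((b - mean_v) ** 2 for b in v)
    return cov / sqrt(var_u * var_v)


# Made-up scores of four samples on a pair of components, one per dataset.
variate_x = [-1.5, -0.5, 0.5, 1.5]
variate_y = [-1.0, -1.0, 1.0, 1.0]

r = pearson(variate_x, variate_y)
```

The same function applies equally to two original variables, or to a variable and a component, which is why correlation is the broader term.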
Sparse
Generally, a set of values which contains many zeroes is referred to as sparse. In practice, this term has two closely related meanings. A sparse dataset is one where a large proportion of the measurements are zero. The sparse variant of a method (e.g. sPCA is the sparse version of PCA) uses only a subset of optimally selected variables; in other words, the loadings of a majority of variables are reduced to zero.
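The second meaning can be sketched with soft-thresholding, a common way for lasso-type penalties to push small weights to exactly zero. This is an illustrative assumption about how sparsity can arise, not a description of the exact procedure used inside any particular method; the loading values and the threshold are made up.

```python
def soft_threshold(values, threshold):
    """Shrink each value towards zero by `threshold`; small values become exactly 0."""
    shrunk = []
    for v in values:
        magnitude = abs(v) - threshold
        if magnitude <= 0:
            shrunk.append(0.0)
        else:
            shrunk.append(magnitude if v > 0 else -magnitude)
    return shrunk


# Hypothetical dense loading vector over six variables.
dense = [0.70, -0.05, 0.10, -0.60, 0.02, 0.35]
sparse = soft_threshold(dense, threshold=0.30)

# Indices of variables whose loading survived, i.e. the "selected" variables.
selected = [i for i, w in enumerate(sparse) if w != 0.0]
```

Only the variables with a non-zero loading after thresholding contribute to the component, which is what makes the sparse variant act as a variable selection step.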
Supervised vs Unsupervised
This concept extends to all modelling and machine learning practice. Put simply, a supervised model is built to predict one or more response variables from a set of predictors. (s)PLS-DA and (s)PLS regression are prime examples of this. An unsupervised model has no response to predict; it is used purely for dimension reduction and exploration of the data. (s)PCA is the most intuitive case.
In classification models, the categorical response variable has a certain number of levels, or classes. When there are only two classes, this is referred to as binary classification. When there are more than two classes, it is referred to as multiclass classification. The multiclass setting introduces new concepts for assessing classification performance.
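One measure that is commonly used when classes are of unequal size is the balanced error rate, which averages the error rate of each class. The sketch below computes it alongside the overall error rate. It is plain Python with made-up labels; the function name `error_rates` is hypothetical and not part of any package.

```python
from collections import Counter


def error_rates(true_labels, predicted_labels):
    """Return (overall error rate, balanced error rate) for any number of classes."""
    totals = Counter(true_labels)
    wrong = Counter(t for t, p in zip(true_labels, predicted_labels) if t != p)
    overall = sum(wrong.values()) / len(true_labels)
    # Balanced error rate: average of the per-class error rates, so that
    # a small class counts as much as a large one.
    balanced = sum(wrong[c] / totals[c] for c in totals) / len(totals)
    return overall, balanced


# Made-up multiclass example: three classes of unequal size.
truth = ["A", "A", "A", "A", "B", "B", "C", "C"]
preds = ["A", "A", "A", "A", "B", "A", "C", "B"]

overall, balanced = error_rates(truth, preds)
```

In this toy example the overall error rate looks modest because the largest class is predicted perfectly, while the balanced error rate is higher because half of the samples in each small class are misclassified.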