Missing Values

All methodologies implemented in mixOmics can handle missing values. In particular, (s)PLS, (s)PLS-DA and (s)PCA use the NIPALS (Non-linear Iterative Partial Least Squares) algorithm as part of their dimension reduction procedures. This algorithm is built to handle NAs [1].

This is implemented through the nipals() function within
mixOmics. This function is called internally by the above methods but
can also be used manually, as can be seen below.
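To give an intuition for why NIPALS tolerates NAs before turning to the mixOmics functions: each NIPALS update is a simple least-squares regression, so a missing cell can be left out of the sums. The base R sketch below shows one such update for a single variable. The function name na_aware_slope and the toy vectors are made up for this illustration and are not part of mixOmics.

```r
# Regress one variable x on a score vector t, using only the rows
# where x is observed (hypothetical helper, for illustration only).
na_aware_slope <- function(x, t) {
  obs <- !is.na(x)                         # rows with an observed value
  sum(x[obs] * t[obs]) / sum(t[obs]^2)     # least-squares slope through the origin
}

x <- c(2, NA, 6, 8)   # one variable with a missing entry
t <- c(1, 2, 3, 4)    # current score vector
slope <- na_aware_slope(x, t)
slope
```

Here the three observed values satisfy x = 2 * t, so the regression recovers a slope of 2 even though one value is missing.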

Usage in mixOmics

library(mixOmics)
data(liver.toxicity)
X <- liver.toxicity$gene[, 1:100] # a reduced size data set

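## pretend there are 20 NA values in our data
## (rows and columns are sampled independently, so a repeated
## row/column pair would give fewer than 20 distinct cells)
na.row <- sample(1:nrow(X), 20, replace = TRUE)
na.col <- sample(1:ncol(X), 20, replace = TRUE)
X.na <- as.matrix(X)

## set the sampled cells to NA
X.na[cbind(na.row, na.col)] <- NA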
sum(is.na(X.na)) # number of cells with NA

## [1] 20

# this might take some time depending on the size of the data set
nipals.tune = nipals(X.na, ncomp = 10)$eig
barplot(nipals.tune, xlab = 'Principal component', ylab = 'Explained variance')

FIGURE 1: Column graph of the explained variance of each Principal
Component.

If missing values need to be imputed, the package provides impute.nipals() for this purpose. NIPALS is used to decompose the dataset. The resulting components, singular values and feature loadings are then used to reconstitute the original dataset, with estimated values in place of the missing ones. To estimate the missing values as well as possible, a large number of components is used (ncomp = 10).
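The reconstitution step can be sketched in base R. This is a simplified illustration, not the code behind impute.nipals(): the function nipals_sketch(), its arguments and the toy matrix are assumptions made for this example, centring and scaling are left out, and the singular values are folded into the scores rather than returned separately.

```r
# Simplified NIPALS decomposition that skips missing cells (illustrative only).
nipals_sketch <- function(X, ncomp = 1, max.iter = 500, tol = 1e-9) {
  X <- as.matrix(X)
  miss <- is.na(X)
  obs <- (!miss) * 1          # 1 where observed, 0 where missing
  R <- X
  R[miss] <- 0                # zeros add nothing to the sums below
  scores <- matrix(0, nrow(X), ncomp)
  loadings <- matrix(0, ncol(X), ncomp)
  for (h in seq_len(ncomp)) {
    t.h <- R[, which.max(colSums(R^2))]   # initial score vector
    for (iter in seq_len(max.iter)) {
      # loadings: regress each column on t.h over its observed rows
      p.h <- drop(crossprod(R, t.h)) / drop(crossprod(obs, t.h^2))
      p.h <- p.h / sqrt(sum(p.h^2))
      # scores: regress each row on p.h over its observed columns
      t.new <- drop(R %*% p.h) / drop(obs %*% p.h^2)
      converged <- sum((t.new - t.h)^2) < tol * sum(t.new^2)
      t.h <- t.new
      if (converged) break
    }
    scores[, h] <- t.h
    loadings[, h] <- p.h
    R <- R - outer(t.h, p.h)  # deflate before extracting the next component
    R[miss] <- 0
  }
  list(t = scores, p = loadings)
}

# Toy matrix of exact rank one with three cells removed
a <- c(1, 2, 3, 4, 5)
b <- c(2, -1, 0.5, 3)
toy.full <- outer(a, b)
toy.na <- toy.full
toy.na[cbind(c(2, 4, 5), c(1, 3, 4))] <- NA

fit <- nipals_sketch(toy.na, ncomp = 1)
toy.hat <- fit$t %*% t(fit$p)            # reconstituted matrix
toy.impute <- toy.na
toy.impute[is.na(toy.na)] <- toy.hat[is.na(toy.na)]   # fill only the NA cells
max(abs(toy.impute - toy.full))          # imputation error on the toy data
```

Because the toy matrix has exactly one underlying component, a single component is enough to recover the removed cells up to numerical error. Real data are not exactly low rank, which is why the estimates are only approximate in practice.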

X.impute <- impute.nipals(X = X.na, ncomp = 10)
sum(is.na(X.impute)) # number of cells with NA

## [1] 0

The difference between the imputed and real values can be checked.
Here are the original values:

id.na = is.na(X.na) # determine position of NAs in the matrix

X[id.na] # show original values

##  [1]  0.09041 -0.04070  0.03497 -0.01712  0.01309  0.00233 -0.04142  0.11104
##  [9] -0.01519 -0.17034 -0.01641  0.15964  0.00557 -0.06217  0.04131  0.02157
## [17]  0.01226 -0.00753  0.03038 -0.00783

The values which were estimated via the NIPALS
algorithm:

X.impute[id.na] # show imputed values

##  [1]  0.0837747419 -0.0190061068  0.0004024897 -0.0180879247 -0.0094185656
##  [6] -0.0312362158 -0.0706920015  0.1400817774  0.0083359545 -0.1158255139
## [11]  0.0164817649  0.1007897385  0.0236184385  0.0191934144  0.0214240977
## [16]  0.0686280312 -0.0039198425  0.0085870558  0.0450234407  0.0013964758
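One way to summarise this comparison is with a few error measures. The sketch below simply re-enters the 20 original and imputed values printed above. The object names original and imputed are chosen for this example; in the session above they would correspond to X[id.na] and X.impute[id.na].

```r
# Values copied from the printed output above
original <- c(0.09041, -0.04070, 0.03497, -0.01712, 0.01309, 0.00233,
              -0.04142, 0.11104, -0.01519, -0.17034, -0.01641, 0.15964,
              0.00557, -0.06217, 0.04131, 0.02157, 0.01226, -0.00753,
              0.03038, -0.00783)
imputed <- c(0.0837747419, -0.0190061068, 0.0004024897, -0.0180879247,
             -0.0094185656, -0.0312362158, -0.0706920015, 0.1400817774,
             0.0083359545, -0.1158255139, 0.0164817649, 0.1007897385,
             0.0236184385, 0.0191934144, 0.0214240977, 0.0686280312,
             -0.0039198425, 0.0085870558, 0.0450234407, 0.0013964758)

abs.err <- abs(imputed - original)            # per-cell absolute error
rmse <- sqrt(mean((imputed - original)^2))    # root mean squared error
rmse
max(abs.err)                                  # worst single estimate
cor(original, imputed)                        # agreement between the two sets
```

With only 20 cells these summaries are indicative at best, and they will change with a different random selection of missing cells.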

References

[1] Wold, H. (1973). Nonlinear Iterative Partial Least Squares (NIPALS) Modelling: Some Current Developments. Multivariate Analysis–III, 383-407. https://doi.org/10.1016/b978-0-12-426653-7.50032-6

[closed] Online workshop (on-demand)

This workshop will only be run for a specific group of participants. Other online courses will be announced soon!

We will ask you to fill in the internal survey so that we can tweak the course accordingly. Do not forget to lock the dates into your calendar!

Context. Advances in high-throughput technologies have transformed the way we examine molecular information. However, analytical tool development is critically trailing behind data generation, which hinders the analysis, understanding or integration of omics data. Data integration adopts a holistic, data-driven and hypothesis-free approach. This new approach is necessary to understand the role of biological systems and posit new hypotheses.

This online workshop will introduce the multivariate dimension reduction methods developed in mixOmics for statistical analysis. Our methods make no distributional assumptions and are highly flexible, covering unsupervised (exploratory), supervised (classification) and integration analyses. Various analytical frameworks will be presented, ranging from data exploration and selection of markers to integration with other omics datasets and an introduction to time-course analysis. There will also be an opportunity to discuss the analysis of microbiome data and time-course data.

Each methodology will be illustrated on real biological studies during a short hands-on session in R. You can also bring your own data and analyse it on the spot using the R scripts that we will provide. The workshop will cover general omics data integration concepts with appropriate case studies.

Instructors: A/Profs Sébastien Déjean (University of Toulouse, sessions 1-3) and Kim-Anh Lê Cao (University of Melbourne, sessions 4-5)

Material includes lecture notes, slides, R code, and data.

Bring your own data. Participants will be given the opportunity to analyse their own data using the R code provided. We will give specific instructions on how to process and format the data. Participants can also work in a team. Some data sets will also be provided for those unable to bring their own data.

Dates for the five sessions (approx 2h per session):

Sept 21st, 23rd, 28th 9-11am EST / 9-11pm Singapore (same day for all)

Sept 30th, Oct 5th 6-8pm EST / 6-8am Singapore (+1 day)

Contact: mixomics[ at] math.univ-toulouse.fr (for pre-requisite or content)

Prerequisites and requirements. Trainees need a good working knowledge of R programming (e.g. handling data frames, performing simple calculations and displaying simple graphical outputs) to fully benefit from the workshop. Participants are requested to bring their own laptop with the software RStudio (http://www.rstudio.com/) and the R package mixOmics installed (instructions will be provided prior to the training).

Outline

Each session is 2h long, roughly divided into a 50 min presentation, a 10 min break and a 1h hands-on with a recap at the end of the session.

Session 1: PCA and sparse PCA 101

We will start with the basics that are necessary to understand the more complicated concepts!

Session 2: PLS-Discriminant Analysis

We will move on to discriminant analysis, to separate sample groups and identify molecular signatures. The hands-on session can include your own data. (PLS = Projection to Latent Structures / Partial Least Squares)

Session 3: integration of two data sets with PLS and CCA

This session will also introduce useful graphics to visualise the results of those methods. BYO data welcome. (CCA = Canonical Correlation Analysis)

Session 4: multi-omics data integration with block PLS (DIABLO)

Building on the previous sessions, we will cover multiblock PLS-DA with additional numerical and graphical outputs. You will analyse your own data (if you have already analysed it with the previous methods) or data provided in the package.

Session 5: various methodological extensions

This more theoretical session will cover recent methodological developments ‘around’ (but not necessarily ‘in’) mixOmics, from compositional data analysis (for microbiome studies) and batch effect management to P-integration and time-course omics data exploration (topics chosen according to your needs). This session will not include a hands-on session, but relevant R code / vignettes will be handed out.

The following statistical concepts will be introduced: covariance and correlation, multiple linear regression, classification and prediction, cross-validation, selection of markers, penalised regressions. Each methodology will be illustrated on a case study (theory and application will alternate).

Target group. The course is intended for microbiologists working in the fields of bioinformatics, computational biology and applied statistics with some statistical knowledge and a good working knowledge of R. It will be particularly useful to those interested in:

Exploring data sets.
Selecting molecular / microbial features with methods implementing LASSO-based penalisations.
Using graphical techniques to better visualise data.
Understanding and/or applying multivariate projection methodologies to large data sets.

Anticipated learning outcomes. After completion of this workshop, participants will be able to:

Understand the fundamental principles of multivariate projection-based dimension reduction techniques.
Perform statistical integration and feature selection using recently developed multivariate methodologies.
Apply those methods to high throughput microbiome studies, including their own studies.

Version 6.1.0 and latest publications

We are proud to announce our new update 6.1.0 available on CRAN. It was supposed to be a small patch but we got slightly ahead of ourselves. Special thanks to the mixOmics French’Oz developers, Dr Florian Rohart (University of Queensland, Brisbane) and Mr François Bartolo (Université de Toulouse, France), as well as several users who have been using our latest methods and reported bugs or suggested improvements on our bitbucket issue website.

Manuscripts and publication update

Rohart F., Matigian N., Eslami A., Bougeard S. and Lê Cao, K.-A. MINT: A multivariate integrative method to identify reproducible molecular signatures across independent experiments and platforms. Now available on bioRxiv!
Singh A, Gautier B, Shannon C, Vacher M, Rohart F, Tebbutt S, K-A. Lê Cao. DIABLO – multi-omics data integration for biomarker discovery. Manuscript available in bioRxiv.
K-A. Lê Cao*, ME Costello*, VA Lakis, F Bartolo, XY Chua, R Brazeilles, P Rondeau. (2016) MixMC: Multivariate insights into Microbial Communities. PLoS ONE 11(8): e0160169

List of changes in mixOmics 6.1.0 (in NEWS file)

In short,
– cimDIABLO argument ‘corThreshold’ replaced by ‘cutoff’
– new plots of tune and perf results now available
– tune function for block.splsda/DIABLO method
– auroc for supervised methods

New features:

1- auroc function applicable for (mint).(block).(s)plsda objects. AUC values also included in perf and tune functions (except mixDIABLO module)
2- tune.block.splsda function to choose the keepX parameters of block.splsda (a.k.a mixDIABLO)
3- plot for perf objects displays the classification error rate w.r.t components
4- plot for tune objects displays the classification error rate w.r.t keepX values (not implemented for tune.block.splsda)
5- multilevel function has been removed (as planned) as it is now included as an argument in other functions (see pca, pls, splsda, etc)

Enhancements:
1 – All tune functions (except for mixDIABLO/block.splsda module) include a ‘constraint’ argument to either build the model based on user input specific parameters (object$keepX.constraint) or based on the optimal parameter keepX determined by the tune function, see examples in help files.
2 – All perf functions (except for mixDIABLO/block.splsda module) now have a ‘constraint’ argument that allows the performance to be calculated either based on the number of parameters (object$keepX) defined in object or based on the variables selected on each component, see examples in help files.
3 – max.iter has been set to 100 to speed up computational time for all multivariate methods except pca/spca.
4 – cimDiablo: new arguments include transpose, row.names and col.names
5 – circosPlot: new arguments include var.names and comp. Argument ‘corThreshold’ has been replaced by ‘cutoff’.
6 – plotIndiv: new argument legend.title
7 – network function now supports block.spls(da) models and can plot more than 2 blocks
8 – PCA: new argument ilr.offset to be used only for ILR log transform in PCA (mixMC module)
9 – Legend added in plotDiablo, new argument legend.ncol

Bug fixes:
1 – plotIndiv and ellipse: plot ellipse for all groups with more than 1 sample
2 – predict function: argument multilevel added, log transform included
3 – Call to plsda.vip() from the RVAideMemoire package
4 – other small bugs as listed in our bitbucket issues, matching rgl package changes.

Sept 24-25 2015, Jouy-en-Josas, FR

Date: 24-25 September 2015, 9.30am – 5.30pm

Venue: INRA Jouy-en-Josas, Allée de Vilvert, 78352 Jouy-en-Josas, France

‘paper, pdf and slide great, excellent pedagogy‘

‘A lot of informations were brought and must be adapted on our complex data. Workshop was well organized and the speech was really clear even if some things were not easy to understand when we don’t know all the statistical terms. But Kim-Ahn was very available for more explanations. Thank you very much. A lot of new ideas for data treatments… :)‘