[BioC] MLSeq Mathematical Concepts

Wed Apr 23 14:36:33 CEST 2014

Dear Dario,

In our experiments on both simulated and real RNA-Seq data (under review
in Bioinformatics), we have found that DESeq normalization followed by the
vst transformation improves the performance of classifiers, mostly for SVM
and PLDA (Poisson linear discriminant analysis).
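
For readers unfamiliar with the normalization step mentioned above, here is a minimal sketch of median-of-ratios size factors, the idea behind DESeq normalization, in plain Python. It is not MLSeq or DESeq code: the function names and the toy counts are made up for illustration, and the subsequent vst step is not shown.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts[g][s] is the raw count of gene g in sample s (hypothetical layout).
    """
    n_samples = len(counts[0])
    # Use only genes with no zero counts, so every logarithm is defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of the per-gene geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / len(row) for row in usable]
    factors = []
    for s in range(n_samples):
        # Log ratio of each gene's count to its geometric mean, in sample s.
        log_ratios = [math.log(row[s]) - lg for row, lg in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

def normalize(counts):
    """Divide every count by the size factor of its sample."""
    sf = size_factors(counts)
    return [[c / f for c, f in zip(row, sf)] for row in counts]

# Toy data: sample 2 is sequenced about twice as deeply as sample 1.
toy = [[10, 20], [100, 200], [5, 10], [0, 3]]
sf = size_factors(toy)
normalized = normalize(toy)
print(sf)
print(normalized)
```

In this toy example the second size factor comes out twice as large as the first, so after division the first three genes have equal values in both samples.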

For the voom transformation, MLSeq currently uses the cpm values. The name
of this argument will be changed from “voom” to “voom-cpm”. Yes,
specialized classification and clustering algorithms are needed to combine
the cpm values and voom weights. For now, though, deseq+vst+traditional
classifiers and tmm+voom-cpm+traditional classifiers are the available
solutions for RNA-Seq based gene-expression classification.
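
As a rough illustration of the cpm values discussed above, the sketch below computes log2 counts per million with the offsets that limma's voom applies before it estimates weights: log2((count + 0.5) / (library size + 1) * 1e6). Whether MLSeq's option produces exactly these values is an assumption on my part; the function name, the `norm_factors` argument, and the toy counts are hypothetical, and the TMM factors themselves are not computed here.

```python
import math

def log_cpm(counts, norm_factors=None):
    """log2-cpm with voom-style offsets.

    counts[g][s] is the raw count of gene g in sample s.
    norm_factors holds one scaling factor per sample (for example from TMM);
    it defaults to 1.0 for every sample, which is an assumption of this sketch.
    """
    n_samples = len(counts[0])
    if norm_factors is None:
        norm_factors = [1.0] * n_samples
    # Effective library size: column sum times the normalization factor.
    lib_sizes = [
        sum(row[s] for row in counts) * norm_factors[s] for s in range(n_samples)
    ]
    return [
        [
            math.log2((row[s] + 0.5) / (lib_sizes[s] + 1.0) * 1e6)
            for s in range(n_samples)
        ]
        for row in counts
    ]

toy_counts = [[0, 5], [999, 1994]]
values = log_cpm(toy_counts)
print(values)
```

The 0.5 offset keeps zero counts finite on the log scale. Note that this sketch produces expression values only; the voom precision weights are a separate quantity and are not used here.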

Best,

Gokmen Zararsiz

On Apr 23, 2014, at 12:27 AM, Bernd Klaus <bernd.klaus at embl.de> wrote:

Dear Dario,

I think you are right to be careful about simply using the voom weights
to pre-transform the data. As Dr. Smyth pointed out a while ago, an
algorithm should always use these weights explicitly in some way rather
than using them to pre-transform the data.

You could possibly incorporate them easily in a DDA (diagonal discriminant
analysis) classifier, for example.
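
To make "using the weights explicitly" concrete, here is one conceivable sketch: a nearest-centroid rule with a diagonal covariance structure, in which per-observation precision weights enter both the centroid estimates and the distance. This is an illustration under my own assumptions (equal class priors, weights interpreted as inverse variances), not a published method and not how any existing package does it; all names and numbers are hypothetical.

```python
def weighted_centroids(x, w, labels):
    """Precision-weighted class means.

    x[s][g] is the expression of gene g in sample s, w[s][g] the matching
    weight (assumed to behave like an inverse variance), labels[s] the class.
    """
    n_genes = len(x[0])
    centroids = {}
    for k in sorted(set(labels)):
        rows = [i for i, lab in enumerate(labels) if lab == k]
        centroids[k] = [
            sum(w[i][g] * x[i][g] for i in rows) / sum(w[i][g] for i in rows)
            for g in range(n_genes)
        ]
    return centroids

def predict(x_new, w_new, centroids):
    """Assign the class whose centroid is closest in weighted squared distance."""
    def distance(k):
        return sum(
            wg * (xg - mg) ** 2
            for xg, wg, mg in zip(x_new, w_new, centroids[k])
        )
    return min(centroids, key=distance)

# Toy training data: two genes, two samples per class.
x_train = [[1.0, 5.0], [3.0, 5.2], [6.0, 1.0], [8.0, 1.2]]
w_train = [[3.0, 1.0], [1.0, 1.0], [1.0, 1.0], [3.0, 1.0]]
labels = ["A", "A", "B", "B"]
centroids = weighted_centroids(x_train, w_train, labels)

# The same new sample, classified under two different weight vectors.
x_new = [7.0, 5.0]
print(predict(x_new, [1.0, 1.0], centroids))
print(predict(x_new, [0.1, 10.0], centroids))
```

The two calls at the end use the same expression values but different weights, and the predicted class changes. That is the kind of information that is lost when the weights are folded into the data beforehand.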

Apart from Wolfgang's links, I might point you to two interesting papers:

[1] Zwiener et al. - Transforming RNA-Seq Data to Improve the Performance of
Prognostic Gene Signatures
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0085150

They investigate a couple of pretransformations for RNA-Seq data
classification and find that rank-based transformations perform well in
general. (They do not consider voom weights.)
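
As a small illustration of what a rank-based pretransformation can look like, the sketch below replaces each value within one sample by its rank, with ties sharing the average rank. This is only one simple variant and is not necessarily the one studied in the paper; the function name and example values are hypothetical.

```python
def rank_transform(values):
    """Replace each value by its 1-based rank; tied values get the average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the current run of tied values.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    return ranks

# Counts of four genes in a single sample.
sample_counts = [0, 250, 3, 3]
ranks = rank_transform(sample_counts)
print(ranks)
```

Because only the ordering within a sample is kept, the result does not depend on sequencing depth or on any monotone rescaling of the counts.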

[2] Gallopin et al. - A Hierarchical Poisson Log-Normal Model for Network
Inference from RNA Sequencing Data
http://www.plosone.org/article/info%3Adoi%2F10.1371%2Fjournal.pone.0077503

They use a GLMM combined with a lasso penalty to incorporate unequal
sample variances and then estimate a graphical model using a type of
partial correlation.

This is somewhat similar to the voom approach, except that the variances
and the model parameters are estimated in one go. However, they note that
the algorithm used is very slow.

Best wishes,

Bernd

On Apr 23, 2014, at 9:43 AM, Wolfgang Huber <whuber at embl.de> wrote:

> Dear Dario
> 
> good points, and as usual in machine learning, I don’t expect there to be a simple answer or universally best solution.
> For classification, the (pre)selection of features (genes) used is probably more important than most other choices, especially if the classification task is simple and can be driven by a few genes. For clustering, the same holds, plus the choice of distance metric or embedding.
> 
> That said, it is plausible that both the untransformed counts (or RPKMs etc.) and the log-transformed values have problems with high variance (at the upper or lower end of the dynamic range, respectively) that can be avoided with a different transformation, log-like for high values and linear-like for low ones (e.g. DESeq2’s vst, rlog). Paul McMurdie and Susan Holmes have some discussion of this in their waste-not-want-not paper [1], and Mike in a Supplement to the DESeq2 paper (draft) [2]. It would be interesting to collect more examples, and someone should probably study this more systematically (if they aren’t already).
> 
> Kind regards
> 	Wolfgang
> 
> 
> [1] http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.1003531
> [2] http://www-huber.embl.de/DESeq2paper -> Regularized logarithm for sample clustering (As of today, there is a version of 19 February, which I think will soon be updated with a more extensive survey.)
> 
> 
> 
> 
> On 23 Apr 2014, at 07:00, Dario Strbenac <dstr7320 at uni.sydney.edu.au> wrote:
> 
>> Hello,
>> 
>> From reading the vignette, MLSeq seems to be a set of wrapper functions that allows the user easy access to normalisation strategies in edgeR or DESeq and passes the data on to algorithms such as Support Vector Machine or Random Forest. Are there any results that demonstrate that normalisation improves classification performance? I am also not convinced by the description of using voom weights to transform the data. The author of voom stated that specialised clustering and classification algorithms are needed to handle the CPM and weights separately. Why does MLSeq use standard classification algorithms, and how were the weights and expression values combined?
>> 
>> --------------------------------------
>> Dario Strbenac
>> PhD Student
>> University of Sydney
>> Camperdown NSW 2050
>> Australia
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor