[BioC] RNAseq machine learning classifier

jhua at tgen.org jhua at tgen.org
Wed Jul 17 20:16:11 CEST 2013


This sounds like an OK approach to me.

One thing to keep in mind is that classifier design usually involves independent validation data.  If you are going to validate your classifier on the same type of RNAseq data, you generally need to normalize/variance-stabilize all of the samples, training and validation, together as one cohort.  In practice, though, the validation data are sometimes not collected until I have reported really positive results on the training data alone, which means another round of full normalization, training, and testing...
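
As a rough sketch of what "one cohort" means in practice with DESeq2, something like the following could work.  The simulated counts from makeExampleDESeqDataSet() and the 8/4 split (is_train) are stand-ins I made up for illustration, not anything from a real study:

```r
library(DESeq2)

# Simulated counts stand in for real data: 1000 genes x 12 samples.
set.seed(1)
dds <- makeExampleDESeqDataSet(n = 1000, m = 12)

# Hypothetical split: first 8 samples for training, last 4 for validation.
is_train <- seq_len(ncol(dds)) <= 8

# Transform the whole cohort in one pass, blind to the experimental design,
# so training and validation samples end up on the same scale.
vsd <- varianceStabilizingTransformation(dds, blind = TRUE)
expr <- assay(vsd)  # genes x samples matrix

# Classifiers usually expect samples in rows, so transpose before splitting.
x_train <- t(expr[, is_train])
x_valid <- t(expr[, !is_train])
y_train <- colData(dds)$condition[is_train]
```

x_train and y_train would then go to whatever classifier you prefer, with x_valid held back for the final check.  If validation samples arrive later, the transformation has to be rerun on the enlarged cohort, which is the extra round mentioned above.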

Jianping Hua, Ph. D.
Research Assistant Professor
Computational Biology Division
Translational Genomics Research Institute (TGen)



> 
> Steve!
> 
> I was thinking along these same lines: estimating dispersions then using a
> variance stabilizing transformation. However, I am not sure how proper this
> approach is?
> 
> Can anyone confirm the validity of this approach?
> 
> Michael
> 
> 
> On Mon, Jul 15, 2013 at 3:58 PM, Steve Lianoglou
> <lianoglou.steve at gene.com> wrote:
> 
>> Hi,
>> 
>> On Mon, Jul 15, 2013 at 2:42 PM, Michael Breen
>> <breenbioinformatics at gmail.com> wrote:
>>> Hi all,
>>> We have a large RNAseq data set. Apart from identifying differentially
>>> expressed genes with these data we are also interested in classification in
>>> terms of developing a prognostic and diagnostic classifier.
>>> 
>>> Normally, our approach would utilize a machine learning classifier, such as
>>> an SVM, and typically proceed with a nested cross-validation approach.
>>> 
>>> 
>>> The vast majority of these programs and packages have been designed
>>> utilizing microarray data.
>>> 
>>> Are there any reasonable biases which one should consider before using such
>>> already published approaches on RNAseq data?
>>> 
>>> Do the distributions of the different data types matter at all?
>>> 
>>> If so, does an application exist using an SVM taking into consideration
>>> RNAseq raw counts?
>> 
>> One approach would be to take the output from one of the variance
>> stabilizing transformations in DESeq2 as the input to your machine
>> learning approach.
>> 
>> See:
>> 
>> R> library(DESeq2)
>> R> ?varianceStabilizingTransformation
>> 
>> and Section 7 of the DESeq2 vignette (count data transformations):
>> 
>> 
>> http://bioconductor.org/packages/release/bioc/vignettes/DESeq2/inst/doc/DESeq2.pdf
>> 
>> HTH,
>> -steve
>> 
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
>> 


