[BioC] Limma - RNA-Seq DE genes - Quantile normalized log transformed RPKM data

Gordon K Smyth smyth at wehi.EDU.AU
Sun Dec 4 01:38:18 CET 2011


Dear Avinash,

Briefly, you have used

  design <- model.matrix(~0+Group)

when you needed

  design <- model.matrix(~Group)

However, I do not recommend a limma analysis of RPKM values, because it 
does not respect the mean-variance relationship inherent in the count 
data.  Nicole Cloonan's excellent Nature Methods paper was written four 
years ago, and our understanding of the statistical analysis RNA-Seq data 
has moved on considerably since then.  Please see the last case study in 
the limma User's Guide, which analyses the Nigerian HapMap data, for how I 
recommend limma be used to analyse RNA-Seq data.

Best wishes
Gordon


> Date: Fri, 2 Dec 2011 10:31:26 -0600
> From: Avinash S <avins.s at googlemail.com>
> To: bioconductor at r-project.org
> Subject: [BioC] Limma - RNA-Seq DE genes - Quantile normalized log
> 	transformed RPKM data
>
> Hello All,
>
> I'm trying to understand the method of differential expression analysis
> described in :
> *
> *
> *Stem cell transcriptome profiling via massive-scale mRNA sequencing*
> *Nicole Cloonan et al*
> *NATURE METhODS | VOL.5 NO.7 | JULY 2008 | 613*
> http://www.nature.com/nmeth/journal/v5/n7/abs/nmeth.1223.html
> *Section: Statistical Analysis *
> To calculate differential expression of SQRL tag data we analyzed the
> normalized gene signals (tags per Refseq transcript, length-normalized) for
> each library using the Limma package in R. After Quantile normalization, we
> used Limma to fit a linear model to the log2-transformed data using an
> empirical Bayes method32 to moderate standard errors. Differentially
> expressed genes were defined as those with a B statistic > zero.
>
> I have quantile normalized log2-transformed RPKM data and I wanted to find
> DE genes based on B statistic and log2foldchange.
> *Sample Rawdata:*
>
> GeneModel CON1 CON2 TR1 TR2
> 1s00200.1 2.723945276 3.775939811 3.623211245 3.717795434
> 1s00210.1 4.999354495 3.738129253 3.268468778 3.822220668
> 1s00220.1 1.450861588 1.265013193 0.942706046 0.744551693
>
> I'm using the following R-script:
>
> library(limma)
> raw.data <-
> read.delim("INPUT-QuantileNormalizedLog2Transformed-RPKM-Data.txt")
> attach(raw.data)
> names(raw.data)
> d <- raw.data[, 2:5]
> rownames(d) <- raw.data[, 1]
> Group <- factor(c("CON","CON","MYC","MYC"), levels=c("CON","MYC"))
> design <- model.matrix(~0+Group)
> colnames(design) <- c("CON","MYC")
> fit <- lmFit(d, design)
> fit <- eBayes(fit)
> *Warning message:*
> *Zero sample variances detected, have been offset*
> topTable(fit,coef=2,number=1000,genelist=fit$genes)
> - THIS LISTS THE GENES ALL POSITIVE LOGFC VALUES - NON -NEGATIVE.
>
> R version 2.14.0 (2011-10-31)
> Platform: i386-apple-darwin9.8.0/i386 (32-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] limma_3.10.0
>
>
> Can someone please let me know what I'm doing wrong here. Is it in the
> array design or input data?
>
> Thank you,
> Avinash

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list