[BioC] edgeR problem

Wed Feb 27 11:14:50 CET 2013

Hi Nima,

The head of your data frame shows that you have both integer and non-integer values (e.g. the last sample of comp36723_c0_seq1). These non-integer values are causing the bad estimate of the BCV as well as the warnings. You can check how many non-integer values you have with, e.g.:

table(floor(count.matrix)==count.matrix)
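
To see where the fractional values sit, the same check can be broken down per sample and per transcript. This is only a minimal sketch: the toy data frame below is made up from two rows of your header (first three samples only) so that the snippet runs on its own, and the names `m` and `non.integer` are just illustrative.

```r
# Toy data: two transcripts copied from the posted header, first three samples only.
count.matrix <- data.frame(
  ERR162262 = c(43.00, 83.67),
  ERR162225 = c(71.00, 79.02),
  ERR162226 = c(44.21, 38.28),
  row.names = c("Contig8320", "comp36723_c0_seq1-len=635")
)

m <- as.matrix(count.matrix)
non.integer <- m != floor(m)           # TRUE where a value has a fractional part

table(non.integer)                     # overall tally of integer vs non-integer values
colSums(non.integer)                   # number of non-integer values per sample
rownames(m)[rowSums(non.integer) > 0]  # transcripts with at least one fractional value
```

On your real data you would skip the toy data frame and start from the count.matrix you read in with read.table.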

Trinity is supposed to play well with edgeR (see [1]). How did you run Trinity?

[1] http://trinityrnaseq.sourceforge.net/analysis/diff_expression_analysis.html
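
For reference, here is a rough sketch of what feeding integer counts to edgeR could look like with your design. Several things here are assumptions on my side: the count matrix is simulated (I do not have your data), and rounding the fractional values is only one possible workaround, not something I have verified for your Trinity output. The `sex` and `grp` factors are the ones from your commands.

```r
library(edgeR)

set.seed(1)
# Simulated stand-in for the raw count matrix: 200 transcripts x 12 samples,
# with fractional values to mimic the non-integer entries in the posted header.
raw <- matrix(rnbinom(200 * 12, mu = 50, size = 10) + runif(200 * 12),
              nrow = 200,
              dimnames = list(paste0("tx", 1:200), paste0("S", 1:12)))

sex <- factor(c("M","F","M","F","M","F","M","F","M","F","F","M"))
grp <- factor(c("W","W","W","W","W","W","D","D","D","D","D","D"))

counts <- round(raw)                     # assumption: rounding is acceptable here
stopifnot(all(counts == floor(counts)))  # every value is now an integer

y <- DGEList(counts = counts, group = grp)
y <- calcNormFactors(y)                  # TMM factors computed by edgeR itself
design <- model.matrix(~ grp + sex)
rownames(design) <- colnames(y)
y <- estimateGLMCommonDisp(y, design, verbose = TRUE)
sqrt(y$common.dispersion)                # BCV is the square root of the dispersion
```

The point of the sketch is that normalization happens inside edgeR (calcNormFactors) on the counts, rather than passing in values that were already TMM-normalized and converted to FPKM.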

On Feb 26, 2013, at 11:55 PM, Nima Rafati <nimarafati at gmail.com> wrote:

> Dear Hayssam,
> 
> Thanks for your reply. I used Trinity, following the instructions on their website:
> 1- I generated a matrix of counts from all 12 samples.
> 2- "Using the counts.matrix file created above, perform TMM normalization and generate the FPKM values per transcript and sample as follows:
> $TRINITY_HOME/Analysis/DifferentialExpression/run_TMM_normalization_write_FPKM_matrix.pl --matrix counts.matrix --transcript_lengths feature_lengths.txt" (from Trinity website).
> 3- Then I followed the code from the edgeR manual and ended up with the high values that I posted.
> 
> BUT I also tried the original count.matrix (raw count data), without the correction applied by the aforesaid script, and received the same dispersion and BCV values.
> Here is the header of my count.matrix:
>         ERR162262       ERR162225       ERR162226       ERR162215       ERR162243       ERR162235       ERR162224       ERR162219       ERR162218       ERR162266       ERR162239       ERR162263
> Contig8320	43.00	71.00	44.21	39.00	35.00	25.00	18.00	19.92	28.00	28.00	7.00	37.00
> comp28560_c2_seq1-len=504       239.00  231.00  239.00  214.00  223.00  155.00  211.00  203.00  212.00  225.00  11.00   294.00
> comp36723_c0_seq1-len=635       83.67   79.02   38.28   52.13   72.07   27.00   46.88   55.23   51.12   46.00   24.50   63.12
> comp24326_c0_seq2-len=1093      18.00   16.00   9.00    23.00   30.00   18.00   30.00   28.00   12.00   17.00   70.00   42.00
> 
> You also asked about the warnings:
> Warning messages:
> 1: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>   non-integer x = 0.940000
> 2: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>   non-integer x = 0.500000
> 3: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>   non-integer x = 0.010000
> 4: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>   non-integer x = 0.410000
> 5: In dnbinom(y, size = 1/dispersion, mu = mu, log = TRUE) :
>   non-integer x = 1.740000
> 
> I appreciate your help,
> Regards,
> Nima
> 
> On Tue, Feb 26, 2013 at 11:01 PM, Hayssam Soueidan <h.soueidan at nki.nl> wrote:
> Hi Nima,
> 
> 
> I have never had such a high value for the BCV. In my analyses (mouse and human RNA-Seq), the BCV is usually way below 1. From the name of your data file, it looks like you have TMM-normalized FPKM data. edgeR expects raw count data (integers). That might be causing problems. Could you provide the head of your data.TMM data.frame?
> Further, could you 1) provide the output of sessionInfo() and 2) provide some of the warnings?
> 
> Regards,
> Sam.
> 
> On Feb 26, 2013, at 4:39 PM, Nima Rafati <nimarafati at gmail.com> wrote:
> 
> > Dear all,
> >
> > I have RNA-seq libraries of 12 individuals in two groups (6 replicates
> > each). I would like to do a differential expression analysis using a GLM with
> > the effects of group and sex on the transcripts. I followed the manual, and in
> > the last step, the calculation of dispersion (estimateGLMCommonDisp), I received
> > a high value with warnings. Here are all the commands that I used:
> >
> > data.TMM<-read.table("Mod-H-transcripts.ount.matrix.TMM_normalized.FPKM",row.names=1,header=T)
> > sex<-factor(c("M","F","M","F","M","F","M","F","M","F","F","M"))
> > grp<-factor(c("W","W","W","W","W","W","D","D","D","D","D","D"))
> > y.TMM<-DGEList(count=data.TMM.new,group=group.D.W)
> > data.frame(Sample=colnames(y.TMM),grp,sex)
> > design<-model.matrix(~grp+sex)
> > rownames(design)<-colnames(y.TMM)
> > y.TMM <- estimateGLMCommonDisp(y.TMM, design, verbose=TRUE)
> >
> > Disp = 3.99994 , BCV = 2
> > There were 50 or more warnings (use warnings() to see the first 50)
> >
> > Despite the warnings, is the estimated dispersion reliable? Can I continue with
> > the analysis?
> > Best regards,
> > Nima