[BioC] Problems with edgeR for differential expression

Gordon K Smyth smyth at wehi.EDU.AU
Sat Oct 8 03:02:31 CEST 2011


Hi Nick,

We haven't seen anything like this.  Here are some suggestions.

Given your experimental layout with multiple groups but no covariates, you 
could use the "classic" edgeR functions:

   estimateCommonDisp(y)
   exactTest()
   topTags()

rather than the glm code.  This would allow you to compare any two of your 
groups.  What results does this give?

Can you try updating to edgeR 2.3.52 (the current devel version) to see it 
makes a difference?  (There are many changes, but one is that exactTest() 
is much faster now.)

Thanks for including output from your objects, but I find that I can't get 
much information from the str(y) output.  Could you please use the show 
method for these objects, e.g., show(y)?

Finally, would you be willing to share some of your date with us offline 
so that we can trouble-shoot?  As I said, we haven't seen this behaviour.

Best wishes
Gordon

> Date: Fri, 7 Oct 2011 10:10:21 +0100
> From: Nick Schurch <N.Schurch at dundee.ac.uk>
> To: bioconductor at r-project.org
> Subject: [BioC] Problems with edgeR for differential expression.
>
> Dear Bioconductors,
>
> I'm trying to use edgeR to compute the differential expression from some 
> RNA-Seq data, but it seems to be generating some odd results.
>
> I'm running edgeR on data on about 6000 genes, across 5 experimental 
> conditions. When I compute the differential expression for any two of 
> these conditions edgeR it is returning a 'nan' fold-change for about 
> 1500 out of the 6000 genes being tested. Amazingly, it is also returning 
> p-values for these fold-changes, and the p-values cover a range of 
> values from total insignificant (>0.5) to very significant (<1E-4)! At 
> first I thought if might be because there is sometimes no signal for a 
> gene in a given condition, but 1) other cases of this nature produce 
> '-inf' or 'inf' fold-changes, not 'nan' fold-hanges, and 2) in some 
> cases edgeR is calculating a 'nan' fold-change for something that has 
> signal in every replicate of both conditions! I've checked everything I 
> can think of... I don't get any errors or warnings, the Norm Factors are 
> sensible, the raw signal is sensible, all the objects look well-formed 
> (i.e. like they contain all the bits they should contain).... its very 
> confusing and frustrating.
>
> So, am I doing something wrong? Or is there a deeper problem with this
> package?
>
> I'm using R v 2.13.1, edgeR 2.2.5 and the commands I'm using are:
>
>> # create the design matrix
>> model.groups<-groups
>> model.factors<-as.factor(model.groups)
>> model<-model.matrix(~model.factors)
>>
>> # build DGElist and calculate normalization factors
>> x=as.data.frame(data)
>> rownames(x)=genenames
>> y=DGEList(x,group=groups)
>> y=calcNormFactors(y)
>> str(y)
>
> Formal class 'DGEList' [package "edgeR"] with 1 slots
>  ..@ .Data:List of 3
>  .. ..$ :'data.frame': 17 obs. of  3 variables:
>  .. .. ..$ group       : Factor w/ 5 levels "cond00","cond06",..: 3 3 3 4 4
> 4 3 3 1 1 ...
>  .. .. ..$ lib.size    : num [1:17] 1866963 2364994 1712838 2782920 2780054
> ...
>  .. .. ..$ norm.factors: num [1:17] 0.992 0.978 1.002 1.036 0.972 ...
>  .. ..$ : num [1:6351, 1:17] 7 187 44 0 98 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:17] "cond07.rep0020" "cond07.rep0021"
> "cond07.rep0022" "cond08.rep0023" ...
>  .. ..$ : Named logi [1:6351] FALSE FALSE FALSE FALSE FALSE FALSE ...
>  .. .. ..- attr(*, "names")= chr [1:6351] "38536" "38537" "38538" "38539"
> ...
>
>>
>> # estimate dispersion and fit models
>> z=estimateGLMCommonDisp(y, design))
>> str(z)
>
> Formal class 'DGEList' [package "edgeR"] with 1 slots
>  ..@ .Data:List of 4
>  .. ..$ :'data.frame': 17 obs. of  3 variables:
>  .. .. ..$ group       : Factor w/ 5 levels "cond00","cond06",..: 3 3 3 4 4
> 4 3 3 1 1 ...
>  .. .. ..$ lib.size    : num [1:17] 1866963 2364994 1712838 2782920 2780054
> ...
>  .. .. ..$ norm.factors: num [1:17] 0.992 0.978 1.002 1.036 0.972 ...
>  .. ..$ : num [1:6351, 1:17] 7 187 44 0 98 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:17] "cond07.rep0020" "cond07.rep0021"
> "cond07.rep0022" "cond08.rep0023" ...
>  .. ..$ : Named logi [1:6351] FALSE FALSE FALSE FALSE FALSE FALSE ...
>  .. .. ..- attr(*, "names")= chr [1:6351] "38536" "38537" "38538" "38539"
> ...
>  .. ..$ : num 0.101
>
>>
>> fit<-glmFit(z,model,dispersion=z$common.dispersion)
>> str(fit)
>
> Formal class 'DGEGLM' [package "edgeR"] with 1 slots
>  ..@ .Data:List of 12
>  .. ..$ : num [1:6351, 1:5] -12.89 -9.22 -10.72 -11.54 -9.56 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:5] "(Intercept)" "model.factorscond06"
> "model.factorscond07" "model.factorscond08" ...
>  .. ..$ : int [1:6351] 12 12 12 12 12 12 12 12 12 12 ...
>  .. ..$ : Named num [1:6351] 23.3 3.45 30.57 25.89 3.07 ...
>  .. .. ..- attr(*, "names")= chr [1:6351] "38536" "38537" "38538" "38539"
> ...
>  .. ..$ : num [1:17, 1:5] 1 1 1 1 1 1 1 1 1 1 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:17] "1" "2" "3" "4" ...
>  .. .. .. ..$ : chr [1:5] "(Intercept)" "model.factorscond06"
> "model.factorscond07" "model.factorscond08" ...
>  .. .. ..- attr(*, "assign")= int [1:5] 0 1 1 1 1
>  .. .. ..- attr(*, "contrasts")=List of 1
>  .. .. .. ..$ model.factors: chr "contr.treatment"
>  .. ..$ : num [1:6351, 1:17] 14.4 14.4 14.4 14.4 14.4 ...
>  .. ..$ :'data.frame': 17 obs. of  3 variables:
>  .. .. ..$ group       : Factor w/ 5 levels "cond00","cond06",..: 3 3 3 4 4
> 4 3 3 1 1 ...
>  .. .. ..$ lib.size    : num [1:17] 1866963 2364994 1712838 2782920 2780054
> ...
>  .. .. ..$ norm.factors: num [1:17] 0.992 0.978 1.002 1.036 0.972 ...
>  .. ..$ : NULL
>  .. ..$ : num 0.101
>  .. ..$ : num [1:17] 1851365 2313681 1716797 2882487 2703330 ...
>  .. ..$ : NULL
>  .. ..$ : num [1:6351, 1:17] 6.32 160.53 59.13 12.7 105.68 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:17] "cond07.rep0020" "cond07.rep0021"
> "cond07.rep0022" "cond08.rep0023" ...
>  .. ..$ : Named num [1:6351] -12.67 -9.33 -10.36 -11.66 -9.64 ...
>  .. .. ..- attr(*, "names")= chr [1:6351] "38536" "38537" "38538" "38539"
> ...
>
>>
>> # liklihood ratio statistics
>> results=glmLRT(z, fit, coef = 4)
>> str(results)
>
> Formal class 'DGELRT' [package "edgeR"] with 1 slots
>  ..@ .Data:List of 10
>  .. ..$ :'data.frame': 17 obs. of  3 variables:
>  .. .. ..$ group       : Factor w/ 5 levels "cond00","cond06",..: 3 3 3 4 4
> 4 3 3 1 1 ...
>  .. .. ..$ lib.size    : num [1:17] 1866963 2364994 1712838 2782920 2780054
> ...
>  .. .. ..$ norm.factors: num [1:17] 0.992 0.978 1.002 1.036 0.972 ...
>  .. ..$ : Named logi [1:6351] FALSE FALSE FALSE FALSE FALSE FALSE ...
>  .. .. ..- attr(*, "names")= chr [1:6351] "38536" "38537" "38538" "38539"
> ...
>  .. ..$ : num 0.101
>  .. ..$ :'data.frame': 6351 obs. of  4 variables:
>  .. .. ..$ logConc     : num [1:6351] -12.67 -9.33 -10.36 -11.66 -9.64 ...
>  .. .. ..$ logFC       : num [1:6351] 0.7057 -0.0157 0.5583 -0.3311 -0.1913
> ...
>  .. .. ..$ LR.statistic: num [1:6351] 1.38671 0.00165 1.83369 0.46732
> 0.24042 ...
>  .. .. ..$ p.value     : num [1:6351] 0.239 0.968 0.176 0.494 0.624 ...
>  .. ..$ : num [1:6351, 1:5] -12.89 -9.22 -10.72 -11.54 -9.56 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:5] "(Intercept)" "model.factorscond06"
> "model.factorscond07" "model.factorscond08" ...
>  .. ..$ : num [1:6351, 1:4] -12.63 -9.22 -10.52 -11.63 -9.62 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:6351] "38536" "38537" "38538" "38539" ...
>  .. .. .. ..$ : chr [1:4] "(Intercept)" "model.factorscond07"
> "model.factorscond08" "model.factorscond10"
>  .. ..$ : num [1:17, 1:5] 1 1 1 1 1 1 1 1 1 1 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:17] "1" "2" "3" "4" ...
>  .. .. .. ..$ : chr [1:5] "(Intercept)" "model.factorscond06"
> "model.factorscond07" "model.factorscond08" ...
>  .. .. ..- attr(*, "assign")= int [1:5] 0 1 1 1 1
>  .. .. ..- attr(*, "contrasts")=List of 1
>  .. .. .. ..$ model.factors: chr "contr.treatment"
>  .. ..$ : num [1:17, 1:4] 1 1 1 1 1 1 1 1 1 1 ...
>  .. .. ..- attr(*, "dimnames")=List of 2
>  .. .. .. ..$ : chr [1:17] "1" "2" "3" "4" ...
>  .. .. .. ..$ : chr [1:4] "(Intercept)" "model.factorscond07"
> "model.factorscond08" "model.factorscond10"
>  .. ..$ : num 0.101
>  .. ..$ : chr "model.factorscond06"
>
> These are some examples of the results:
>
> --------------------
> edgeR out:
>
> LR.statistic:  0.975075487675
> logConc:  -13.3931239073
> p.value:  0.323417620577
> logFC:  NaN
> geneID:  38956
>
> Raw data:
>
> Condition 1:
> replicate 1: 2,372,532 good reads, female: 8.0
> replicate 2: 3,966,968 good reads, female: no signal
> replicate 3: 1,389,571 good reads, male: no signal
>
> Condition 2:
> replicate 1: 3,102,608 good reads, male: 5.0
> replicate 2: 3,451,983 good reads, male: no signal
> replicate 3: 2,892,192 good reads, male: no signal
> replicate 4: 3,620,124 good reads, male: 5.0
> replicate 5: 2,640,968 good reads, female: no signal
> --------------------
>
> --------------------
> edgeR out:
>
> LR.statistic:  3.57045101322
> logConc:  -13.6684814046
> p.value:  0.0588163345523
> logFC:  NaN
> geneID:  38959
>
> Raw data:
>
> Condition 1:
> replicate 1: 2,372,532 good reads, female: 5.0
> replicate 2: 3,966,968 good reads, female: 5.0
> replicate 3: 1,389,571 good reads, male: no signal
>
> Condition 2:
>
> replicate 1: 3,102,608 good reads, male: no signal
> replicate 2: 3,451,983 good reads, male: no signal
> replicate 3: 2,892,192 good reads, male: 7.0
> replicate 4: 3,620,124 good reads, male: no signal
> replicate 5: 2,640,968 good reads, female: no signal
> --------------------
>
> --------------------
> edgeR out:
>
>
> LR.statistic:  1.81602638091
> logConc:  -11.265720531
> p.value:  0.177786996458
> logFC:  NaN
> geneID:  38965
>
> Raw data:
>
> Condition 1:
>
> replicate 1: 2,372,532 good reads, female: 22.0
> replicate 2: 3,966,968 good reads, female: 27.0
> replicate 3: 1,389,571 good reads, male: 5.0
>
> Condition 2:
>
> replicate 1: 3,102,608 good reads, male: 26.0
> replicate 2: 3,451,983 good reads, male: 26.0
> replicate 3: 2,892,192 good reads, male: 9.0
> replicate 4: 3,620,124 good reads, male: 36.0
> replicate 5: 2,640,968 good reads, female: 55.0
> --------------------
>
>
> -- 
> Cheers,
>
> Nick Schurch
>
> Data Analysis Group (The Barton Group),
> School of Life Sciences,
> University of Dundee,
> Dow St,
> Dundee,
> DD1 5EH,
> Scotland,
> UK
>
> Tel: +44 1382 388707
> Fax: +44 1382 345 893

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list