[BioC] edgeR: what to do with no replicates

Gordon K Smyth smyth at wehi.EDU.AU
Wed Nov 23 02:12:29 CET 2011


Dear Bogdan,

What I suggest is biological replicates!

The edgeR analysis without replicates (with robust=TRUE) is just our 
attempt to offer something better than nothing in a bad situation.  We 
haven't published this methodology in a refereed journal and we don't use 
it in our own biological research.  It hasn't been tested extensively, so 
you should use at your own risk.  How it will go on different technologies 
remains to be seen.

Regarding your question about low count tags, I can tell you for certain 
that no valid statistical test will ever be able to find a statistically 
significant difference for counts of 1 in one group vs 3 in another, or 4 
in one group vs 6 in another.  Theoretically, the smallest possible count 
difference that could possibly yield a two-sided p-value less than 0.05 is 
0 vs 6, even if the dispersion was zero and even if no adjustment is made 
for multiple testing.  This is because the variability in the data can 
never be less than the Poisson variability inherence in the sequencing 
technology.  When the inter-library dispersion is nonzero, as it always 
is, much larger differences may be required.  When adjustment is made for 
genome-wide testing, much larger differences are required.

If you have a large number of genome locations that consistently show a 
small but non statistically significant difference, then you might obtain 
a statistically significant result by pooling locations, or else by doing 
a gene set test using this locations.  (For the gene set test possibility, 
see the last case study of the limma package.)  I don't understand enough 
about your scientific problem to know whether these approaches would be 
appropriate for your particular study.

Best wishes
Gordon

On Tue, 22 Nov 2011, Bogdan Tanasa wrote:

> Dear Gordon,
>
> thanks a lot, it would really help us. Just a tiny additional question:
> edgeR analysis on samples with no replicates shall have the same
> statistical validity on either CHIP-seq, GRO-seq, or RNA-seq data ?
>
> And, if we deal for instance with 20 000 genome locations in the genome (eg
> ALU repeats) that have very small number of tags, or very small difference
> between tag counts in various samples (let's say 1 tag in "-" sample and
> 2-3 tags in "+" sample, or 5-6 tags in "-" sample and 4-5 tags in "+"
> sample), is there any appropriate statistical model we can use for
> assessing the differential expression. What would you suggest ?
>
> thanks,
>
> Bogdan
>
>
> On Tue, Nov 22, 2011 at 4:20 PM, Gordon K Smyth <smyth at wehi.edu.au> wrote:
>
>> Dear Bogdan,
>>
>> The advice in the edgeR User's Guide about using estimateGLMCommonDisp()
>> with method="deviance",robust="**TRUE",subset=NULL is intended to be used
>> without a design matrix, or in conjunction with the immediately preceding
>> advice about removing terms from the design matrix.
>>
>> I have now rewritten the advice in the User's Guide a little to make this
>> more explicit.
>>
>> Best wishes
>> Gordon
>>
>>  Message: 2
>>> Date: Mon, 21 Nov 2011 12:41:05 -0800
>>> From: Bogdan Tanasa <tanasa at gmail.com>
>>> To: Mark Robinson <mark.robinson at imls.uzh.ch>
>>> Cc: bioconductor at stat.math.ethz.ch
>>> Subject: Re: [BioC] edgeR
>>>
>>> Hi Mark,
>>>
>>> I have two samples "Y" and "S" (with no replicates), and following edgeR
>>> manual on "what to do if you do not have replicates", I am getting the
>>> following results below. The end message is "No residual df: cannot
>>> estimate dispersion". Please could you let me know what is going wrong ..
>>> Thanks a lot !
>>>
>>>  raw.data <- read.delim("file")
>>>> d <- raw.data[,2:3]
>>>> rownames(d) <- raw.data[,1]
>>>>
>>>
>>>  group <- factor(c("Y","S"))
>>>> design <- model.matrix(~group)
>>>> d <- DGEList(counts = d, group=group)
>>>>
>>> Calculating library sizes from column totals.
>>>
>>>> dim(d)
>>>>
>>> [1] 4013    2
>>>
>>>> d <- calcNormFactors(d)
>>>> d
>>>>
>>> An object of class "DGEList"
>>> $samples
>>>  group lib.size norm.factors
>>> Y     Y   628062     1.049011
>>> S     S   422542     0.953279
>>>
>>> $counts
>>>                        Y   S
>>> chr8:53670099-53691880 132 109
>>> chr8:71033673-71122126 221 107
>>> chr8:74069636-74074658   7   1
>>> chr6:72172478-72187779 203 114
>>> chr6:72096548-72115900  36  15
>>> 4008 more rows ...
>>>
>>> $all.zeros
>>> chr8:53670099-53691880 chr8:71033673-71122126 chr8:74069636-74074658
>>>                FALSE                  FALSE                  FALSE
>>> chr6:72172478-72187779 chr6:72096548-72115900
>>>                FALSE                  FALSE
>>> 4008 more elements ...
>>>
>>>
>>>>  d<-estimateGLMCommonDisp(d,**design,method="deviance",**
>>> robust="TRUE",subset=NULL)
>>> Warning message:
>>> In estimateGLMCommonDisp.default(**y = y$counts, design = design,  :
>>>  No residual df: cannot estimate dispersion
>>>
>>> For Common Dispersion (no GLM modeling):
>>>
>>>  d<-estimateCommonDisp(d,**design)
>>>>
>>> Warning message:
>>> In estimateCommonDisp(d, design) :
>>>  There is no replication.  Setting common dispersion to 0.
>>>
>>>> d
>>>>
>>> An object of class "DGEList"
>>> $samples
>>>  group lib.size norm.factors
>>> Y     Y   628062     1.049011
>>> S     S   422542     0.953279
>>>
>>> $counts
>>>                        Y   S
>>> chr8:53670099-53691880 132 109
>>> chr8:71033673-71122126 221 107
>>> chr8:74069636-74074658   7   1
>>> chr6:72172478-72187779 203 114
>>> chr6:72096548-72115900  36  15
>>> 4008 more rows ...
>>>
>>> $all.zeros
>>> chr8:53670099-53691880 chr8:71033673-71122126 chr8:74069636-74074658
>>>                FALSE                  FALSE                  FALSE
>>> chr6:72172478-72187779 chr6:72096548-72115900
>>>                FALSE                  FALSE
>>> 4008 more elements ...
>>>
>>> $common.dispersion
>>> [1] 1e-16
>>>
>>> $pseudo.alt
>>>                               Y         S
>>> chr8:53670099-53691880 103.192083 139.42506
>>> chr8:71033673-71122126 172.781586 136.86720
>>> chr8:74069636-74074658   5.453794   1.30187
>>> chr6:72172478-72187779 158.707305 145.81970
>>> chr6:72096548-72115900  28.129216  19.20588
>>> 4008 more rows ...
>>>
>>> $conc
>>> $conc.common
>>> chr8:53670099-53691880 chr8:71033673-71122126 chr8:74069636-74074658
>>>         2.270073e-04           3.089534e-04           7.535477e-06
>>> chr6:72172478-72187779 chr6:72096548-72115900
>>>         2.985930e-04           4.803864e-05
>>> 4008 more elements ...
>>>
>>> $conc.group
>>>                                 S            Y
>>> chr8:53670099-53691880 2.706055e-04 2.003510e-04
>>> chr8:71033673-71122126 2.656403e-04 3.354361e-04
>>> chr8:74069636-74074658 2.482619e-06 1.062467e-05
>>> chr6:72172478-72187779 2.830186e-04 3.081155e-04
>>> chr6:72096548-72115900 3.723929e-05 5.464117e-05
>>> 4008 more rows ...
>>>
>>>
>>> $common.lib.size
>>> $pseudo
>>>                               Y         S
>>> chr8:53670099-53691880 103.192083 139.42506
>>> chr8:71033673-71122126 172.781586 136.86720
>>> chr8:74069636-74074658   5.453794   1.30187
>>> chr6:72172478-72187779 158.707305 145.81970
>>> chr6:72096548-72115900  28.129216  19.20588
>>> 4008 more rows ...
>>>
>>> $conc
>>> $conc.common
>>> chr8:53670099-53691880 chr8:71033673-71122126 chr8:74069636-74074658
>>>         2.270073e-04           3.089534e-04           7.535477e-06
>>> chr6:72172478-72187779 chr6:72096548-72115900
>>>         2.985930e-04           4.803864e-05
>>> 4008 more elements ...
>>>
>>> $conc.group
>>>                                 S            Y
>>> chr8:53670099-53691880 2.706055e-04 2.003510e-04
>>> chr8:71033673-71122126 2.656403e-04 3.354361e-04
>>> chr8:74069636-74074658 2.482619e-06 1.062467e-05
>>> chr6:72172478-72187779 2.830186e-04 3.081155e-04
>>> chr6:72096548-72115900 3.723929e-05 5.464117e-05
>>> 4008 more rows ...
>>>
>>>
>>> $N
>>> [1] 515153
>>>
>>>
>>> On Fri, Nov 18, 2011 at 12:39 PM, Mark Robinson
>>> <mark.robinson at imls.uzh.ch>**wrote:
>>>
>>>
>>>>
>>>> On 18.11.2011, at 21:27, Bogdan Tanasa wrote:
>>>>
>>>>  Dear all,
>>>>>
>>>>> are the functions "estimateGLMCommonDisp" and "estimateGLMTagwiseDisp"
>>>>> available in edgeR on any platform ? I am using it on Linux/Ubuntu and
>>>>> apparently, these functiosn are not available.
>>>>>
>>>>
>>>>
>>>> What version of R/edgeR are you using?  Perhaps you have an old version?
>>>>  These functions have been around for awhile.  What does sessionInfo()
>>>> give?
>>>>
>>>> Mark
>>>>
>>>>
>>>>
>>>>
>>>>> thanks,
>>>>>
>>>>> Bogdan

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list