[BioC] edgeR: common vs tagwise dispersion

Wed Apr 21 12:32:09 CEST 2010

Hi Ann.

See comments below.

On 2010-04-21, at 12:18 AM, Ann Hess wrote:

> I am using edgeR to look for differentially abundant “segments”
> between two groups (data generated using high throughput sequencing).
> I have 3 (pooled) biological reps per group and a total of 18760
> segments (83 rows with zero count are removed by edger).
> 
> As a first approach, I used the common dispersion method and found the
> estimated common dispersion to be 0.135.

The common dispersion value shows substantial biological variation between your replicates, with a CV of nearly 40% in expression levels between samples.  This is typical of what we have observed when the biological replicates are separate individuals or animals.

> After looking at the top 10
> segments, I find that there tends to be a single large value
> (different for each segment) that is bringing up the logFC.

We've seen this in other datasets also.  It seems to be a type of biological variation that RNA-seq makes very evident.

> I tried using moderated tagwise dispersion (using prior.n=50 and 25)
> and found that the results are largely the same as common dispersion
> approach (not shown).

The prior.n values you're choosing are basically too large.  We generally recommend around 30-50 prior df.  Since you have 4 residual df per segment, this translates to prior.n=10.  You could even go a little lower, but not prior.n=0.

> When I look at the tagwise dispersion values
> for the top 10 hits, I find that the estimated tagwise dispersion
> values are greater that the estimated common dispersion (not shown).
> 
> To look into things further, I ran the same analysis but now with
> prior.n=0 (no moderation/squeezing).  The top 10 hits are now
> completely different and the estimated tagwise dispersion values for
> the top 10 are very small.  (Looking at the top 10 seems to suggest
> that I could use a Poisson distribution.)

We don't recommend no moderation.  There are just too few df to estimate the dispersion reliably for individual transcripts.  The top segments in such a list will naturally tend to have small dispersions, because you're essentially selecting for this.  It isn't necessarily evidence that the Poisson distribution is a good fit.

> Questions:
> 1.	Should I be concerned that the results are so different depending
> on whether common dispersion (almost equivalent to moderated tagwise
> dispersion) or no-moderation tagwise dispersion is used?  Based on
> FDR<0.05, there is only about 10% overlap between the two approaches.

No, this is to be expected.  You are comparing the two extremes of dispersion estimation.  The reason we developed moderated methods is that no-moderation is just not reliable.

> 2.	I’m not sure how to interpret the tagwise dispersion values for the
> top hits: common dispersion method picks up segments with large
> tagwise dispersion, no moderation method picks up segments with small
> tagwise dispersion.
> 
> I am using edgeR_1.4.7 with R version 2.10.1.

You'd probably be better off choosing a smaller prior.n which gives you a more even handed compromise between common and tagwise dispersion.

Hope that helps.

Cheers,
Mark

>> #COMMON DISPERSION APPROACH
>> library(edgeR)
>> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description)
>> df$samples
>      group lib.size
> C1        0  4488940
> C2        0  2437107
> C3        0  2600316
> T1        1  3935852
> T2        1  3806079
> T3        1  3913694
>> df <- estimateCommonDisp(df)
>> df$common.dispersion
> [1] 0.1346658
>> df.com<-exactTest(df)
> Comparison of groups:  1 – 0
>> CDtop10<-topTags(df.com)$table
>> CDtop10[,-1]
>     logConc     logFC       PValue          FDR
> 6145  -12.77011  7.490945 2.002637e-37 3.740325e-33
> 15580 -12.32428  6.621865 2.854360e-32 2.665544e-28
> 1565  -12.21365  6.311500 2.936737e-30 1.828315e-26
> 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24
> 1154  -13.69624 -5.326718 1.143794e-23 4.272527e-20
> 15630 -17.02012  5.975869 1.743145e-21 5.426120e-18
> 341   -18.60859 -6.565039 4.125655e-19 1.100784e-15
> 16351 -14.64956  4.565285 1.375746e-18 3.211850e-15
> 6468  -15.86990 -4.712918 3.248713e-18 6.741802e-15
> 4891  -16.96347  5.181179 7.436516e-18 1.388918e-14
>> CDtopIDs<-as.numeric(row.names(CDtop10))
>> df$counts[CDtopIDs,]
>       C1   C2   C3    T1    T2    T3
> [1,]  31   36   28    45    57 22440
> [2,]  77   45   61 22745    47    55
> [3,]  49   68   85    76   210 21738
> [4,]  35 3729   34    37    22    32
> [5,]  28   69 3636    25    25    89
> [6,]   4    2    3     7    25   668
> [7,]   2    3  188     1     0     2
> [8,]  51    8   23   301   296  1619
> [9,]  12   10  652    14    13    11
> [10,]   4    5    3   540     6    10
> 
> 
>> #TAGWISE DISPERSION APPROACH
>> fprior <- estimateSmoothing(df)
>> fprior
> [1] 6329.643
> 
> #I also tried prior.n=25 and prior.n=50, but results not shown.
>> df<-estimateTagwiseDisp(df, prior.n = 0)
>> quantile(df$tagwise.dispersion)
>          0%          25%          50%          75%         100%
> 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02
>> df.tgw<-exactTest(df,common.disp=FALSE)
>> TGWtop10<-topTags(df.tgw)$table
>> TGWtop10[,-1]
>        logConc      logFC       PValue          FDR
> 2659  -13.14601 -0.7454279 2.722274e-25 5.084391e-21
> 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18
> 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11
> 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11
> 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11
> 1172  -13.15737 -0.5535657 1.838202e-14 5.722016e-11
> 8678  -15.78098 -1.4312623 5.231726e-13 1.395899e-09
> 251   -15.03996 -0.8806583 1.137238e-12 2.655024e-09
> 8466  -14.50024 -0.7195308 1.861731e-12 3.863507e-09
> 8472  -15.35444 -0.9235374 5.519857e-12 1.030944e-08
>> TGWtopIDs<-as.numeric(row.names(TGWtop10))
>> df$counts[TGWtopIDs,]
>     C1   C2   C3    T1    T2    T3
> [1,] 685 346 337   340   340   313
> [2,] 382 199 256   121   103   142
> [3,]  97  59  55    29    36    35
> [4,] 171  91 111    87    73    72
> [5,]  54  24  35    12     8    14
> [6,] 629 295 345   354   352   347
> [7,] 138  58  84    41    47    38
> [8,] 200  89  96    93    75    87
> [9,] 231 138 157   149   115   128
> [10,] 145  87  81    74    75    53
> 
>> df$tagwise.dispersion[TGWtopIDs]
> [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
> [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
>> df$tagwise.dispersion[CDtopIDs]
> [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105
> [7] 3.2246046 0.6054731 2.0582920 2.1059294
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

------------------------------
Mark Robinson, PhD (Melb)
Epigenetics Laboratory, Garvan
Bioinformatics Division, WEHI
e: m.robinson at garvan.org.au
e: mrobinson at wehi.edu.au
p: +61 (0)3 9345 2628
f: +61 (0)3 9347 0852
------------------------------

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:6}}