[BioC] edgeR: common vs tagwise dispersion

Gordon K Smyth smyth at wehi.EDU.AU
Thu Apr 22 01:19:51 CEST 2010


Dear Ann,

On 2010-04-21, at 12:18 AM, Ann Hess wrote:

> I am using edgeR to look for differentially abundant segments between 
> two groups (data generated using high throughput sequencing). I have 3 
> (pooled) biological reps per group and a total of 18760 segments (83 
> rows with zero count are removed by edger).
>
> As a first approach, I used the common dispersion method and found the 
> estimated common dispersion to be 0.135.

Your common dispersion value shows substantial biological variation 
between your replicates, with a CV of nearly 40% in expression levels 
between samples.  This is typical of what we have observed when the 
biological replicates are separate individuals or animals.

> After looking at the top 10 segments, I find that there tends to be a 
> single large value (different for each segment) that is bringing up the 
> logFC.

We've seen this in other datasets also.  It seems to be a type of 
biological variation that RNA-seq makes very evident.

> I tried using moderated tagwise dispersion (using prior.n=50 and 25) and 
> found that the results are largely the same as common dispersion 
> approach (not shown).

The prior.n values you're choosing are basically too large.  We generally 
recommend around 30-50 prior df.  Since you have 4 residual df per 
segment, this translates to prior.n=10.  You could even go a little lower, 
but not prior.n=0.

> When I look at the tagwise dispersion values for the top 10 hits, I find 
> that the estimated tagwise dispersion values are greater that the 
> estimated common dispersion (not shown).
>
> To look into things further, I ran the same analysis but now with 
> prior.n=0 (no moderation/squeezing).  The top 10 hits are now completely 
> different and the estimated tagwise dispersion values for the top 10 are 
> very small.  (Looking at the top 10 seems to suggest that I could use a 
> Poisson distribution.)

We don't recommend no moderation.  There are just too few df to estimate 
the dispersion reliable for individual transcripts.  The top segments in 
such a list will naturally tend to have small dispersions, because you're 
essentially selecting for this.  It isn't good evidence that the Poisson 
distribution is a good fit.

> Questions:
> 1.    Should I be concerned that the results are so different depending
> on whether common dispersion (almost equivalent to moderated tagwise
> dispersion) or no-moderation tagwise dispersion is used?  Based on
> FDR<0.05, there is only about 10% overlap between the two approaches.

No, this is to be expected.  The reason we developed moderated methods is 
that no-moderation is just not reliable.

> 2.  Im not sure how to interpret the tagwise dispersion values for the 
> top hits: common dispersion method picks up segments with large tagwise 
> dispersion, no moderation method picks up segments with small tagwise 
> dispersion.

In a genome wide experiment, some segments will have large and some will 
have very small sample variances just by chance.  Furthermore, some 
features are genuinely hypervariable between individuals.  The best 
strategy is likely to be to choose a smaller prior.n, which will give you 
a more even handed compromise between common and tagwise dispersion. 
This will downweight the segments with very large variation between 
individuals in the toplist, while still giving stable estimates of tagwise 
dispersion.

Best wishes
Gordon

-----------------------------------------------
Associate Professor Gordon K Smyth,
NHMRC Senior Research Fellow,
Bioinformatics Division, 
Walter and Eliza Hall Institute of Medical Research, 
1G Royal Parade, Parkville, Vic 3052, Australia.
Tel: (03) 9345 2326, Fax (03) 9347 0852,
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth



> I am using edgeR_1.4.7 with R version 2.10.1.
>
>> #COMMON DISPERSION APPROACH
>> library(edgeR)
>> df <- DGEList(counts=Reads, group=c(0,0,0,1,1,1), genes=Annotation$Description)
>> df$samples
>      group lib.size
> C1        0  4488940
> C2        0  2437107
> C3        0  2600316
> T1        1  3935852
> T2        1  3806079
> T3        1  3913694
>> df <- estimateCommonDisp(df)
>> df$common.dispersion
> [1] 0.1346658
>> df.com<-exactTest(df)
> Comparison of groups:  1 ? 0
>> CDtop10<-topTags(df.com)$table
>> CDtop10[,-1]
>     logConc     logFC       PValue          FDR
> 6145  -12.77011  7.490945 2.002637e-37 3.740325e-33
> 15580 -12.32428  6.621865 2.854360e-32 2.665544e-28
> 1565  -12.21365  6.311500 2.936737e-30 1.828315e-26
> 15718 -13.94448 -6.050904 6.517136e-28 3.043014e-24
> 1154  -13.69624 -5.326718 1.143794e-23 4.272527e-20
> 15630 -17.02012  5.975869 1.743145e-21 5.426120e-18
> 341   -18.60859 -6.565039 4.125655e-19 1.100784e-15
> 16351 -14.64956  4.565285 1.375746e-18 3.211850e-15
> 6468  -15.86990 -4.712918 3.248713e-18 6.741802e-15
> 4891  -16.96347  5.181179 7.436516e-18 1.388918e-14
>> CDtopIDs<-as.numeric(row.names(CDtop10))
>> df$counts[CDtopIDs,]
>       C1   C2   C3    T1    T2    T3
> [1,]  31   36   28    45    57 22440
> [2,]  77   45   61 22745    47    55
> [3,]  49   68   85    76   210 21738
> [4,]  35 3729   34    37    22    32
> [5,]  28   69 3636    25    25    89
> [6,]   4    2    3     7    25   668
> [7,]   2    3  188     1     0     2
> [8,]  51    8   23   301   296  1619
> [9,]  12   10  652    14    13    11
> [10,]   4    5    3   540     6    10
>
>
>> #TAGWISE DISPERSION APPROACH
>> fprior <- estimateSmoothing(df)
>> fprior
> [1] 6329.643
>
> #I also tried prior.n=25 and prior.n=50, but results not shown.
>> df<-estimateTagwiseDisp(df, prior.n = 0)
>> quantile(df$tagwise.dispersion)
>          0%          25%          50%          75%         100%
> 1.001001e-03 2.151338e-02 7.667087e-02 1.715599e-01 9.990000e+02
>> df.tgw<-exactTest(df,common.disp=FALSE)
>> TGWtop10<-topTags(df.tgw)$table
>> TGWtop10[,-1]
>        logConc      logFC       PValue          FDR
> 2659  -13.14601 -0.7454279 2.722274e-25 5.084391e-21
> 11865 -14.21354 -1.4925866 5.769090e-22 5.387465e-18
> 13066 -16.14612 -1.3689788 2.176835e-15 1.355225e-11
> 12381 -15.12798 -0.9772265 5.676831e-15 2.650654e-11
> 17206 -17.37537 -2.0234616 1.808245e-14 5.722016e-11
> 1172  -13.15737 -0.5535657 1.838202e-14 5.722016e-11
> 8678  -15.78098 -1.4312623 5.231726e-13 1.395899e-09
> 251   -15.03996 -0.8806583 1.137238e-12 2.655024e-09
> 8466  -14.50024 -0.7195308 1.861731e-12 3.863507e-09
> 8472  -15.35444 -0.9235374 5.519857e-12 1.030944e-08
>> TGWtopIDs<-as.numeric(row.names(TGWtop10))
>> df$counts[TGWtopIDs,]
>     C1   C2   C3    T1    T2    T3
> [1,] 685 346 337   340   340   313
> [2,] 382 199 256   121   103   142
> [3,]  97  59  55    29    36    35
> [4,] 171  91 111    87    73    72
> [5,]  54  24  35    12     8    14
> [6,] 629 295 345   354   352   347
> [7,] 138  58  84    41    47    38
> [8,] 200  89  96    93    75    87
> [9,] 231 138 157   149   115   128
> [10,] 145  87  81    74    75    53
>
>> df$tagwise.dispersion[TGWtopIDs]
> [1] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
> [6] 0.001001001 0.011153172 0.001001001 0.001001001 0.001001001
>> df$tagwise.dispersion[CDtopIDs]
> [1] 3.1369561 3.0528706 2.6123364 2.4261316 2.4261316 1.8815105
> [7] 3.2246046 0.6054731 2.0582920 2.1059294


______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}



More information about the Bioconductor mailing list