[BioC] edgeR outlier question

Gordon K Smyth smyth at wehi.EDU.AU
Sun May 13 12:16:10 CEST 2012


I don't agree that switching to purely genewise dispersion estimates is 
the best solution.  One can of course achieve this in edgeR using a very 
small prior.n, but I think there is still plenty of benefit to be had from 
borrowing information between genes at sample sizes of 10 and above.

The beauty of the empirical Bayes moderation approach of edgeR is that the 
dispersion estimators transition smoothly from nearly global estimators at 
small n to nearly purely genewise at large n.  This is a natural 
consequence of the fact that the prior stays roughly constant while the 
amount of information in the data increases.  This smooth transition seems 
preferable to me than having to switch between very different 
estimation strategies from one sample size to another.
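
The transition described above can be illustrated with a small Python 
sketch. It is a deliberately simplified linear analogue and not edgeR's 
actual algorithm (edgeR works with a weighted likelihood). The function 
name, the prior_df parameter and its value of 10 are assumptions chosen 
purely for illustration.

```python
# Simplified illustration only: this is NOT edgeR's weighted-likelihood
# procedure. prior_df is a hypothetical, fixed prior weight expressed in
# degrees of freedom; its value here is arbitrary.

def genewise_weight(n_samples, n_groups=2, prior_df=10.0):
    """Fraction of the moderated estimate that comes from the gene's own data."""
    residual_df = n_samples - n_groups  # information available for one gene
    return residual_df / (residual_df + prior_df)

# The prior weight stays fixed while the data weight grows with sample size,
# so the estimate moves smoothly from nearly global to nearly genewise.
for n in (4, 6, 20, 100, 1000):
    print(f"n = {n:4d}  weight on genewise estimate = {genewise_weight(n):.3f}")
```

Under these assumptions the weight on the genewise estimate rises 
steadily with n, with no point at which the strategy has to be switched.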

Note that edgeR moderates genewise dispersions both up and down towards a 
global estimate, so it doesn't necessarily "shrink".  Some genewise 
estimates are decreased while others are increased, and the latter is just 
as important as the former.  I prefer to call it "moderation" or 
"smoothing" or "squeezing".
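
A minimal Python sketch of moderation in both directions follows. As a 
loud caveat, it uses a simple weighted average as a stand-in for edgeR's 
weighted likelihood, and the function name, prior_df and all numeric 
values are hypothetical.

```python
# Simplified illustration only: a linear weighted average standing in for
# edgeR's weighted-likelihood moderation. All names and values are made up.

def moderate(genewise, global_estimate, residual_df, prior_df=10.0):
    """Pull a genewise dispersion part of the way towards the global value."""
    w = residual_df / (residual_df + prior_df)  # weight on the gene's own data
    return genewise + (1.0 - w) * (global_estimate - genewise)

global_estimate = 0.10  # hypothetical global dispersion
for genewise in (0.01, 0.10, 0.50):
    moderated = moderate(genewise, global_estimate, residual_df=18)
    if moderated > genewise:
        direction = "increased"
    elif moderated < genewise:
        direction = "decreased"
    else:
        direction = "unchanged"
    print(f"genewise = {genewise:.2f}  moderated = {moderated:.3f}  ({direction})")
```

Estimates below the global value are raised and estimates above it are 
lowered, which is why "shrinkage" describes only half of what happens.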

Gordon


> Date: Tue, 08 May 2012 00:40:49 +0200
> From: Wolfgang Huber <whuber at embl.de>
> To: bioconductor at r-project.org
> Subject: Re: [BioC] edgeR outlier question
> Message-ID: <4FA84F71.8020806 at embl.de>
> Content-Type: text/plain; charset=ISO-8859-1; format=flowed
>
> Dear Simon, Alessandro
>
> I assume that the inference that you are referring to is based on
> shrunken (empirical Bayes) estimates of the dispersions. Perhaps what
> you are observing is that the shrinkage turns out to be too strong for
> your data - such that genes with large empirical dispersion (driven by
> the 'outliers') get their estimate shrunk too much, while their apparent
> fold change is not shrunk, which would make them appear significant.
>
> With 10+10 replicates you do not need (and probably don't want) to
> shrink your dispersion estimates, you can just use the empirical values.
> Others are better qualified to point out how to best achieve this with
> edgeR. (In DESeq, this is controlled by the parameter 'sharingMode' of
> the 'estimateDispersions' function, which you could set to
> 'gene-est-only'. Its default is 'maximum', which we find useful for
> situations with fewer replicates.)
>
> Also, to improve power, it is always advisable to perform 'independent
> filtering' of genes before the testing, in order to weed out genes that
> anyway have negligible chance of being differentially expressed. This
> concept is explained in [1]. A suitable filter criterion would be e.g.
> the median of a gene's values across samples (irrespective of condition!).
>
> [1] Richard Bourgon et al., Independent filtering increases detection
> power for high-throughput experiments. PNAS 2010
> http://www.pnas.org/content/107/21/9546.long
>
> 	Best wishes
> 	Wolfgang
>
>
>
>
> May/7/12 9:19 PM, Simon Melov scripsit::
>> I have a reasonable RNASeq data set of 10 biological replicates of a
>> control group versus 10 biological replicates experimental. I've gone
>> through the edgeR workflow, and get a nice list of about 1000 genes
>> differentially expressed due to the experimental manipulation. I
>> input the data based on total reads per gene (I'd like to get to
>> exons too, but first things first). The data is obtained via a paired
>> end strategy, so it's pretty good quality. The number of reads per
>> sample (library) is about 10 million reads each. My question is, as I
>> go through list of significant genes which are differentially
>> expressed between the two groups  (normalized via the workflow),
>> ranked by BH FDR down to 0.05, I see genes being judged as
>> differentially expressed which have very low expression in most
>> samples, yet are thrown off by 1 or 2 values, thereby achieving
>> statistical significance. For example, a gene might have between 1
>> and 2 counts per million reads in one group, and be basically the
>> same in the other group, but one of the values is perhaps at a 1000
>> or so counts, which seems to throw off the entire group, thereby
>> becoming "significant".
>>
>> Shouldn't edgeR take into account this sort of biological variation
>> within a group and account for it in assessing significance? It's
>> clear that in the above example, that sample is an outlier, and
>> therefore the variance is so high, so it shouldn't be ranked as being
>> differentially expressed. I filtered the data by applying the
>> criteria of at least 1 count per sample, and I have to have at least
>> 8 samples per group which have this. Should there be an additional
>> filtering criterion to exclude these outliers? Or doesn't edgeR take
>> into account this sort of situation (I thought it did).
>>
>> Am I doing something wrong here?
>
