[BioC] strange results with edgeR::goodTuring

Gordon K Smyth smyth at wehi.EDU.AU
Thu Sep 6 09:26:45 CEST 2012


Dear Francois,

Thanks to new C code from Aaron Lun, I have now committed a fixed version 
of goodTuring() to edgeR on the BioC devel repository.

Best wishes
Gordon

---------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
http://www.statsci.org/smyth

On Tue, 28 Aug 2012, Francois Pepin wrote:

> Thanks for checking it out.
>
> I'll see if I can find the bug or just work around it.
>
> I'm not actually using edgeR in this case other than this function. I 
> was only looking for a an existing implementation of the Good-Turing 
> method and found it in edgeR.
>
> François
>
> On Aug 28, 2012, at 3:32 , Gordon K Smyth wrote:
>
>> Hi Francois,
>>
>> Well, looks like your data example has exposed a bug in my R
>> implementation of the Good-Turing algorithm.  I just ran the original C
>> code for the algorithm on your data, and it gives the following output:
>>
>> 0	0
>> 312	0.0001363
>> 14491	0.006316
>> 16401	0.007149
>> 65124	0.02839
>> 129797	0.05657
>> 323321	0.1409
>> 366051	0.1595
>> 368599	0.1607
>> 405261	0.1766
>> 604962	0.2637
>>
>> I'll have to think about what to do about this.  I don't really have time
>> to track down the bug.  We could bring C code into edgeR instead, but the
>> original C code would need some porting.  The R code gives identical
>> results to the C for longer vectors with a more typical pile-up of
>> frequencies.
>>
>> I wonder what you mean when you say you want to estimate what kind of
>> pseudo counts to use.  In edgeR terminology, the pseudo counts are
>> computed internally, and the user doesn't get to choose them.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Mon, 27 Aug 2012 12:00:19 -0700
>>> From: "Francois Pepin" <francois.pepin at sequentainc.com>
>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>> Subject: [BioC] strange results with edgeR::goodTuring
>>>
>>> Hi everyone,
>>>
>>> I'm trying to use the goodTuring function in edgeR to estimate what kind
>>> of pseudocounts to use and I'm getting strange results with small number
>>> of categories:
>>>
>>> x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
>>> y<- goodTuring(x)
>>> y
>>> $count
>>> [1]    312  14491  16401  65124 129797 323321 366051 368599 405261 604962
>>>
>>> $proportion
>>> [1] 0 0 0 0 0 0 0 0 0 1
>>>
>>> $P0
>>> [1] 0
>>>
>>> $n0
>>> [1] 0
>>>
>>>
>>> If I'm understanding this properly, y$proportion is telling me that I
>>> should expect all my counts to fall under the last category, which does
>>> not make sense. I would expect something pretty close to x/sum(x)
>>> instead.
>>>
>>> This is a bit of a toy example and I'm mostly interested in cases where
>>> I have more categories but it would be nice if this could work in all
>>> cases.
>>>
>>> sessionInfo()
>>> R version 2.15.1 (2012-06-22)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>> [1] edgeR_2.6.9   limma_3.12.1  dataframe_2.5
>>>
>>>
>>> Thanks,
>>>
>>> Fran?ois

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:5}}


More information about the Bioconductor mailing list