[BioC] strange results with edgeR::goodTuring

Francois Pepin francois.pepin at sequentainc.com
Tue Aug 28 19:32:28 CEST 2012


Thanks for checking it out.

I'll see if I can find the bug or just work around it.

I'm not actually using edgeR in this case other than this function. I was only looking for a an existing implementation of the Good-Turing method and found it in edgeR.

François

On Aug 28, 2012, at 3:32 , Gordon K Smyth wrote:

> Hi Francois,
> 
> Well, looks like your data example has exposed a bug in my R 
> implementation of the Good-Turing algorithm.  I just ran the original C 
> code for the algorithm on your data, and it gives the following output:
> 
> 0	0
> 312	0.0001363
> 14491	0.006316
> 16401	0.007149
> 65124	0.02839
> 129797	0.05657
> 323321	0.1409
> 366051	0.1595
> 368599	0.1607
> 405261	0.1766
> 604962	0.2637
> 
> I'll have to think about what to do about this.  I don't really have time 
> to track down the bug.  We could bring C code into edgeR instead, but the 
> original C code would need some porting.  The R code gives identical 
> results to the C for longer vectors with a more typical pile-up of 
> frequencies.
> 
> I wonder what you mean when you say you want to estimate what kind of 
> pseudo counts to use.  In edgeR terminology, the pseudo counts are 
> computed internally, and the user doesn't get to choose them.
> 
> Best wishes
> Gordon
> 
>> Date: Mon, 27 Aug 2012 12:00:19 -0700
>> From: "Francois Pepin" <francois.pepin at sequentainc.com>
>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>> Subject: [BioC] strange results with edgeR::goodTuring
>> 
>> Hi everyone,
>> 
>> I'm trying to use the goodTuring function in edgeR to estimate what kind 
>> of pseudocounts to use and I'm getting strange results with small number 
>> of categories:
>> 
>> x<-c(312,14491,16401,65124,129797,323321,366051,368599,405261,604962)
>> y<- goodTuring(x)
>> y
>> $count
>> [1]    312  14491  16401  65124 129797 323321 366051 368599 405261 604962
>> 
>> $proportion
>> [1] 0 0 0 0 0 0 0 0 0 1
>> 
>> $P0
>> [1] 0
>> 
>> $n0
>> [1] 0
>> 
>> 
>> If I'm understanding this properly, y$proportion is telling me that I 
>> should expect all my counts to fall under the last category, which does 
>> not make sense. I would expect something pretty close to x/sum(x) 
>> instead.
>> 
>> This is a bit of a toy example and I'm mostly interested in cases where 
>> I have more categories but it would be nice if this could work in all 
>> cases.
>> 
>> sessionInfo()
>> R version 2.15.1 (2012-06-22)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>> 
>> locale:
>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>> [7] LC_PAPER=C                 LC_NAME=C
>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>> 
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>> 
>> other attached packages:
>> [1] edgeR_2.6.9   limma_3.12.1  dataframe_2.5
>> 
>> 
>> Thanks,
>> 
>> Fran?ois
> 
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:7}}



More information about the Bioconductor mailing list