[BioC] totalTest number in ChipPeakAnno

Tue Nov 16 19:41:48 CET 2010

Hello Binbin,

It would help to describe your problem and post it to the whole mailing list.  (There are many experts there who can probably be more helpful than I can :-))  That said, I think you are referring to the "NaN" error, and my thoughts are below.  (Julie Zhu has also answered this a couple of times, and her replies are probably in the archives.)

When calling the makeVennDiagram function, you want to set totalTest to a number larger than the experimentally determined peak number.  As far as I know, totalTest is used as the population size in the hypergeometric test that determines whether the overlap between two datasets is larger than would be expected by chance.  So one way to sort this out using biological information is to think about the maximum number of possible binding events and use that as totalTest.  For example, if you are studying a sequence-specific DNA-binding protein with a known motif, you could count the number of times that motif occurs in the genome and compare that to the number of peaks you have experimentally determined.

Motifs                          = 500
Peaks                           = 200
Peaks w/ motif                  = 180 (90%)
"upper limit"                   = 500
new "upper limit" for totalTest = 0.9 x 500 = 450
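
The arithmetic above can be written as a small helper. This is only a sketch of the heuristic described in this message, not anything taken from ChIPpeakAnno; the function name and arguments are made up for illustration.

```python
def motif_based_total_test(motif_sites, peaks, peaks_with_motif):
    """Scale the genome-wide motif count by the fraction of peaks that
    contain the motif (hypothetical helper for the heuristic above)."""
    if peaks <= 0 or not 0 <= peaks_with_motif <= peaks:
        raise ValueError("need 0 <= peaks_with_motif <= peaks and peaks > 0")
    total_test = round(motif_sites * peaks_with_motif / peaks)
    # totalTest should end up larger than the observed peak number.
    if total_test <= peaks:
        raise ValueError("estimate is not larger than the observed peak number")
    return total_test


# Numbers from the example: 500 motifs, 200 peaks, 180 peaks with a motif.
print(motif_based_total_test(500, 200, 180))  # 450
```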

Now if you're working with a sequence-independent binding factor, it can get tricky.  One approach would be to determine the mean peak width, then divide the genome size by this number to get an upper limit.  This is probably way too high, so using additional information, such as whether the protein binds intergenic regions or ORFs, could bring the number down and make it more relevant to the biological experiment.  For example:

peaks               = 75
intergenic peaks    = 70
ORF peaks           = 5
mean peak width     = 50 base pairs
genome size         = 10000 base pairs
"upper limit"       = 10000 / 50 = 200 (possible peaks)
intergenic seq      = 4000 base pairs
new "upper limit"   = 4000 / 50 = 80 (possible intergenic peaks)
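
The same calculation as a short sketch, assuming the upper limit is simply the number of non-overlapping windows of mean peak width that fit in the sequence. The helper name is hypothetical and the numbers are the toy values from the example.

```python
def tiling_upper_limit(sequence_length_bp, mean_peak_width_bp):
    """Number of non-overlapping peak-sized windows that fit in a sequence
    (hypothetical helper for the heuristic above)."""
    if mean_peak_width_bp <= 0:
        raise ValueError("mean peak width must be positive")
    return sequence_length_bp // mean_peak_width_bp


whole_genome_limit = tiling_upper_limit(10000, 50)  # whole genome
intergenic_limit = tiling_upper_limit(4000, 50)     # intergenic sequence only
print(whole_genome_limit, intergenic_limit)         # 200 80
```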

I was working with something more like the second case.  I felt that a totalTest based on the whole genome was quite relaxed and one based on the intergenic sequence only was quite stringent, so somewhere in the middle might be better.  Most importantly, I feel I am standing on some solid biological reasoning for determining the amount of sampling.
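
To see why the choice matters, here is a rough sketch of a hypergeometric upper-tail p-value computed for two different totalTest values. It assumes totalTest plays the role of the population size, as described above; it is not ChIPpeakAnno's code, and the second peak set (60 peaks) and the overlap (58) are invented numbers for illustration only.

```python
from math import comb


def overlap_p_value(overlap, n_a, n_b, total_test):
    """P(X >= overlap) when n_b sites are drawn from total_test possible
    sites, n_a of which belong to peak set A (hypothetical helper)."""
    if total_test < max(n_a, n_b):
        raise ValueError("totalTest must be at least as large as each peak set")
    upper = min(n_a, n_b)
    # Count draws with at least `overlap` sites from A; comb() returns 0
    # for combinations that are impossible.
    favourable = sum(
        comb(n_a, i) * comb(total_test - n_a, n_b - i)
        for i in range(overlap, upper + 1)
    )
    return favourable / comb(total_test, n_b)


# Invented data: set A has 75 peaks, set B has 60 peaks, 58 of them overlap.
p_whole_genome = overlap_p_value(58, 75, 60, 200)  # totalTest from whole genome
p_intergenic = overlap_p_value(58, 75, 60, 80)     # totalTest from intergenic only
print(p_whole_genome, p_intergenic)
```

With these invented numbers the whole-genome totalTest gives a far smaller p-value than the intergenic one, which is the relaxed-versus-stringent behaviour described above.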

Hope this helps, and I would be interested to hear if anybody has critiques of this approach or additional suggestions.

Best,

Noah

On Nov 16, 2010, at 7:31 AM, Binbin Liu wrote:

> Dear Noah,
> 
> I saw your post on the Bioconductor mailing list regarding the totalTest number for the P-val calculation in ChIPpeakAnno::makeVennDiagram(). I am having the same problem. Can I ask how you got it sorted?
> 
> 
> Many thanks.
> 
> Binbin