[BioC] ChIPpeakAnno-makeVennDiagram question

Wed Aug 21 20:03:35 CEST 2013

Dear Camila,

I have sent you the response to your earlier email. For your convenience, I
also pasted here.

I think you could use the 1% accessibility data to modify the number of
sites in your genome wide motif search number as I suggested below.

Alternatively, you could try set totalTest  = peak1 + peak2 - peaks.1and2
where peaks.1and2 is the number of peaks common to set1 and set2.

Best regards,

Julie

########### start of previous reply #######

For the ChIPpeakAnno paper, I have the peaks with all p-values.

If you do not have this kind of information, then the most conservative
estimate of totalTest would be (peak1 + peak2) - peaks.1and2. The smaller
the totalTest you set, the higher the p-value is. So the p-value you obtain
will be the most conservative estimate. If this approach is too conservative
for your purpose, then you could set the totalTest to be the number of
potential binding sites in the genome by searching for the motif (could be
obtained from the literature or your peak sets). Certainly if accessibility
data is available, then you could limit your search within those accessible
regions.

Hope this helps.

Best regards,

Julie 

On 8/21/13 1:40 PM, "Camila Lopez-anido" <lopezanido at wisc.edu> wrote:

> Hi again, 
> 
> Based on findings from ENCODE (2012), another approach I thought of was to
> assume that 1% (conservative estimate) of the rat genome contains open
> chromatin... and use that number for the totalTest value? Since the rat genome
> is 2,909,698,938bp, then totalTest (i.e. 1%) = 29096989.38
> 
> 
> However, the p-value = 0 and I'm not sure if I can publish this p-value or
> would need to figure out a way to get at the actual p-value, even though it's
> really really small?
> 
> 
> Any thoughts/insight would be appreciated!
> 
> 
> Thanks again, 
> Camila 
> 
> On 08/21/13, "Camila Lopez-anido"  wrote:
>> Thanks! 
>> 
>> I'm still considering different systematic approaches to determine the
>> "totalTest" value, which I'm considering as the total possible transcription
>> factor (TF) binding events in the rat genome. I saw that you previously said
>> that one could add up the total number of peaks with p-value=1 or
>> FDR=1 (http://permalink.gmane.org/gmane.science.biology.informatics.conductor
>> /30115), but I don't have access to the raw data.
>> 
>> 
>> One idea I had was to *assume* that the total number of peaks with p-value=1
>> is 1.5X greater than those called significant.
>> SO, if peak1 set = 7602 peaks, and peak2 set = 25335 peaks, then could I use
>> totalTest = (7602 + 25335 = 32937) * 1.5 = 49405.5
>> 
>> 
>> If I use totalTest = (peak1 + peak2) * 1.5, then I could apply this
>> systematic approach to make different comparisons between various datasets.
>> This seems like a better idea than arbitrarily picking a value greater than
>> the largest peak data set. Does this make sense, or would you recommend a
>> different approach/assumptions depending on the size of the genome (or area
>> of open chromatin accessible for TF binding)?
>> 
>> 
>> Also, I was trying to figure out how the value 1580 was chosen for totalTest
>> in the Zhu (2010) ChIPpeakAnno manuscript? Was a systematic approach used
>> similar to one I tried to come up with above?
>> 
>> 
>> Best, 
>> Camila 
>> 
>> 
>> 
>> On 08/15/13, "Zhu, Lihua (Julie)"
>> wrote:
>>> Camila,
>>> 
>>> TotalTest needs to be at least as large as the number of peaks in
>>> CNSallPeaks_rd and OPColig2Peaks_rd. For example, if there are 1000 peaks in
>>> CNSallPeaks_rd and 2000 peaks in OPColig2Peaks_rd, then totalTest needs to
>>> be >= 2000.
>>> 
>>> For details, please refer to an old post at
>>> http://grokbase.com/t/r/bioconductor/123cz0jc0b/bioc-question-about-makevenn
>>> diagram. Thanks!
>>> 
>>> Best regards,
>>> 
>>> Julie
>>> 
>>> 
>>> On 8/15/13 6:08 PM, "Camila Lopez-anido" <lopezanido at wisc.edu> wrote:
>>> 
>>>> Hi, 
>>>> 
>>>> I'm a graduate student at the UW-Madison, and I'm getting an error message
>>>> when I use makeVennDiagram() -I was wondering if there is a simple
>>>> solution? I
>>>> saw that someone else had asked about something related (similar message)
>>>> online, but I couldn't find how they fixed the problem?
>>>> 
>>>> 
>>>> I inputed two RangedData files, which I successfully performed
>>>> findOverlappingPeaks() with:
>>>> 
>>>> 
>>>>> makeVennDiagram(RangedDataList(CNSallPeaks_rd, OPColig2Peaks_rd),
>>>>> NameOfPeaks=c("CNSall","OPColig2"), maxgap=0, minoverlap=1, totalTest=100,
>>>>> cex=1, counts.col="red", useFeature=FALSE)
>>>> 
>>>> 
>>>> #Error in seq.default(cnt, length.out = counts[i]) :
>>>> #length must be non-negative number
>>>> #In addition: Warning message:
>>>> #In phyper(p1.and.p2 - 1, p2, totalTest - p2, p1, lower.tail = FALSE, :
>>>> #NaNs produced
>>>> 
>>>> 
>>>> 
>>>> Is there a simple solution to this?
>>>> 
>>>> 
>>>> Thanks, 
>>>> Camila Lopez-Anido