[BioC] Help with promoter analysis

Yuan Hao yuan.x.hao at gmail.com
Mon Feb 27 13:20:23 CET 2012


Hi,

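I think the first argument of "phyper()" is the quantile q, and with
lower.tail=FALSE it returns P(X > q), not P(X >= q). To get the
probability of 10 or more pathway genes containing the motif, the call
should use 9 rather than 10:
 > phyper(9,1000,30000-1000,20,lower.tail=FALSE)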
[1] 2.209245e-10
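
As a sanity check on the off-by-one, here is a minimal sketch using only
base R (stats package). The variable names are mine and purely
illustrative; the numbers are the ones from Alex's example. It sums the
upper tail directly from the point probabilities and compares that with
the two phyper() calls:

```r
# Numbers from the example in this thread (variable names are illustrative)
m <- 1000          # white balls: genes in the genome that contain the motif
n <- 30000 - 1000  # black balls: genes that do not contain the motif
k <- 20            # balls drawn: genes in the pathway
x <- 10            # white balls drawn: pathway genes that contain the motif

# phyper(q, ..., lower.tail = FALSE) is P(X > q), so P(X >= x) needs q = x - 1
p_tail <- phyper(x - 1, m, n, k, lower.tail = FALSE)

# The same probability as an explicit sum of P(X = i) for i = x, ..., k
p_sum <- sum(dhyper(x:k, m, n, k))

# Passing x itself gives P(X > x), which drops the P(X = x) term
p_strict <- phyper(x, m, n, k, lower.tail = FALSE)

all.equal(p_tail, p_sum)
all.equal(p_tail - p_strict, dhyper(x, m, n, k))
```

Both comparisons should come out TRUE, which is just another way of
saying that the two calls differ by exactly the P(X = 10) term.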

Or, you can use 'fisher.test()':
 > fisher.test(matrix(c(10,20-10,1000-10,30000-1000-20+10),2,2),
 +             alternative="greater")$p.value
[1] 2.209245e-10
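
To make it easier to see which count goes in which cell, the same table
can be built from named quantities. This is only a sketch: the variable
names and dimnames labels are my own, and the counts are again the ones
from the example in this thread.

```r
# Counts from the example in this thread (names are illustrative)
genome     <- 30000  # genes in the background set
with_motif <- 1000   # background genes that contain the motif
pathway    <- 20     # genes in the pathway
hits       <- 10     # pathway genes that contain the motif

# matrix() fills column by column: first the "in" pathway column, then "out"
tab <- matrix(c(hits,              pathway - hits,
                with_motif - hits, genome - with_motif - (pathway - hits)),
              nrow = 2,
              dimnames = list(motif = c("yes", "no"),
                              pathway = c("in", "out")))

# The four cells must add up to the whole background set
stopifnot(sum(tab) == genome)

# One-sided Fisher test for enrichment of the motif among pathway genes
p_fisher <- fisher.test(tab, alternative = "greater")$p.value

# Hypergeometric upper tail P(X >= hits) for comparison
p_phyper <- phyper(hits - 1, with_motif, genome - with_motif, pathway,
                   lower.tail = FALSE)

all.equal(p_fisher, p_phyper)
```

With one such table per motif, the resulting p-values would normally
still need a multiple-testing correction, for example with p.adjust().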

Cheers,
Yuan


On 27 Feb 2012, at 11:34, Alex Gutteridge wrote:

> On 27.02.2012 11:17, Davy wrote:
>> Hi all,
>> Hoping someone could give me a bit of direction here.
>>
>> I have a set of genes which are all members of the same pathway.
>>
>> I want to identify if there are any transcription factor binding sites
>> (TFBS) in the "promoters" (so far defined as 5kb upstream of the TSS)
>> that are more common to genes among the pathway.
>>
>> I have managed to get the 5kb upstream using biomaRt (although the query
>> throws an intermittent error, moaning about the upstream_flank filter,
>> doesn't happen all the time, it's weird!)
>>
>> I also managed to download all the JASPAR matrices, parse the file for
>> only human ones and convert them into position weight matrices.
>>
>> Lastly, I have produced a table of counts of each human TFBS motif in
>> each of my genes using countPWM(pwm, seq, cutoff="90%")
>>
>> This is as far as I have gotten and am simply wondering what do I do next.
>> From some reading the hypergeometric distribution is used in this situation
>> but I am not sure what metrics to place in as the white balls drawn, total
>> white balls, black balls etc., for those of you familiar with the
>> hypergeometric distribution.
>>
>> I read that perhaps I should compare to a background set of genes, some
>> sources say all other genes. This seems like overkill.
>>
>> Any help is appreciated.
>> Cheers,
>> Davy
>
> Hi Davy,
>
> Your second paragraph is a little vague/confusing ('more common'  
> than what?). But if the question is does a given motif appear more  
> often in your pathway genes than one would expect by chance from a  
> random sampling of genes from the genome then the hypergeometric  
> seems appropriate. The nature of the white/black balls depends a  
> little on how you initially selected your genes and the precise  
> question you wish to ask, but essentially it will be:
>
> White balls: All genes in the genome (or other background set) that  
> contain your motif
> Black balls: All genes in the genome (or other background set) that  
> don't contain your motif
> Balls drawn: All genes in your pathway
> White balls drawn: Genes in your pathway that contain the motif
>
> So if 1000 genes contain the motif, there are 30,000 genes in the  
> genome, 20 genes in the pathway and 10 genes in the pathway contain  
> the motif then the call to phyper would be:
>
> > phyper(10,1000,30000-1000,20,lower.tail=F)
> [1] 6.820356e-12
>
> -- 
> Alex Gutteridge
>


