[BioC] Help with promoter analysis

Alex Gutteridge alexg at ruggedtextile.com
Mon Feb 27 12:34:17 CET 2012


On 27.02.2012 11:17, Davy wrote:
> Hi all,
> Hoping someone could give me a bit of direction here.
>
> I have a set of genes which are all members of the same pathway.
>
> I want to identify if there are any transcription factor binding sites
> (TFBS) in the "promoters" (so far defined as 5kb upstream of the TSS)
> that are more common to genes among the pathway.
>
> I have managed to get the 5kb upstream using biomaRt (although the
> query throws an intermittent error, moaning about the upstream_flank
> filter, doesn't happen all the time, it's weird!)
>
> I also managed to download all the JASPAR matrices, parse the file for
> only human ones and convert them into position weight matrices.
>
> Lastly, I have produced a table of counts of each human TFBS motif in
> each of my genes using countPWM(pwm, seq, cutoff="90%")
>
> This is as far as I have gotten and am simply wondering what do I do
> next. From some reading the hypergeometric distribution is used in
> this situation but I am not sure what metrics to place in as the white
> balls drawn, total white balls, black balls etc., for those of you
> familiar with the hypergeometric distribution.
>
> I read that perhaps I should compare to a background set of genes,
> some sources say all other genes. This seems like overkill.
>
> Any help is appreciated.
> Cheers,
> Davy

Hi Davy,

Your second paragraph is a little vague ('more common' than what?). But 
if the question is whether a given motif appears more often in your 
pathway genes than one would expect by chance from a random sampling of 
genes from the genome, then the hypergeometric test seems appropriate. 
The nature of the white/black balls depends a little on how you 
initially selected your genes and the precise question you wish to ask, 
but essentially it will be:

White balls: All genes in the genome (or other background set) that 
contain your motif
Black balls: All genes in the genome (or other background set) that 
don't contain your motif
Balls drawn: All genes in your pathway
White balls drawn: Genes in your pathway that contain the motif
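
As a rough sketch of how those four quantities could be pulled out of a 
per-gene count table for a single motif (the gene names, counts and 
variable names below are made up for illustration, and the pathway genes 
are assumed to be part of the background set):

```r
# Hypothetical motif hit counts per gene for ONE motif, e.g. as produced
# by countPWM over each promoter sequence. Names and values are invented.
background_counts <- c(geneA = 0, geneB = 2, geneC = 1,
                       geneD = 0, geneE = 0, geneF = 3)

# Hypothetical pathway membership; these genes must also be in the background.
pathway_genes <- c("geneB", "geneC", "geneD")

# A gene "contains the motif" if it has at least one hit.
has_motif <- background_counts > 0

white       <- sum(has_motif)                  # background genes with the motif
black       <- sum(!has_motif)                 # background genes without the motif
drawn       <- length(pathway_genes)           # genes in the pathway
white_drawn <- sum(has_motif[pathway_genes])   # pathway genes with the motif

print(c(white = white, black = black, drawn = drawn, white_drawn = white_drawn))
```

Treating any gene with one or more hits as "containing" the motif is 
just one possible choice; a stricter threshold on the count would slot 
in at the has_motif line.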

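So if 1000 genes contain the motif, there are 30,000 genes in the 
genome, 20 genes in the pathway and 10 genes in the pathway contain the 
motif then the call to phyper would be:

> phyper(10,1000,30000-1000,20,lower.tail=FALSE)
[1] 6.820356e-12

Note that with lower.tail=FALSE phyper returns P(X > q), so the call 
above is the probability of drawing more than 10 motif-containing genes. 
For the probability of 10 or more, pass 10-1 as the first argument.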
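
A minimal sketch of the same calculation with named quantities, using 
the example numbers above. It computes both the "more than 10" and the 
"10 or more" tail and cross-checks the latter against a direct sum of 
point probabilities from dhyper. The variable names are only for 
illustration:

```r
# Example numbers from the text above
white       <- 1000           # genes in the genome containing the motif
black       <- 30000 - 1000   # genes in the genome without the motif
drawn       <- 20             # genes in the pathway
white_drawn <- 10             # pathway genes containing the motif

# With lower.tail = FALSE, phyper(q, ...) returns P(X > q).
p_more_than <- phyper(white_drawn, white, black, drawn, lower.tail = FALSE)

# Subtracting 1 from q gives P(X >= white_drawn), the usual enrichment p-value.
p_at_least <- phyper(white_drawn - 1, white, black, drawn, lower.tail = FALSE)

# Cross-check: sum the point probabilities for white_drawn..drawn directly.
p_check <- sum(dhyper(white_drawn:drawn, white, black, drawn))

print(c(more_than = p_more_than, at_least = p_at_least, check = p_check))
```

If this is repeated for every motif in the table, the resulting p-values 
would normally need some multiple-testing adjustment (for example with 
p.adjust) before being interpreted.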

-- 
Alex Gutteridge


