[BioC] from TF DNA binding motif to downstream genes?

Sat Jan 16 00:00:35 CET 2010

Paul,
I have made a first attempt at solving the first part of your problem 
(mapping the motifs to the genome) and plan on making this easier to 
perform by adding a vmatchPDict method to the BSgenome package in BioC 
2.6. For now, here is some code that creates a RangedData object 
identifying the locations on the genome where the motifs match. You can 
then use findOverlaps against a RangedData object that contains the 
annotations that are of interest to you. Feedback is welcome.  - Patrick

## load the base libraries
library(Biostrings)
library(BSgenome)

## load the genome
library(BSgenome.Celegans.UCSC.ce2)

## create the motifs
data(HNF4alpha)

## -------------------------------------------------------------
## method for finding motif locations on genome
## the motifId column is an element identifier
## that relates back to the original motif set
matchFUN <- function(strings, chr) {
    posPDict <- strings
    negPDict <- reverseComplement(strings)
    posMatches <- matchPDict(pdict = posPDict, subject = chr)
    posCounts <- elementLengths(posMatches)
    negMatches <- matchPDict(pdict = negPDict, subject = chr)
    negCounts <- elementLengths(negMatches)
    strand <-
      strand(rep(c("+", "-"), c(sum(posCounts), sum(negCounts))))
    motifId <-
      c(rep(seq_len(length(posMatches)), posCounts),
        rep(seq_len(length(negMatches)), negCounts))  
    RangedData(c(unlist(posMatches), unlist(negMatches)),
               strand = strand, motifId = motifId)
}
bsParams <-
  new("BSParams", X = Celegans, FUN = matchFUN, simplify = TRUE)
matches <- bsapply(bsParams, strings = HNF4alpha)
nms <- names(matches)
matches <- do.call(c, unname(matches))
names(matches) <- nms
## -------------------------------------------------------------

 > matches
RangedData with 183 rows and 2 value columns across 7 spaces
         space               ranges |   strand   motifId
   <character>            <IRanges> | <factor> <integer>
1         chrI [10714238, 10714250] |        +         1
2         chrI [ 1746247,  1746259] |        +        33
3         chrI [11509260, 11509272] |        +        39
4         chrI [ 5249651,  5249663] |        +        48
5         chrI [ 5442409,  5442421] |        +        64
6         chrI [ 7949495,  7949507] |        +        64
7         chrI [ 2788492,  2788504] |        +        71
8         chrI [ 3853105,  3853117] |        +        71
9         chrI [ 6952606,  6952618] |        +        71
10        chrI [10242063, 10242075] |        -         1
...
<173 more rows>

Paul Shannon wrote:
> Can I get advice from on good ways to find genes -- perhaps with a high false-positive rate -- whose promoters contain known DNA binding motifs?
>
> Hu et al, "Profiling the Human Protein-DNA Interactome Reveals ERK2 as a Transcriptional Repressor of Interferon Signaling" identifies
>
>    17,718 PDIs [protein-DNA interactions] between 460 DNA motifs predicted to regulate 
>    transcription and 4,191 human proteins of various functional classes.
>
> I wish to take those 460 motifs -- many of them only 7 bases long -- and find the genes whose transcription they control.
>
> I suspect the answer lies in some artful use of Biostrings, BSgenome (which together provide efficient genome search), along with annotation to find the transcription start site of  known genes.   But before I start, I think it prudent to get the advice of those who may know more than me.
>
> Thanks!
>
>  - Paul
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>