[BioC] matching transcription factor binding sites

Herve Pages hpages at fhcrc.org
Sat Apr 19 02:41:03 CEST 2008


Hi Hans-Ulrich,

matchPWM() is now part of Biostrings 2.7.47 (in BioC devel, so you need R-2.7).

Load Biostrings:

   library(Biostrings)

Suppose 'pwm' contains a Position Weight Matrix, let's say:

   pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
                C=c( 1,  0,  1,  0,  1, 18,  0,  2),
                G=c(17,  0,  0,  0,  1,  0,  0,  3),
                T=c( 1, 20,  0,  0,  0,  1,  0,  8))

Note that this is just a standard integer matrix with the 4 DNA base letters
as row names (having these row names is mandatory).
Some low-level utility functions are available for manipulating this kind of
matrix:

   > maxWeights(pwm)  # the max weight in each column
   [1] 17 20 19 20 18 18 20  8

   > maxScore(pwm)  # the max possible score
   [1] 140

Let's match 'pwm' against Human chr1:

   library(BSgenome.Hsapiens.UCSC.hg18)
   chr1 <- Hsapiens$chr1

Number of "best" matches:

   > countPWM(pwm, chr1, min.score="100%")  # takes about 5 seconds on my system
   [1] 5152

With a lower cut-off value:

   m <- matchPWM(pwm, chr1, min.score="90%")

See the 10 first matches ("first" means "smallest chromosome location", NOT "best"
matches):

   > m[1:10]
     Views on a 247249719-letter DNAString subject
   subject: TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
   views:
        start   end width
    [1] 31931 31938     8 [GTAAACAA]
    [2] 33324 33331     8 [GTAAACAT]
    [3] 38425 38432     8 [GTAAACAG]
    [4] 39177 39184     8 [GTAAACAC]
    [5] 46971 46978     8 [GTAAACAT]
    [6] 49952 49959     8 [GTAAACAT]
    [7] 70381 70388     8 [GTAAACAG]
    [8] 74359 74366     8 [GTAAACAC]
    [9] 90714 90721     8 [GTAAACAT]
   [10] 96544 96551     8 [GTAAACAC]

The speed could be improved, maybe by a factor 2 (or more, for longest PWMs).

Also maybe an additional argument could be added to let the user control how
the returned matches should be sorted ("left-to-right" or "best-first")?

Unlike MatInspector or the transfac-tool, there is no facility yet to suggest
individual cut-off values depending on the length of the PWMs.

See '?matchPWM' for more information (e.g. how to search the minus strand of
the chromosome).

Please let me know how it works for you.

Cheers,
H.


Hans-Ulrich Klein wrote:
> Hi Herve,
> 
> Herve Pages wrote:
>> Hans-Ulrich Klein wrote:
>>> I want to locate transcription factor binding sites (tfbs) within a 
>>> given sequence. The tfbs are derived from databases like transfac or 
>>> jaspar and are described by matrices. Are there algorithms for 
>>> locating tfbs matches (e.g. "matinspector") implemented in 
>>> bioconductor? I could not find one.
>>
>> I assume that your matrices are Position Weight Matrices?
> 
> yes. I meant position weight matrices.
> 
>> There is no facility in the Biostrings package for matching PWM to
>  > a DNA sequence but that would be easy to add. In fact, I've
>> already fully described how to implement such facility
>> in a separate package and on top of Biostrings basic containers (i.e. 
>> DNAString objects) during the lab I gave for the "Advanced R
>> for Bioinformatics" course back in February this year:
>>
>>   http://bioconductor.org/workshops/2008
>>
>> Follow "Advanced R for Bioinformatics" ->  "Interfaces to C (Lab)"
>>
>> The simpleMatchPWM_0.99.0.tar.gz package contains the matchPWM() 
>> function for finding all matches of a PWM in a given sequence.
>> Unfortunately, the package was depending on a devel version
>> of Biostrings that has changed since then, and
>> those changes broke simpleMatchPWM 0.99.0. Let me know if this is what 
>> you are looking for and I'll fix the package (this should
>> be straightforward).
> 
> It is quite close to what I am looking for. I have access to the 
> transfac database including a web based tool for finding PWM matches. I 
> am looking for an alternative to the web tool in R for two reasons:
> 
> 1. I have done preceding analysis in R and will do follow-up analysis in 
> R. It would be nice to avoid the effort for data export and import.
> 2. I have not found a detailed description of the algorithm used by the 
> web tool.
> 
> So simpleMatchPWM is at least a good starting point, as it does all the 
> basic score computations. Why not integrate the matchPWM function in the 
> Biostring package? I would appreciate it.
> However, most algorithms (like MatInspector or the transfac-tool) 
> implement some heuristics to improve results. E.g., they suggest 
> individual cut-off values depending on the length of the pwms. I am not 
> sure whether I have enough time and knowledge to add such functionalities.
> 
> Best wishes,
> Hans-Ulrich
> 
> PS: Has someone experiences with the bioperl package "TFBS"?



More information about the Bioconductor mailing list