[BioC] PWMmatch: position weight matrix or position frequency matrix?

Hervé Pagès hpages at fhcrc.org
Tue Feb 22 19:59:18 CET 2011


Hi Zuzanna,

On 02/17/2011 06:59 AM, Zuzanna Makowska wrote:
> Dear List,
>
> I have a question regarding the matchPWM function of the Biostrings package.
>
> The help page for the function states that it requires a position weight matrix as input. At the same time, I found a post on the list giving the following example of the use of the function:
>
> (quoting:
> 		[BioC] matching transcription factor binding sites
>
> 		Herve Pages hpages at fhcrc.org
>
> 		Sat Apr 19 02:41:03 CEST 2008)
>
> 		Suppose 'pwm' contains a Position Weight Matrix, let's say:
>
> 		pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
> 		             C=c( 1,  0,  1,  0,  1, 18,  0,  2),
> 		             G=c(17,  0,  0,  0,  1,  0,  0,  3),
> 		             T=c( 1, 20,  0,  0,  0,  1,  0,  8))
>
> 		Note that this is just a standard integer matrix with the 4 DNA base letters
> 		as row names (having these row names is mandatory).
> 		m <- matchPWM(pwm, chr1, min.score="90%")
>
> It seems to me that the matrix in this example is a position frequency matrix and not a position weight matrix (the difference between the two is explained nicely in: Applied Bioinformatics for the identification of regulatory elements; WW Wasserman & A Sandelin, Nat Rev Genet, 2004).

Thanks for the pointer to Wasserman & Sandelin's paper.

I confirm that the matchPWM() function expects the input to be
a position *weight* matrix. What may make the 'pwm' object above
look like a position *frequency* matrix is that, unlike the matrices
in Wasserman & Sandelin's paper, it contains only non-negative
integer weights. Furthermore, all the columns sum to the same value:

   > colSums(pwm)
   [1] 20 20 20 20 20 20 20 20

I understand that this is indeed misleading.

But 'pwm' could also be something like:

pwm <- rbind(A=c(0.06, -0.02,  0.30),
              C=c(0.00,  0.17,  0.00),
              G=c(0.03,  0.05,  0.12),
              T=c(0.22, -0.01,  0.08))

Whatever its values, the matrix is treated by the matchPWM() function
as a position-specific scoring matrix. You can check this by computing
the score at a few given positions:

 > PWMscoreStartingAt(pwm, DNAString("TTCAA"), starting.at=1:3)
[1] 0.21 0.69 0.28
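
Each of these scores is just the sum, over the columns of 'pwm', of the
weight of the letter found at the corresponding position of the window.
Here is a minimal base R sketch that reproduces them without Biostrings
(score_window() is a hypothetical helper written for this illustration,
it is not part of the package):

```r
pwm <- rbind(A=c(0.06, -0.02,  0.30),
             C=c(0.00,  0.17,  0.00),
             G=c(0.03,  0.05,  0.12),
             T=c(0.22, -0.01,  0.08))

# Hypothetical helper: score the window of 'subject' that starts at 'start'
# by summing pwm[letter, column] over the columns of the matrix.
score_window <- function(pwm, subject, start) {
  window <- substr(subject, start, start + ncol(pwm) - 1)
  window_letters <- strsplit(window, "")[[1]]
  # One (row, column) index pair per position in the window.
  idx <- cbind(match(window_letters, rownames(pwm)), seq_len(ncol(pwm)))
  sum(pwm[idx])
}

scores <- sapply(1:3, function(i) score_window(pwm, "TTCAA", i))
print(round(scores, 2))
```

For example, the window starting at position 2 is TCA, so its score is
0.22 + 0.17 + 0.30 = 0.69.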

Then, as you can see, matchPWM() returns the match at position 2, the
position that produces the highest score (the other two positions score
below the min.score cutoff, so they are not reported):

 > matchPWM(pwm, DNAString("TTCAA"))
   Views on a 5-letter DNAString subject
subject: TTCAA
views:
     start end width
[1]     2   4     3 [TCA]
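
To see why only this position is reported, here is a rough base R sketch
of the cutoff. It assumes that min.score defaults to "80%" and that the
percentage is taken relative to the highest score the matrix can
produce; please check ?matchPWM for your version of Biostrings before
relying on this:

```r
pwm <- rbind(A=c(0.06, -0.02,  0.30),
             C=c(0.00,  0.17,  0.00),
             G=c(0.03,  0.05,  0.12),
             T=c(0.22, -0.01,  0.08))

# Scores at starting positions 1:3 of "TTCAA", as reported by
# PWMscoreStartingAt() earlier in this message.
scores <- c(0.21, 0.69, 0.28)

# Highest score the matrix can produce: best weight in each column.
max_score <- sum(apply(pwm, 2, max))

# Assumed cutoff: 80% of the highest possible score.
threshold <- 0.80 * max_score

# Starting positions whose score reaches the cutoff.
hits <- which(scores >= threshold)
print(hits)
```

Under these assumptions the cutoff is about 0.55, and only the window
starting at position 2 reaches it.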

Finally, note that the Biostrings package doesn't provide a tool
to convert a position frequency matrix (which can be obtained with
consensusMatrix()) into a position weight matrix.
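
If you need such a conversion, one common approach is the log-odds
transformation: add a pseudocount to the counts, turn them into
per-column probabilities, and take the log2 ratio against a background
distribution. The sketch below is only an illustration of that idea in
base R. pfm_to_pwm() is a hypothetical helper, and the uniform
background and the pseudocount of 1 are arbitrary assumptions, not
values recommended by Biostrings or by the paper. The integer matrix
from the 2008 post is treated as counts purely for the sake of the
example:

```r
pfm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
             C=c( 1,  0,  1,  0,  1, 18,  0,  2),
             G=c(17,  0,  0,  0,  1,  0,  0,  3),
             T=c( 1, 20,  0,  0,  0,  1,  0,  8))

# Hypothetical helper: convert counts to log2-odds weights.
# 'background' and 'pseudocount' are illustrative assumptions.
pfm_to_pwm <- function(pfm,
                       background = c(A=0.25, C=0.25, G=0.25, T=0.25),
                       pseudocount = 1) {
  n <- colSums(pfm)
  # Corrected per-column probabilities; the pseudocount avoids log2(0).
  prob <- sweep(pfm + pseudocount, 2, n + nrow(pfm) * pseudocount, "/")
  # Background is recycled down the rows, so row i is divided by the
  # background frequency of its own letter.
  log2(prob / background[rownames(pfm)])
}

pwm2 <- pfm_to_pwm(pfm)
print(round(pwm2, 2))
```

The result has the same dimensions and row names as the input, with
positive weights for letters that are over-represented at a position
and negative weights for those that are under-represented.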

Hope this helps,
H.

>
> Could somebody clarify what is the expected input for this function?
>
> Thanks in advance,
>
> Zuzanna Makowska


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


