[BioC] Creat PWM from Alignment with Gaps

Hervé Pagès hpages at fhcrc.org
Tue Oct 1 20:53:03 CEST 2013

Hi Lax,

On 09/25/2013 09:52 AM, Lakshmanan Iyer wrote:
> In Bioconductor, Is it possible to create a PWM from DNA sequence alignment
> with gaps to be used to scan sequences?
> When I try it as suggested in "Biostrings" it gives the following error:
>> pwm <- PWM(pfm)
> Error in .normargPfm(x) :
>    invalid PFM 'x': IUPAC ambiguity letters are represented
> The help for "PWM" does say "with no IUPAC ambiguity letters" and Gaps are
> ambiguity letters.

Gaps in alignments are not represented with ambiguity letters but with
the "-" letter:

   > library(BSgenome.Dmelanogaster.UCSC.dm3)
   > pairwiseAlignment("ACCACTATCCATTACGGGCCCAGT", 
unmasked(Dmelanogaster$chr2L), type="global-local")
   Global-Local PairwiseAlignmentsSingleSubject (1 of 1)
   subject: [1216159] ACCACTAT-CATTACGG--CCAGT
   score: 9.616879

So let's say you want to compute the consensus matrix from a
collection of DNA sequences that contain alignment gaps:

   > dna <- DNAStringSet(c("CTAATGG--CCAGT",
   > cm <- consensusMatrix(dna)
   > dim(cm)
   [1] 17 14
   > rownames(cm)
    [1] "A" "C" "G" "T" "M" "R" "W" "S" "Y" "K" "V" "H" "D" "B" "N" "-" "+"

If you try to pass this matrix directly to PWM(), it will be rejected
with the error you reported because PWM() only accepts a position
frequency matrix with 4 rows: 1 for each DNA base letter.

   > PWM(cm)
   Error in .normargPfm(x) :
     invalid PFM 'x': IUPAC ambiguity letters are represented

I realize the error message is mis-leading (because "-" is not an
ambiguity letter) and I'm going to modify it to make it a little bit
more informative.

To work around this, one might want to subset 'cm' to keep only the
A, C, G, and T rows before passing it to PWM(). Unfortunately that
doesn't work either because the Wasserman & Sandelin algo used
internally by PWM() to compute a Position Weight Matrix from a Position
Frequency Matrix expects the latter to have all its columns sum up to
the same value:

   > PWM(cm[DNA_BASES, ])
   Error in .normargPfm(x) :
     invalid PFM 'x': all columns in 'x' must sum to the same value.
     If the PFM was obtained by calling consensusMatrix() on a DNAStringSet
     object, please make sure that this object is rectangular (i.e. has a
     constant width).

To me this sounds like a legitimate situation for bypassing the
pfm-to-pwm transformation and use the pfm directly as a pwm:

   pwm <- cm[DNA_BASES, ]
   matchPWM(pwm, subject)

Hope this helps,

>> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-apple-darwin10.8.0 (64-bit)
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
> other attached packages:
> [1] Biostrings_2.28.0  IRanges_1.18.3     BiocGenerics_0.6.0
> loaded via a namespace (and not attached):
> [1] stats4_3.0.1 tools_3.0.1
> -best
> -Lax
> 	[[alternative HTML version deleted]]
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319

More information about the Bioconductor mailing list