[BioC] PWM matching using matchPWM in Biostrings
Hervé Pagès
hpages at fhcrc.org
Wed May 20 03:40:37 CEST 2009
Hi Kristoffer,
Not easy to read/reproduce your code since you don't show what the
c2s() or s2c() functions are. I guess c2s() is turning a character
vector into a single string, e.g. something like
c2s <- function(x) paste(x, collapse="")
and s2c() is exploding it e.g. something like
s2c <- function(x) strsplit(x, NULL, fixed=TRUE)[[1]]
more below...
Kristoffer Rigbolt wrote:
> Hi Bioconductor
>
>
>
> I was pleased to find the matchPWM function in the Biostrings package,
> but I have one fundamental question to the function, or maybe to the
> general approach.
>
>
>
> To test the specificity of the approach I started out from the example
> given in the matchPWM vignette:
>
>
>
> pwm <- rbind(A=c( 1, 0, 19, 20, 18, 1, 20, 7),
>
> C=c( 1, 0, 1, 0, 1, 18, 0, 2),
>
> G=c(17, 0, 0, 0, 1, 0, 0, 3),
>
> T=c( 1, 20, 0, 0, 0, 1, 0, 8))
>
>
>
> chr3R <- unmasked(Dmelanogaster$chr3R)
>
>
>
> Next, I generated a randomized DNA string with nucleotide frequencies
> identical to chr3R:
>
>
>
> propBP <- prop.table(table(s2c(as.character(chr3R))))
Note that alphabetFrequency(chr3R, baseOnly=TRUE, freq=TRUE)[1:4]
is *much* faster!
>
> chr3R.random <- sample(names(propBP), length(s2c(as.character(chr3R))),
> TRUE, propBP)
>
> chr3R.random <- DNAString(c2s(chr3R.random))
>
>
>
> To extract the number of occurrences of the PWM in the two DNA strings
> as a function of the minimum score threshold I used this for-loop:
>
>
>
> outMat <- cbind(seq(80,100,2), NA, NA)
>
> for(i in 1:nrow(outMat)){
>
> outMat[i,2] <- countPWM(pwm, chr3R, min.score=c2s(c(outMat[i,1],
> "%")))
>
> outMat[i,3] <- countPWM(pwm, chr3R.random,
> min.score=c2s(c(outMat[i,1], "%")))
>
> }
>
>
>
> As expected the number of matches decrease as the minimum score are
> increased, but much to my surprise an almost identical number of matches
> of the PWM in the D. melanogaster DNA string and the randomized DNA
> string occurred. I believe this corresponds to a FDR of 100%(!).
The 'pwm' given in the man page of matchPWM() (and that you are using here)
was chosen arbitrarily and is very short i.e. it has a small number of cols.
More precisely the consensus string associated with this 'pwm' is GTAAACAT
and the distribution of such a small pattern in a genome of the size
of the Fly genome can almost be considered random.
So I'm not too surprised that there is no significant difference between
chr3R and your chr3R.random.
>
>
>
>> outMat
>
> min.score chr3R randomized
>
> [1,] 80 52695 51122
>
> [2,] 82 52695 51122
>
> [3,] 84 43225 39996
>
> [4,] 86 27684 24943
>
> [5,] 88 14037 10459
>
> [6,] 90 2836 2435
>
> [7,] 92 2836 2435
>
> [8,] 94 2836 2435
>
> [9,] 96 2836 2435
>
> [10,] 98 1878 1376
>
> [11,] 100 660 722
>
>
>
> Based on this observation my question is whether this issue relates to a
> property of the matchPWM function itself or if it is an inherent
> property of using PWMs to predict transcription factor binding sites? If
> the latter is the case what is then the scientific value of this
> approach.
I don't think this is a property of the matchPWM() function itself.
You should get almost the same if you counted the occurences of GTAAACAT
with countPattern().
I think this is just due to the length of the pattern: there *must*
be a lot of occurences of such a short pattern. Furthermore, if you
counted the occurences of all the possible 7-mers in a given chromosome
(with oligonucleotideFrequency()), you would always get numbers very
close to what you get with a fake (i.e. randomly-generated) chromosome.
As long as the length of the chromosome is much bigger than 4^7 (at
least 10x or 20x bigger).
Cheers,
H.
>
>
>
> Hope you can help clarify this issue.
>
>
>
> Best regards
>
> Kristoffer Rigbolt
>
> University of Southern Denmark
>
>
>
>> sessionInfo()
>
> R version 2.9.0 (2009-04-17)
>
> i386-pc-mingw32
>
>
>
> locale:
>
> LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United
> Kingdom.1252;LC_MONETARY=English_United
> Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
>
>
>
> attached base packages:
>
> [1] stats graphics grDevices utils datasets methods base
>
>
>
>
> other attached packages:
>
> [1] seqinr_2.0-3
>
> [2] BSgenome.Dmelanogaster.UCSC.dm3_1.3.11
>
> [3] BSgenome_1.10.5
>
> [4] Biostrings_2.10.22
>
> [5] IRanges_1.0.16
>
>
>
> loaded via a namespace (and not attached):
>
> [1] grid_2.9.0 lattice_0.17-25 Matrix_0.999375-26 tools_2.9.0
>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list