[BioC] PDict in GenomicRanges

Hervé Pagès hpages at fhcrc.org
Sun Aug 5 00:34:30 CEST 2012


Dear Haiying,

On 08/04/2012 05:33 AM, Haiying Kong wrote:
> Dear Dr. Pages,
>
>    I am trying to create a PDict object with a DNAStringSet to use on
> countPDict function. The sequences in the DNAStringSet are having
> different lengths. It seems like this causes a problem.
>    Should I classify the sequences in the DNAStringSet with their
> lengths and create multiple number of PDict objects?

That would work but you could also preprocess only the rectangular
DNAStringSet obtained by trimming the sequences to the length of
the shortest. This implies 2 things: (1) the shortest sequence
is not too short otherwise performance will degrade a lot (depending
on the number of "patterns" in your DNAStringSet and the length of
the "subject", should be at least 10 nucleotides for 1 million patterns
to match against Human chr1), (2) you want to do exact matching.

If those 2 conditions are satisfied, then it's very easy:

   > dna <- DNAStringSet(c("AACAT", "TCCAGAAAA", "GTCCACA"))
   > dna
     A DNAStringSet instance of length 3
       width seq
   [1]     5 AACAT
   [2]     9 TCCAGAAAA
   [3]     7 GTCCACA

   > pdict <- PDict(dna, tb.end=min(width(dna)))
   > pdict
   TB_PDict object of length 3 and variable width (preprocessing 
algo="ACtree2"):
     - with NO head
     - with a Trusted Band of width 5
     - with a tail of variable width (min=0 / max=4)

Here is what the above means. The original DNAStringSet was
split into 3 pieces:

            Trusted
      head \ Band  / tail
            |     |
            |AACAT|
            |TCCAG|AAAA
            |GTCCA|CA

Only the "Trusted Band" (which is rectangular) is actually
preprocessed (i.e. stored in the Aho-Corasick tree).
Then:

   > countPDict(pdict, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"))
   [1] 1 0 2

Should take less than 3 minutes on Human chr1 with 1 million patterns,
a min pattern length >= 10 nucleotides (and an average pattern length
of 25).

This drops to less than 1 minute on Human chr1 with 1 million patterns,
a min pattern length >= 21 nucleotides (and an average pattern length
of 85).

Note that if you call matchPDict() or countPDict() with a non-zero
max.mismatch value, then the mismatches will be allowed on the tail
*only*.

Finally, note that if you don't have too many patterns (< 10000),
you can avoid the preprocessing entirely by calling matchPDict()
or countPDict() directly on your DNAStringSet object:

   > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"))
   [1] 1 0 2
   > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), 
max.mismatch=1)
   [1] 1 1 2
   > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), 
max.mismatch=2)
   [1] 4 2 3
   > countPDict(dna, DNAString("GTCCACAACATTCCAGAAAGGTCCACA"), 
max.mismatch=2, with.indels=TRUE)
   [1] 5 2 3

In that case, mismatches are allowed anywhere and you can even allow
indels if you want. Of course, this will be *much* slower than when
preprocessing with PDict().

Please let me know if you have any further questions about this.

Cheers,
H.

>
>    Thank you very much in advance.
>
>    Best Regards,
>
>    Haiying


-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list