[BioC] PDict in GenomicRanges

Sun Aug 5 01:31:14 CEST 2012

Haiying,

Let's keep this on the Bioconductor mailing list so others can give
advice.

On 08/04/2012 03:41 PM, Haiying Kong wrote:
> Dear Dr. Pagès,
>
>    Thank you very much for your explanation. It is very helpful, indeed.
>
>    Could you please help me by any chance for the other part of the work?
>    I need to split a big SFF file into a number of small SFF files.
>    I was able to read in the big SFF file as a big SFFContainer with the
> function readSFF in the Bioconductor package:R453Plus1Toolbox. I also
> fond the index to split the big SFFContainer. For example, the index
> (5,11,17,22,25,29,31,33,37) for the first small SFFContainer.
>    I couldn't figure out how to extract the
> (5,11,17,22,25,29,31,33,37)th reads from the big SSFContainer and
> construct a small SFFContainer and save it as an SFF file. Could you
> please tell me by any chance how to do this part of work? Are there any
> Bioconductor functions you could recommend?

I'm not familiar with SFF files or with the R453Plus1Toolbox package
but it doesn't seem like the package is providing something like a
writeSFF() function.

If all you want to do is split the file into smaller files, I would
check some 3rd party tools like DNA Baser SFF Workbench, which,
according to their website, can "Cut a SFF file in smaller pieces".
Don't know what kind of license they have though and what platforms
they support in addition to Windows. Anyway other people on this list
might have better suggestions.

Cheers,
H.

>
>    Thank you very much in advance.
>
>    Best Regards,
>
>    Haiying
>
> On Sun, Aug 5, 2012 at 6:34 AM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Dear Haiying,
>
>
>     On 08/04/2012 05:33 AM, Haiying Kong wrote:
>
>         Dear Dr. Pages,
>
>             I am trying to create a PDict object with a DNAStringSet to
>         use on
>         countPDict function. The sequences in the DNAStringSet are having
>         different lengths. It seems like this causes a problem.
>             Should I classify the sequences in the DNAStringSet with their
>         lengths and create multiple number of PDict objects?
>
>
>     That would work but you could also preprocess only the rectangular
>     DNAStringSet obtained by trimming the sequences to the length of
>     the shortest. This implies 2 things: (1) the shortest sequence
>     is not too short otherwise performance will degrade a lot (depending
>     on the number of "patterns" in your DNAStringSet and the length of
>     the "subject", should be at least 10 nucleotides for 1 million patterns
>     to match against Human chr1), (2) you want to do exact matching.
>
>     If those 2 conditions are satisfied, then it's very easy:
>
>        > dna <- DNAStringSet(c("AACAT", "TCCAGAAAA", "GTCCACA"))
>        > dna
>          A DNAStringSet instance of length 3
>            width seq
>        [1]     5 AACAT
>        [2]     9 TCCAGAAAA
>        [3]     7 GTCCACA
>
>        > pdict <- PDict(dna, tb.end=min(width(dna)))
>        > pdict
>        TB_PDict object of length 3 and variable width (preprocessing
>     algo="ACtree2"):
>          - with NO head
>          - with a Trusted Band of width 5
>          - with a tail of variable width (min=0 / max=4)
>
>     Here is what the above means. The original DNAStringSet was
>     split into 3 pieces:
>
>                 Trusted
>           head \ Band  / tail
>                 |     |
>                 |AACAT|
>                 |TCCAG|AAAA
>                 |GTCCA|CA
>
>     Only the "Trusted Band" (which is rectangular) is actually
>     preprocessed (i.e. stored in the Aho-Corasick tree).
>     Then:
>
>        > countPDict(pdict, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"))
>        [1] 1 0 2
>
>     Should take less than 3 minutes on Human chr1 with 1 million patterns,
>     a min pattern length >= 10 nucleotides (and an average pattern length
>     of 25).
>
>     This drops to less than 1 minute on Human chr1 with 1 million patterns,
>     a min pattern length >= 21 nucleotides (and an average pattern length
>     of 85).
>
>     Note that if you call matchPDict() or countPDict() with a non-zero
>     max.mismatch value, then the mismatches will be allowed on the tail
>     *only*.
>
>     Finally, note that if you don't have too many patterns (< 10000),
>     you can avoid the preprocessing entirely by calling matchPDict()
>     or countPDict() directly on your DNAStringSet object:
>
>        > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"))
>        [1] 1 0 2
>        > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"),
>     max.mismatch=1)
>        [1] 1 1 2
>        > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"),
>     max.mismatch=2)
>        [1] 4 2 3
>        > countPDict(dna, DNAString("__GTCCACAACATTCCAGAAAGGTCCACA"),
>     max.mismatch=2, with.indels=TRUE)
>        [1] 5 2 3
>
>     In that case, mismatches are allowed anywhere and you can even allow
>     indels if you want. Of course, this will be *much* slower than when
>     preprocessing with PDict().
>
>     Please let me know if you have any further questions about this.
>
>     Cheers,
>     H.
>
>
>
>             Thank you very much in advance.
>
>             Best Regards,
>
>             Haiying
>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319