[BioC] DNASuffixArray

Herve Pages hpages at fhcrc.org
Fri Sep 5 20:41:06 CEST 2008


Hi Gabriel,

Yes the DNASuffixArray() function has been put aside during the transition
from Biostrings 1 to Biostrings 2. When I took over the package as the main
maintainer in 2005, I redesigned the whole class hierarchy and ended up
having to rewrite almost everything to make things work with the new
infrastructure. Biostrings 1 had a single R file (String.R) that I kept
around in Biostrings 2 (in the Biostrings1/ folder): it serves as a reference
for what functionalities were available in Biostrings 1 and what still needs
to be ported to the 2 series.

DNASuffixArray() and LongestCommonPrefix() are in fact the only functionalities
that are still missing. I never took the time to put them back in the 2 series,
sorry. I guess my focus was on other functionalities: the development of the
package is mostly driven by people's most urgent needs and nobody asked about
DNASuffixArray() so far. Maybe because there are other packages from CRAN or
the Omegahat projects that are dedicated to Suffix Trees and Suffix Arrays.

Anyway, with today's Biostrings much revised code, you can build the suffix
array of a DNA sequence in one line:

   sort(DNAStringSet(views(x0, start=seq_len(nchar(x0)))))

Surprisingly this is even faster than DNASuffixArray() from Biostrings 1 which
was entirely implemented in C (it used a fancy LSD-first sorting algo).
With a 10M-bp random sequence:

   x0 <- DNAString(paste(sample(c("A", "C", "G", "T"), 10000000, replace=TRUE), collapse=""))

   With R-2.7.1 + Biostrings 1.4.0:

     sa1 <- DNASuffixArray(x0)
     ==> Takes: 327 sec. on a laptop with 1GB /  242 sec. on a 64-bit Linux server

   With R-2.8 + Biostrings 2.9.69:

     sa2 <- sort(DNAStringSet(views(x0, start=seq_len(nchar(x0)))))
     ==> Takes: 77 sec. on a laptop with 1GB / 24.5 sec. on a 64-bit Linux server

So depending on the hardware, the Biostrings 2 solution is between 4 and 10
times faster than DNASuffixArray().

Note that:
  o The "sort" method for XStringSet objects was just added yesterday to the
    devel version of Biostrings (2.9.69). It uses qsort() from the standard C
    lib. Other related methods ("duplicated", "unique", "order", etc) will
    follow soon.
  o The order defined between DNA sequences has changed between Biostrings 1
    and Biostrings 2:
      in Biostrings 1: T < G < C < A
      in Biostrings 2: A < C < G < T
    so elements in 'sa1' and 'sa2' are ordered differently.
  o The sort-based solution is not limited to DNA sequences (DNAString objects):
    it also works with BString, RNAString and AAString objects.

I'll put DNASuffixArray() and LongestCommonPrefix() back in the package in
the next days. Then all Biostrings 1 functionalities will be back and I will
finally be able to remove the Biostrings1/ folder.

Cheers,
H.


Gabriel Valiente wrote:
> I post it again, in the hope Herve Pages is already back. Thanks,
> 
> Gabriel
> 
>> The guy who is most on top of this is undoubtedly Herve Pages from 
>> FHCRC, since he wrote it. He is on vacation until Sep. 1st. Of course, 
>> there may be other people with details on this.
>>
>> Kasper
>>
>> On Aug 29, 2008, at 10:35 AM, Gabriel Valiente wrote:
>>
>>> Dear all,
>>>
>>> there was a DNASuffixArray function call to C code (biostring.c) in 
>>> package
>>>
>>> Biostrings 1.1.1 (BioConductor 1.6)
>>> Biostrings 1.4.0 (BioConductor 1.7)
>>> Biostrings 2.0.3 (BioConductor 1.8)
>>> Biostrings 2.2.1 (BioConductor 1.9)
>>>
>>> but then, still a function call but no C code anymore, in package
>>>
>>> Biostrings 2.4.8 (BioConductor 2.0)
>>> Biostrings 2.6.6 (BioConductor 2.1)
>>> Biostrings 2.8.18 (BioConductor 2.2)
>>> Biostrings 2.9.66 (BioConductor 2.3)
>>>
>>> Can someone please briefly explain what happened to DNASuffixArray? 
>>> Beside being interested in using it, I would also like to properly 
>>> cite this implementation in a forthcoming book and thus, need to be 
>>> sure about any details. Thanks,
>>>
>>> Gabriel
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor



More information about the Bioconductor mailing list