[BioC] Comparing DNAStringSetLists

Wed Oct 16 07:45:24 CEST 2013

Hi Vince,

Sounds like maybe you have a use case for a pintersect method for
DNAStringSetList objects:

   setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"),
     function(x, y, ...)
     {
         if (length(x) != length(y))
             stop("'x' and 'y' must have the same length")

         ux <- unlist(x, use.name=FALSE)
         uy <- unlist(y, use.name=FALSE)
         string_id <- selfmatch(c(ux, uy))
         ux_id <- string_id[seq_along(ux)]
         uy_id <- string_id[seq_along(uy) + length(ux)]
         ux_group <- rep.int(seq_along(x), elementLengths(x))
         uy_group <- rep.int(seq_along(y), elementLengths(y))
         m <- IRanges:::matchIntegerPairs(ux_group, ux_id, uy_group, uy_id)
         keep_idx <- which(!is.na(m))
         ux <- ux[keep_idx]
         ux_group <- ux_group[keep_idx]
         ux_id <- ux_id[keep_idx]
         sm <- IRanges:::selfmatchIntegerPairs(ux_group, ux_id)
         keep_idx <- sm == seq_along(sm)
         ux <- ux[keep_idx]
         ux_group <- ux_group[keep_idx]
         ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x))
         relist(ux, ans_skeleton)
     }
   )

Then:

   > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), 
c("T", "G"))

   > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", 
c("G", "T"))

   > pintersect(alleles1, alleles2)
   DNAStringSetList of length 4
   [[1]] A
   [[2]] A
   [[3]]   A DNAStringSet instance of length 0
   [[4]] T G

Should take about 20 sec. on your 12-million long vectors.

Then use elementLengths() on this result to get the lengths of the
intersecting sets.

HTH,
H.

On 10/15/2013 04:16 PM, Vince S. Buffalo wrote:
> Hi All,
>
> I have two vectors of alleles stored as DNAStringSetLists. For each element
> in both lists, I need to find the length of the intersecting set. Using
> mapply() and intersect() take too long, as does sapply(dna.set.list,
> as.character) (and then using mclapply or lapply to find intersect on
> characters). Is there a fast way to do this? I have vectors ~12 million
> rows long.
>
> Vince
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319