[BioC] Comparing DNAStringSetLists

Martin Morgan mtmorgan at fhcrc.org
Wed Oct 16 06:54:47 CEST 2013

On 10/15/2013 04:16 PM, Vince S. Buffalo wrote:
> Hi All,
> I have two vectors of alleles stored as DNAStringSetLists. For each element
> in both lists, I need to find the length of the intersecting set. Using
> mapply() and intersect() takes too long, as does sapply(dna.set.list,
> as.character) (and then using mclapply or lapply to find intersect on
> characters). Is there a fast way to do this? I have vectors ~12 million
> rows long.

Here are a couple of hacky solutions. First, create an index i1 into one of the lists, l1

   i1 <- rep(seq_along(l1), elementLengths(l1))

then create artificial alleles that are tagged by the element id

   x1 <- paste0(unlist(l1), i1)
   x2 <- paste0(unlist(l2), rep(seq_along(l2), elementLengths(l2)))

and count how many of x1 are in x2, grouped by i1

   tabulate(i1[x1 %in% x2], nbins = length(l1))
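
As a rough illustration of the tagging trick above, here is a minimal sketch on toy data. It assumes plain lists of character vectors instead of DNAStringSetLists so that it runs in base R without Biostrings. The toy alleles, the ":" separator, and the use of lengths() in place of elementLengths() are my own choices for the example, not part of the original suggestion.

```r
# Toy stand-ins for the two allele lists (plain character vectors,
# not DNAStringSetLists; an assumption made for illustration only).
l1 <- list(c("A", "AT"), c("G", "C"), c("T"), c("A", "C"))
l2 <- list(c("A", "G"), c("C", "G"), c("T"), c("G", "T"))

# Element id for every allele in the flattened lists.
i1 <- rep(seq_along(l1), lengths(l1))
i2 <- rep(seq_along(l2), lengths(l2))

# Tag each allele with its element id so that a match can only
# happen between alleles from the same list position.
x1 <- paste(unlist(l1), i1, sep = ":")
x2 <- paste(unlist(l2), i2, sep = ":")

# Count matches per element; nbins keeps trailing elements that
# have no shared alleles in the result as zeros.
counts <- tabulate(i1[x1 %in% x2], nbins = length(l1))

# Cross-check against the slow element-wise approach.
expected <- unname(mapply(function(a, b) length(intersect(a, b)), l1, l2))
stopifnot(identical(counts, expected))
counts
```

The point of the design is that a single vectorized %in% over the flattened, tagged strings replaces millions of per-element intersect() calls.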

This seems to be faster than

   sum(as(l1, "CharacterList") %in% as(l2, "CharacterList"))

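(the x1 or as() could be surrounded by unique() if the elements are not already
unique; if x1 is de-duplicated, i1 needs to be subset the same way, e.g. with
!duplicated(x1), so that the two stay aligned).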
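
To make the remark about duplicates concrete, here is a small sketch, again on hypothetical toy data with plain character vectors standing in for DNAStringSetLists. It shows how a repeated allele inflates the count, and one possible way to de-duplicate the tagged strings while keeping the index aligned.

```r
# Toy lists in which the first element of l1 repeats the allele "A"
# (hypothetical data, plain character vectors for illustration).
l1 <- list(c("A", "A", "T"), c("G"))
l2 <- list(c("A", "C"), c("G", "G"))

i1 <- rep(seq_along(l1), lengths(l1))
i2 <- rep(seq_along(l2), lengths(l2))
x1 <- paste(unlist(l1), i1, sep = ":")
x2 <- paste(unlist(l2), i2, sep = ":")

# Without de-duplication the repeated "A" is counted twice.
raw <- tabulate(i1[x1 %in% x2], nbins = length(l1))

# De-duplicate the tagged strings and subset the index the same way.
keep <- !duplicated(x1)
dedup <- tabulate(i1[keep][x1[keep] %in% x2], nbins = length(l1))

raw
dedup
```

Duplicates in x2 do not matter here, since %in% only asks whether a match exists.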


> Vince

Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
