[BioC] Comparing DNAStringSetLists

Wed Oct 16 18:20:23 CEST 2013

Right, this is actually quite generic and would work on any
CompressedList derivative with an elementType that supports c(), [,
and match() (by default selfmatch() will just call match()). So it
should go to IRanges rather than Biostrings. The other parallel set
operations (punion, psetdiff) could be implemented in the same manner.
Added to my TODO list.
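
To illustrate how generic the approach is, here is a rough base-R sketch of
the same idea on plain lists of character vectors. It is not the IRanges
implementation: pintersect_list() is a hypothetical name, match(v, v) stands
in for selfmatch(), lengths() stands in for elementLengths(), and the
(group, id) pairs are encoded as single numbers instead of going through
matchIntegerPairs().

```r
# Hypothetical helper: parallel intersect on two plain lists (base R only).
pintersect_list <- function(x, y) {
    stopifnot(length(x) == length(y))
    ux <- unlist(x, use.names = FALSE)
    uy <- unlist(y, use.names = FALSE)
    # Stand-in for selfmatch(): each value gets the index of its first occurrence.
    all_values <- c(ux, uy)
    id <- match(all_values, all_values)
    ux_id <- id[seq_along(ux)]
    uy_id <- id[seq_along(uy) + length(ux)]
    # Which list element each flattened value came from.
    ux_group <- rep.int(seq_along(x), lengths(x))
    uy_group <- rep.int(seq_along(y), lengths(y))
    # Encode each (group, id) pair as one number so pairs can be compared.
    n_id <- length(all_values)
    ux_key <- (ux_group - 1) * n_id + ux_id
    uy_key <- (uy_group - 1) * n_id + uy_id
    # Keep values of 'x' present in the same element of 'y', without duplicates.
    keep <- ux_key %in% uy_key & !duplicated(ux_key)
    # Explicit factor levels preserve empty groups in the result.
    unname(split(ux[keep], factor(ux_group[keep], levels = seq_along(x))))
}

alleles1 <- list("A", c("C", "A"), c("G", "A", "T"), c("T", "G"))
alleles2 <- list(c("T", "A", "G"), c("A", "G"), "C", c("G", "T"))
res <- pintersect_list(alleles1, alleles2)
lengths(res)
```

In this sketch a psetdiff-like variant would presumably only need the keep
condition changed to !(ux_key %in% uy_key).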

Yes, togroup(x) does rep.int(seq_along(x), elementLengths(x)) but
it's cleaner to use the former.
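
For readers without the packages loaded, the grouping vector can be pictured
with a small base-R sketch. The plain list and lengths() are stand-ins
(assumptions) for a DNAStringSetList and elementLengths().

```r
# Plain list standing in for a list-like object with 4 elements (assumption).
x <- list("A", c("C", "A"), c("G", "A", "T"), c("T", "G"))
# One group index per flattened value: element i is repeated lengths(x)[i] times.
group <- rep.int(seq_along(x), lengths(x))
group
# [1] 1 2 2 3 3 3 4 4
```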

H.

On 10/16/2013 06:31 AM, Michael Lawrence wrote:
> And I guess this also works for CharacterList, and a bit easier for
> IntegerList.
>
>
> On Wed, Oct 16, 2013 at 6:27 AM, Michael Lawrence <michafla at gene.com> wrote:
>
>     Neat. I was just going to suggest using matchIntegerPairs().
>
>     You could have used togroup(x) for this right?
>     ux_group <- rep.int(seq_along(x), elementLengths(x))
>
>
>     On Tue, Oct 15, 2013 at 10:45 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
>
>         Hi Vince,
>
>         Sounds like maybe you have a use case for a pintersect method for
>         DNAStringSetList objects:
>
>            setMethod("pintersect", c("DNAStringSetList", "DNAStringSetList"),
>              function(x, y, ...)
>              {
>                  if (length(x) != length(y))
>                      stop("'x' and 'y' must have the same length")
>
>                  ## Flatten both lists and give each distinct string an id.
>                  ux <- unlist(x, use.names=FALSE)
>                  uy <- unlist(y, use.names=FALSE)
>                  string_id <- selfmatch(c(ux, uy))
>                  ux_id <- string_id[seq_along(ux)]
>                  uy_id <- string_id[seq_along(uy) + length(ux)]
>                  ## Record which list element each string came from.
>                  ux_group <- rep.int(seq_along(x), elementLengths(x))
>                  uy_group <- rep.int(seq_along(y), elementLengths(y))
>                  ## Keep strings of 'x' found in the same element of 'y'.
>                  m <- IRanges:::matchIntegerPairs(ux_group, ux_id,
>                                                   uy_group, uy_id)
>                  keep_idx <- which(!is.na(m))
>                  ux <- ux[keep_idx]
>                  ux_group <- ux_group[keep_idx]
>                  ux_id <- ux_id[keep_idx]
>                  ## Drop duplicates within each element.
>                  sm <- IRanges:::selfmatchIntegerPairs(ux_group, ux_id)
>                  keep_idx <- sm == seq_along(sm)
>                  ux <- ux[keep_idx]
>                  ux_group <- ux_group[keep_idx]
>                  ans_skeleton <- PartitioningByEnd(ux_group, NG=length(x))
>                  relist(ux, ans_skeleton)
>              }
>            )
>
>         Then:
>
>            > alleles1 <- DNAStringSetList("A", c("C", "A"), c("G", "A", "T"), c("T", "G"))
>
>            > alleles2 <- DNAStringSetList(c("T", "A", "G"), c("A", "G"), "C", c("G", "T"))
>
>            > pintersect(alleles1, alleles2)
>            DNAStringSetList of length 4
>            [[1]] A
>            [[2]] A
>            [[3]]   A DNAStringSet instance of length 0
>            [[4]] T G
>
>         Should take about 20 sec. on your 12-million long vectors.
>
>         Then use elementLengths() on this result to get the lengths of the
>         intersecting sets.
>
>         HTH,
>         H.
>
>
>
>         On 10/15/2013 04:16 PM, Vince S. Buffalo wrote:
>
>             Hi All,
>
>             I have two vectors of alleles stored as DNAStringSetLists.
>             For each element
>             in both lists, I need to find the length of the intersecting
>             set. Using
>             mapply() and intersect() takes too long, as does
>             sapply(dna.set.list,
>             as.character) (and then using mclapply or lapply to find
>             intersect on
>             characters). Is there a fast way to do this? I have vectors
>             ~12 million
>             rows long.
>
>             Vince
>
>
>         --
>         Hervé Pagès
>
>         Program in Computational Biology
>         Division of Public Health Sciences
>
>         Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N, M1-B514
>         P.O. Box 19024
>         Seattle, WA 98109-1024
>
>         E-mail: hpages at fhcrc.org
>         Phone: (206) 667-5791
>         Fax: (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319