[BioC] countMatches() (was: table for GenomicRanges)

Hervé Pagès hpages at fhcrc.org
Mon Jan 7 20:00:22 CET 2013


Hi Michael,

I don't think "match" (the word) always has to mean "equality" either.
However having match() (the function) do "whole exact matching" (aka
"equality") for any kind of vector-like object has the advantage of:

   (a) making it consistent with base::match() (?base::match is pretty
       explicit about what the contract of match() is)

   (b) preserving its relationship with ==, duplicated(), unique(),
       etc...

   (c) not frustrating the user who needs something to do exact
       matching on ranges (as I mentioned previously, if you take
       match() away from him/her, s/he'll be left with nothing).

IMO those advantages counterbalance *by far* the very little
convenience you get from having 'match(query, subject)' do
'findOverlaps(query, subject, select="first")' on
IRanges/GRanges objects. If you need to do that, just use the
latter, or, if you think that's still too much typing, define
a wrapper e.g. 'ovmatch(query, subject)'.

There are plenty of specialized tools around for doing
inexact/fuzzy/partial/overlap matching for many particular types
of vector-like objects: grep() and family, pmatch(), charmatch(),
agrep(), grepRaw(), matchPattern() and family, findOverlaps() and
family, findIntervals(), etc... For the reasons I mentioned
above, none of them should hijack match() to make it do some
particular type of inexact matching on some particular type of
objects. Even if, for that particular type of objects, doing that
particular type of inexact matching is more common than doing
exact matching.

H.


On 01/06/2013 05:39 PM, Michael Lawrence wrote:
> I think having overlapsAny is a nice addition and helps make the API
> more complete and explicit. Are you sure we need to change the behavior
> of the match method for this relatively uncommon use case?

Yes because otherwise users with a use case of doing match()

even if it's uncommon,

> I don't think
> "match" always has to mean "equality". It is a more general concept in
> my mind. The most common use case for matching ranges is overlap.

Of course "match" doesn't always have to mean equality. But of base

>
> Michael
>
>
> On Fri, Jan 4, 2013 at 8:34 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
>     Yes 'peaks %in% genes' is cute and was probably doing the right thing
>     for most users (although not all). But 'exons %in% genes' is cute too
>      and was probably doing the wrong thing for all users. Advanced users
>     like you guys would have no problem switching to
>
>        !is.na <http://is.na>(findOverlaps(peaks, genes, type="within",
>     select="any"))
>
>     or
>
>        !is.na <http://is.na>(findOverlaps(peaks, genes, type="equal",
>     select="any"))
>
>     in case 'peaks %in% genes' was not doing exactly what you wanted,
>     but most users would not find this particularly friendly. Even
>     worse, some users probably didn't realize that 'peaks %in% genes'
>     was not doing exactly what they thought it did because "peaks in
>     genes" in English suggests that the peaks are within the genes,
>     but it's not what 'peaks %in% genes' does.
>
>     Having overlapsAny(), with exactly the same extra arguments as
>     countOverlaps() and subsetByOverlaps() (i.e. 'maxgap', 'minoverlap',
>     'type', 'ignore.strand'), all of them documented (and with most
>     users more or less familiar with them already) has the virtue to
>     expose the user to all the options from the very start, and to
>     help him/her make the right choice. Of course there will be users
>     that don't want or don't have the time to read/think about all the
>     options. Not a big deal: they'll just do 'overlapsAny(query, subject)',
>     which is not a lot more typing than 'query %in% subject', especially
>     if they use tab completion.
>
>     It's true that it's more common to ask questions about overlap than
>     about equality but there are some use cases for the latter (as the
>     original thread shows). Until now, when you had such a use case, you
>     could not use match() or %in%, which would have been the natural things
>     to use, because they got hijacked to do something else, and you were
>     left with nothing. Not a satisfying situation. So at a minimum, we
>     needed to restore the true/real/original semantic of match() to do
>     "equality" instead of "overlap". But it's hard to do this for match()
>     and not do it for %in% too. For more than 99% of R users, %in% is
>     just a simple wrapper for 'match(x, table, nomatch = 0) > 0' (this
>     is how it has been documented and implemented in base R for many
>     years). Not maintaining this relationship between %in% and match()
>     would only cause grief and frustration to newcomers to Bioconductor.
>
>     H.
>
>
>
>     On 01/04/2013 03:32 PM, Cook, Malcolm wrote:
>
>         Hiya again,
>
>         I am definitely a late comer to BioC, so I definitely easily
>         defer to
>         the tide of history.
>
>         But I do think you miss my point Michael about the proposed change
>         making the relationship between %in% and match for {G,I}Ranges{List}
>         mimic that between other vectors, and I do think that changing
>         the API
>         would make other late-comers take to BioC easier/faster.
>
>         That said, I NEVER use %in% so I really have no stake in the
>         matter, and
>         I DEFINITELY appreciate the argument to not changing the API
>         just for
>         sematic sweetness.
>
>         That that said, Herve is _/so good/_ about deprecations and warnings
>
>         that make such changes fairly easily digestible.
>
>         That that that.... enough.... I bow out of this one....!!!!
>
>         Always learning and Happy New Year to all lurkers,
>
>         ~Malcolm
>
>         *From:*Michael Lawrence [mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>]
>         *Sent:* Friday, January 04, 2013 5:11 PM
>         *To:* Cook, Malcolm
>         *Cc:* Sean Davis; Michael Lawrence; Hervé Pagès
>         (hpages at fhcrc.org <mailto:hpages at fhcrc.org>); Tim
>
>         Triche, Jr.; Vedran Franke; bioconductor at r-project.org
>         <mailto:bioconductor at r-project.org>
>         *Subject:* Re: [BioC] countMatches() (was: table for GenomicRanges)
>
>
>         On Fri, Jan 4, 2013 at 1:56 PM, Cook, Malcolm <MEC at stowers.org
>         <mailto:MEC at stowers.org>
>         <mailto:MEC at stowers.org <mailto:MEC at stowers.org>>> wrote:
>
>         Hiya,
>
>         For what it is worth...
>
>         I think the change to %in% is warranted.
>
>         If I understand correctly, this change restores the relationship
>         between
>         the semantics of `%in` and the semantics of `match`.
>
>           From the docs:
>
>             '"%in%" <- function(x, table) match(x, table, nomatch = 0) > 0'
>
>         Herve's change restores this relationship.
>
>
>         match and %in% were initially consistent (both considering any
>         overlap);
>         Herve has changed both of them together. The whole idea behind
>         IRanges
>         is that ranges are special data types with special semantics. We
>         have
>         reimplemented much of the existing R vector API using those
>         semantics;
>         this extends beyond match/%in%. I am hesitant about making such
>         sweeping
>         changes to the API so late in the life-cycle of the package.
>         There was a
>         feature request for a way to count identical ranges in a set of
>         ranges.
>         Let's please not get carried away and start redesigning the API
>         for this
>         one, albeit useful, request. There are all sorts of
>         inconsistencies in
>         the API, and many of them were conscious decisions that considered
>         practical use cases.
>
>         Michael
>
>
>              Herve, I suspect you were you as a result able to
>         completely drop
>              all the `%in%,BiocClass1,BiocClass2` definitions and depend
>         upon
>              base::%in%
>
>              Am I right?
>
>              If so, may I suggest that Herve stay the course, with the
>         addition of
>                 '"%ol%" <- function(a, b) findOverlaps(a, b, maxgap=0L,
>              minoverlap=1L, type='any', select='all') > 0'
>
>              This would provide a perspicacious idiom, thereby
>         optimizing the API
>              for Michaels observed common use case.
>
>              Just sayin'
>
>              ~Malcolm
>
>
>                .-----Original Message-----
>                .From: bioconductor-bounces at r-__project.org
>         <mailto:bioconductor-bounces at r-project.org>
>              <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>
>              [mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>
>
>              <mailto:bioconductor-bounces at __r-project.org
>         <mailto:bioconductor-bounces at r-project.org>>] On Behalf Of Sean
>         Davis
>                .Sent: Friday, January 04, 2013 3:37 PM
>                .To: Michael Lawrence
>                .Cc: Tim Triche, Jr.; Vedran Franke;
>         bioconductor at r-project.org <mailto:bioconductor at r-project.org>
>              <mailto:bioconductor at r-__project.org
>         <mailto:bioconductor at r-project.org>>
>
>                .Subject: Re: [BioC] countMatches() (was: table for
>         GenomicRanges)
>                .
>                .On Fri, Jan 4, 2013 at 4:32 PM, Michael Lawrence
>                .<lawrence.michael at gene.com
>         <mailto:lawrence.michael at gene.com>
>         <mailto:lawrence.michael at gene.__com
>         <mailto:lawrence.michael at gene.com>>> wrote:
>                .> The change to the behavior of %in% is a pretty big
>         one. Are you
>              thinking
>                .> that all set-based operations should behave this way? For
>              example, setdiff
>                .> and intersect? I really liked the syntax of "peaks
>         %in% genes".
>              In my
>                .> experience, it's way more common to ask questions
>         about overlap
>              than about
>                .> equality, so I'd rather optimize the API for that use
>         case. But
>              again,
>                .> that's just my personal bias.
>                .
>                .For what it is worth, I share Michael's personal bias here.
>                .
>                .Sean
>                .
>                .
>                .> Michael
>                .>
>                .>
>                .> On Fri, Jan 4, 2013 at 1:11 PM, Hervé Pagès
>         <hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>              <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>> wrote:
>                .>
>                .>> Hi,
>                .>>
>                .>> I added findMatches() and countMatches() to the
>         latest IRanges /
>                .>> GenomicRanges packages (in BioC devel only).
>                .>>
>                .>>   findMatches(x, table): An enhanced version of
>         ‘match’ that
>                .>>           returns all the matches in a Hits object.
>                .>>
>                .>>   countMatches(x, table): Returns an integer vector
>         of the length
>                .>>           of ‘x’, containing the number of matches in
>         ‘table’ for
>                .>>           each element in ‘x’.
>                .>>
>
>                .>> countMatches() is what you can use to
>         tally/count/tabulate
>              (choose your
>
>                .>> preferred term) the unique elements in a GRanges object:
>                .>>
>                .>>   library(GenomicRanges)
>                .>>   set.seed(33)
>                .>>   gr <- GRanges("chr1",
>         IRanges(sample(15,20,replace=*__*TRUE),
>              width=5))
>                .>>
>                .>> Then:
>                .>>
>                .>>   > gr_levels <- sort(unique(gr))
>                .>>   > countMatches(gr_levels, gr)
>                .>>    [1] 1 1 1 2 4 2 2 1 2 2 2
>                .>>
>                .>> Note that findMatches() and countMatches() also work on
>              IRanges and
>                .>> DNAStringSet objects, as well as on ordinary atomic
>         vectors:
>                .>>
>                .>>   library(hgu95av2probe)
>                .>>   library(Biostrings)
>                .>>   probes <- DNAStringSet(hgu95av2probe)
>                .>>   unique_probes <- unique(probes)
>                .>>   count <- countMatches(unique_probes, probes)
>                .>>   max(count)  # 7
>                .>>
>                .>> I made other changes in IRanges/GenomicRanges so that
>         the notion
>                .>> of "match" between elements of a vector-like object now
>              consistently
>                .>> means "equality" instead of "overlap", even for
>         range-based
>              objects
>                .>> like IRanges or GRanges objects. This notion of
>         "equality" is the
>                .>> same that is used by ==. The most visible consequence
>         of those
>                .>> changes is that using %in% between 2 IRanges or
>         GRanges objects
>                .>> 'query' and 'subject' in order to do overlaps was
>         replaced by
>                .>> overlapsAny(query, subject).
>                .>>
>                .>>   overlapsAny(query, subject): Finds the ranges in
>         ‘query’ that
>                .>>      overlap any of the ranges in ‘subject’.
>                .>>
>
>                .>> There are warnings and deprecation messages in place
>         to help
>              smooth
>
>                .>> the transition.
>                .>>
>                .>> Cheers,
>                .>> H.
>                .>>
>                .>> --
>                .>> Hervé Pagès
>                .>>
>                .>> Program in Computational Biology
>                .>> Division of Public Health Sciences
>                .>> Fred Hutchinson Cancer Research Center
>                .>> 1100 Fairview Ave. N, M1-B514
>                .>> P.O. Box 19024
>                .>> Seattle, WA 98109-1024
>                .>>
>                .>> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>         <mailto:hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>                .>> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>         <tel:%28206%29%20667-5791>
>                .>> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>         <tel:%28206%29%20667-1319>
>
>                .>>
>                .>
>                .>         [[alternative HTML version deleted]]
>                .>
>                .>
>                .> _________________________________________________
>                .> Bioconductor mailing list
>                .> Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>
>                .> https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>                .> Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>                .
>                ._________________________________________________
>                .Bioconductor mailing list
>                .Bioconductor at r-project.org
>         <mailto:Bioconductor at r-project.org>
>         <mailto:Bioconductor at r-__project.org
>         <mailto:Bioconductor at r-project.org>>
>
>                .https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>                .Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>     --
>     Hervé Pagès
>
>     Program in Computational Biology
>     Division of Public Health Sciences
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N, M1-B514
>     P.O. Box 19024
>     Seattle, WA 98109-1024
>
>     E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>     Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
>     Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list