[BioC] distances for IRanges
Kasper Daniel Hansen
kasperdanielhansen at gmail.com
Wed Jun 9 05:00:56 CEST 2010
On Tue, Jun 8, 2010 at 9:53 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:
> On 06/09/2010 02:36 AM, Kasper Daniel Hansen wrote:
>> Hi Michael
>>
>> Michael,
>>
>> Thanks for pointing out nearest and friends; I agree that this
>> function should address my question.
>>
>> Reading the man page for nearest function, might I suggest an
>> additional argument like
>> multihits = c("arbitrary", "all")
>
> yes thanks for the timely nearest hint; for me I was hoping that ties
> could be decided based on maximum overlap (though ties might still
> occur, and perhaps I could break ties myself if 'all' were returned
> without too much difficulty).
I would strongly advocate just returning the multiple hits and letting
the user decide how to break ties. While I certainly see use cases where
Martin's suggestion is reasonable, what happens if a range overlaps two
different ranges that have very different widths? Should "maximum
overlap" be counted in number of integers (in which case you might be
biased towards the longer of the two ranges), as a percentage of the
range length, or in some other way? I think allowing for all these
possibilities would make for a very complicated interface with little
gain.
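
To make the ambiguity concrete, here is a minimal sketch in base R (plain integer coordinates, not IRanges objects; all values are made up for illustration). One query range overlaps a long and a short range, and the two measures of "maximum overlap" pick different winners:

```r
# Hypothetical query range [10, 19] and two subject ranges of very
# different width: [1, 12] (width 12) and [18, 20] (width 3).
q_start <- 10
q_end   <- 19
s_start <- c(1, 18)
s_end   <- c(12, 20)

# Overlap counted in number of integer positions shared with the query.
ov_width <- pmax(0, pmin(q_end, s_end) - pmax(q_start, s_start) + 1)

# Overlap as a fraction of each subject range's own width.
ov_frac <- ov_width / (s_end - s_start + 1)

best_by_width <- which.max(ov_width)  # favours the longer subject range
best_by_frac  <- which.max(ov_frac)   # favours the shorter subject range
```

With these numbers the count-based rule picks the first range and the fraction-based rule picks the second, so a single built-in tie-breaker would have to choose one convention for everybody.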
One might conceive of a filter function on top of the return object
(which could work for findOverlaps as well), but the filtering would in
some cases depend on additional metadata (say the ranges are of
different types and there is an ordering on the types, so that an
overlap with one type trumps an overlap with another), so it might be a
lot of infrastructure with very little gain. And really, the filtering
would just be an apply over the sparse matrix.
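
As a rough illustration of what such user-side filtering could look like, here is a sketch in base R. It assumes the hits are available as a two-column (query, subject) index matrix, which is how I read the "sparse matrix" above; the hit pairs, the type labels and their ranking are all hypothetical:

```r
# Hypothetical hit pairs: query 1 hits subjects 1 and 2, query 2 hits
# subject 2, query 3 hits subjects 1 and 3.
hits <- cbind(query   = c(1L, 1L, 2L, 3L, 3L),
              subject = c(1L, 2L, 2L, 1L, 3L))

# Hypothetical metadata: one type per subject range, plus an ordering
# on the types (a lower rank trumps a higher one).
subject_type <- c("exon", "intron", "promoter")
type_rank    <- c(promoter = 1, exon = 2, intron = 3)

# Rank of the subject type for every hit row.
rank_of_hit <- type_rank[subject_type[hits[, "subject"]]]

# For each query, keep the row whose subject type ranks best.
rows_by_query <- split(seq_len(nrow(hits)), hits[, "query"])
keep <- unlist(lapply(rows_by_query,
                      function(rows) rows[which.min(rank_of_hit[rows])]))
best <- hits[keep, , drop = FALSE]
```

Here each query ends up with exactly one subject, chosen by the type ordering, and swapping in a different rule only means changing the function passed to lapply().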
Kasper
>
> Martin
>
>> with the intention that a user can get full information in case one
>> range overlaps (or ties in distance) with multiple other ranges. The
>> return value could be a sparse matrix, findOverlaps-like. I find it
>> important to know about multiple hits, especially in the case when a
>> range has multiple overlaps.
>>
>> Thanks,
>> Kasper
>>
>> On Tue, Jun 8, 2010 at 1:25 PM, Michael Lawrence
>> <lawrence.michael at gene.com> wrote:
>>> For all pairwise distances, something simple based on outer() should
>>> suffice. It might not be very space efficient, but speed should be somewhat
>>> close to optimal.
>>>
>>> What is the end goal of this? For example, the nearest() function finds
>>> nearest neighbors efficiently.
>>>
>>> You might be able to leverage findOverlaps(). For example, one can set the
>>> maximum gap between ranges to be considered overlapping. That could be set
>>> to a non-zero value representing some maximum allowable distance. The sparse
>>> doublet matrix from as.matrix() would be pretty efficient for distance
>>> calculation, via the pgap() function.
>>>
>>> Michael
>>>
>>> On Tue, Jun 8, 2010 at 8:51 AM, Kasper Daniel Hansen
>>> <kasperdanielhansen at gmail.com> wrote:
>>>>
>>>> Assuming I have two IRanges, each with multiple ranges, like
>>>> ir1 = IRanges(start = 3:6, width = 2)
>>>> ir2 = IRanges(start = 10:17, width = 2)
>>>>
>>>> Is there a fast way to compute a pairwise distance matrix between the
>>>> two sets, by which I mean
>>>> ii = 1
>>>> jj = 2
>>>> width(gaps(c(ir1[ii], ir2[jj])))
>>>> where ii, jj would index into a result matrix. Essentially this would
>>>> be an expanded version of findOverlaps, since any two ranges with
>>>> distance = 0, have an overlap.
>>>>
>>>> Is such functionality available in IRanges, in an efficient
>>>> implementation (think of the case where the two IRanges have - say -
>>>> 10,000 ranges or more)?
>>>>
>>>> Kasper
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>>
>>
>
>
> --
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>