[BioC] stranded findOverlaps

Mon Jan 25 20:02:56 CET 2010

On Jan 25, 2010, at 12:56 PM, Michael Lawrence wrote:

> On Fri, Jan 22, 2010 at 11:41 AM, Robert Castelo <robert.castelo at upf.edu>wrote:
> 
>> dear list, and particularly, the IRanges developers,
>> 
>> i'm using the function findOverlaps from the IRanges package because i
>> need to find what stranded genomic intervals from one set (as a
>> RangedData object) overlap with what stranded genomic intervals from
>> another set (as another RangedData object). the problem is that i don't
>> what to consider overlaps between genomic intervals from different
>> strands.
>> 
>> i've been looking to the help page of findOverlaps (devel version, see
>> my sessionInfo() below) and searched through the BioC mailinglist and my
>> preliminary conclusion is that such an operation is not yet supported.
>> 
>> i've been thinking of using rdapply to break down the RangedData objects
>> into spaces and then again by the two strands but the problem is that
>> the query and subject indexes resulting of findOverlaps will not match
>> the dimension of the original RangedData objects.
>> 
>> so, i'd like to suggest that some option is added to this useful
>> function to restrict the overlapping search by strand. of course, if
>> this is somehow already implemented and i just missed it, then i'll be
>> very grateful if you let me know what function/parameter i should be
>> using.
>> 
>> 
> Well, IRanges knows nothing about Biology, so a 'strand' option would be out
> of place, in my opinion. That said, I can think of at least two approaches.
> 
> 1) Simply filter the results for matches that are the the same strand. This
> is something as simple as:
> result <- findOverlaps(a, b)
> mat <- as.matrix(result)
> mat <- mat[a$strand[mat[,1L]] == b$strand[mat[,2L]],]
> 
> 2) Out of recognition that we are really treating the two strands as
> separate spaces, break down the RangedData into chrom*strand spaces, as in:
> rd <- RangedData(...)
> rd <- do.call(c, split(rd, rd$strand))
> result <- findOverlaps(rd, ...)
> ## then maybe eventually go back chromosome spaces
> rds <- split(rd, rd$strand)
> names(rds[[1]]) <- chromNames
> names(rds[[2]]) <- chromNames
> rd <- do.call(rbind, rds)
> 
> The second approach would be very convenient if you always want to treat the
> strands separately. The separation could be specified at construction time,
> e.g.:
> RangedData(ranges, strand, space = interaction(chrom, strand))
> 
> But in general neither of these are awfully convenient, and I've always had
> the suspicion that we'd eventually need multiple space variables. Yes, we
> could add some argument to the findOverlaps method for RangedData that takes
> a vector of variable names for splitting into subspaces, but I think we
> would want a more general solution, where the RangedData itself has the
> notion of subspaces. This would be a non-trivial change. Would it behave
> like a nested list in some ways?
> 
> Hopefully others have better ideas...

We will need good support for stranded genomic intervals.  This is a very important case to handle, and will be even more important in the future where a number of assays will be stranded.  We need support for doing operations on such objects, both ignoring strand and not ignoring strand.

An example could be that we take (stranded) genome annotation and what to perform a per-chromosome reduce().  Users might want to do a reduce respecting strand information where we would get one IRanges per chromosome * strand or we might want to do a reduce(anno , ignoreStrand = TRUE) which yields one IRanges per chromosome.

I agree that the general design might be to allow for any number of nested subspaces, but we do have a very important special case where we know that the second level of nestedness only have two components.  I believe a lot of value would be gained from being able to operate easily on such objects.

Kasper