[BioC] stranded findOverlaps

Robert Castelo robert.castelo at upf.edu
Tue Feb 2 11:47:49 CET 2010


dear Michael D., Julien, Michael L., Steve, Kasper and Patrick,

thanks for all your input into this particular problem, i've
successfully used one of your solutions but from the discussion it seems
that this particular functionality may have a specific treatment in the
future, let me summarize your contributions for anybody else interested
into this and give my modest opinion as an end user of IRanges:

** Michael Dondrup:

On Fri, 2010-01-22 at 23:47 +0100, Michael.Dondrup at uni.no wrote:
> Following approach might be simple enough though (sorry no code
> example):
> - sort the data set of ranges (alignments, genes, sequencing reads)  
> into two groups by their strand (I assume you have this info
> somewhere)
> - construct two IRanges objects per set (aka query, reference), one  
> for plus one for minus
> - make one IRangesList per set, add corresponding IRanges objects,  
> name them "plus" and "minus" in the list
> - compute the overlap of the IRangesLists ( aka.: overlap(set1,
> set2) )
>   -> you'll get the overlaps in strand
> if you have chromosomes you construct two IRanges per chromosome in
> set

i also think this would work and it makes sense to me, i've however by
now chosen a quicker shortcut to the problem as you'll see below.

** Julien Gagneur:

On Sat, 2010-01-23 at 15:21 +0100, Julien Gagneur wrote: 
> for dealing with genomic intervals, you can also consider the
> genomeIntervals package. A class for stranded genomic intervals is
> available together with an interval overlap function that behaves
> in a strand-specific manner.

thanks for the information, although in my particular case since i
need also to use the rtracklayer package that works with RangedData
objects i prefer by now to stick to the IRanges functionalities as
long as i can get around the problems i find.

** Michael Lawrence:

On Mon, 2010-01-25 at 09:56 -0800, Michael Lawrence wrote:
> Well, IRanges knows nothing about Biology, so a 'strand' option would
> be out of place, in my opinion. That said, I can think of at least two
> approaches.
> 
> 1) Simply filter the results for matches that are the the same strand.
> This is something as simple as:
> result <- findOverlaps(a, b)
> mat <- as.matrix(result)
> mat <- mat[a$strand[mat[,1L]] == b$strand[mat[,2L]],]

yesss, i found this the quickest solution for me and it works fine,
thanks!

> 2) Out of recognition that we are really treating the two strands as
> separate spaces, break down the RangedData into chrom*strand spaces,
> as in:
> rd <- RangedData(...)
> rd <- do.call(c, split(rd, rd$strand))
> result <- findOverlaps(rd, ...)
> ## then maybe eventually go back chromosome spaces
> rds <- split(rd, rd$strand)
> names(rds[[1]]) <- chromNames
> names(rds[[2]]) <- chromNames
> rd <- do.call(rbind, rds)
> 
> The second approach would be very convenient if you always want to
> treat the strands separately. The separation could be specified at
> construction time, e.g.:
> RangedData(ranges, strand, space = interaction(chrom, strand))

i guess this is similar to what Michael Dondrup was suggesting.

> But in general neither of these are awfully convenient, and I've
> always had the suspicion that we'd eventually need multiple space
> variables. Yes, we could add some argument to the findOverlaps method
> for RangedData that takes a vector of variable names for splitting
> into subspaces, but I think we would want a more general solution,
> where the RangedData itself has the notion of subspaces. This would be
> a non-trivial change. Would it behave like a nested list in some
> ways? 

IMO this is probably the right and sound (yet sophisticated) general
perspective but which could be simplified if we would consider "strand"
as something more important than yet another space variable.

On Mon, 2010-01-25 at 13:47 -0800, Michael Lawrence wrote:
> It seems that RangedData (with a strand variable) has fallen into this
> role within the IRanges framework. At one point, there was a
> GenomicData subclass that did allow for special strand options.
> Unfortunately, the GenomicData() convenience constructor of RangedData
> is all that is left of that. So we could bring that back. Whatever
> happened to the proposed GenomeRanges package?

if, as you say above, IRanges should not know anything about biology,
some other package using it should know about it, i'd say.

** Steve Lianoglou:

On Mon, 2010-01-25 at 13:57 -0500, Steve Lianoglou wrote:
> How about defining findOverlaps on "AlignedRead" objects (from the
> ShortRead), and having "easy" ways to create an AlignedRead object out
> of IRanges/RangesList objects (with appropriate additional metadata)?
> 
> I reckon you'd need a juiced up findOverlaps function to add params
> specifying how to (or not) deal with the metadata in the AlignedRead
> objects, though, among other things.

in my case i'm just trying to associate transcript (thus, stranded)
annotations on the genome to stranded probes, so already without having
to go to high-throughput sequencing data, i'd say there is a need for a
stranded findOverlaps.

** Kasper Daniel Hansen:

On Mon, 2010-01-25 at 14:02 -0500, Kasper Daniel Hansen wrote:
> We will need good support for stranded genomic intervals.  This is a
> very important case to handle, and will be even more important in the
> future where a number of assays will be stranded.  We need support
> for doing operations on such objects, both ignoring strand and not
> ignoring strand.

i also support this view.

> An example could be that we take (stranded) genome annotation and what
> to perform a per-chromosome reduce().  Users might want to do a reduce
> respecting strand information where we would get one IRanges per
> chromosome * strand or we might want to do a reduce(anno , 
> ignoreStrand = TRUE) which yields one IRanges per chromosome.

Patrick addressed this some time ago in this post:

https://stat.ethz.ch/pipermail/bioconductor/2009-October/030196.html

> I agree that the general design might be to allow for any number of
> nested subspaces, but we do have a very important special case where
> we know that the second level of nestedness only have two components.
> I believe a lot of value would be gained from being able to operate
> easily on such objects.

fully agreed.

** Patrick Aboyoun:

On Tue, 2010-01-26 at 10:06 -0800, Patrick Aboyoun wrote:
> Michael,
> Before creating a new class to capture the strandedness of a RangedData 
> objects, it would be useful to have a list of methods in which 
> strandedness can be used:
> 
> Inter-interval ops: disjoin, gaps, reduce, range, coverage
> Between interval set ops: intersect, setdiff, union, findOverlaps, %in%, 
> match
> 
> For each of these operations, is strandedness a separate and unique 
> categorization of the data or are there other categories users would 
> like to group intervals during these operations? For example, I added a 
> "by" argument to the reduce method for RangedData because I was in 
> correspondence with someone who wanted to use both "strand" and "score" 
> columns to match during the reduction exercise. My instinct is that all 
> the inter-interval operations could use grouping capabilities and these 
> groupings are fluid where at one pass you might want to use a "score" 
> column and at another you would like to ignore that information. So in 
> short, I would argue to not add any more classes and instead add "by" 
> arguments to all the operations listed above that could support it and 
> for those between interval set operation that couldn't support it use 
> all the columns in the RangedData objects and not just the strand so if 
> you wanted to find the intersection of two RangedData objects, the 
> entire row of data would have to match and not just the intervals or the 
> stranded intervals.

popular online systems for the analysis of genomic data like, for instance,
the UCSC table browser, support stranded operations as long as you submit
stranded data. if you want to ignore strand then your don't include strand
in your submitted data set. i think the possibility of ignoring or not
the strand with the more common, let's say, genomic algebraic operations,
will be a very useful functionality.


thanks again everyone!!
robert.



More information about the Bioconductor mailing list