[BioC] A code kata: evening-out *Ranges ends.

Patrick Aboyoun paboyoun at fhcrc.org
Wed Aug 25 09:16:59 CEST 2010


  Steve,
Riffing off of Martin's post, I would use the match function to do the 
mapping:

fence.posts <- reduce(reads)
ranges(reads) <- ranges(fence.posts)[match(reads, fence.posts)]

It is not as fast as Martin's Rle/cut-based method, but it is more
readable and thus easier to maintain.
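
A self-contained version of the same mapping, using Steve's toy reads from
further down the thread. One caveat: in later IRanges/GenomicRanges releases
match() compares ranges for equality rather than overlap, so the lookup is
spelled here with findOverlaps(..., select = "first"):

library(GenomicRanges)

## toy reads in two "islands", as in Steve's original post
reads <- GRanges(seqnames='chr1', strand='+',
                 ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
                                        sample(100:108, 10, replace=TRUE)),
                                width=sample(18:20, 20, replace=TRUE)))

fence.posts <- reduce(reads)                              # one range per island
hit <- findOverlaps(reads, fence.posts, select="first")   # island index per read
ranges(reads) <- ranges(fence.posts)[hit]                 # even out starts/ends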


Cheers,
Patrick


On 8/24/10 9:17 PM, mtmorgan at fhcrc.org wrote:
> Quoting Steve Lianoglou <mailinglist.honeypot at gmail.com>:
>
>> In case anyone else is trying to sharpen their IRanges-fu, here's a
>> possible solution to my own question -- works much faster.
>>
>> Instead of:
>>
>>>  reads <- GRanges(seqnames='chr1', strand='+',
>>>                   ranges=IRanges(start=c(sample(18:22, 10, replace=TRUE),
>>>                                          sample(100:108, 10, replace=TRUE)),
>>>                                  width=sample(18:20, 20, replace=TRUE)))
>>>
>>> You can see there are two "islands" of reads there, at around
>>> positions 20-40 and 100-120.
>>> What I want to do is "fix" the reads in each island to have the
>>> same start/end position.
>>>
>>> Here's code I have to do that (forget any "strand" issues here --
>>> those are already taken care of):
>>>
>>>  fence.posts <- reduce(reads)
>>>  o <- findOverlaps(fence.posts, reads)
>
> Just
>
>   ranges(reads)[subjectHits(o)] <- ranges(fence.posts)[queryHits(o)]
>
> ? (I would have done findOverlaps(reads, fence.posts), but...)
>
>>>  subjects <- subjectHits(o)
>>>  query <- queryHits(o)
>>>
>>>  for (idx in unique(query)) {
>>>    adjust <- subjects[query == idx]
>>>    start(reads[adjust]) <- start(fence.posts[idx])
>>>    end(reads[adjust]) <- end(fence.posts[idx])
>>>  }
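
A sketch of the vectorized replacement Martin describes above, in the
findOverlaps(reads, fence.posts) orientation he mentions (assuming reads is
the GRanges built earlier and GenomicRanges is loaded). Every read overlaps
exactly one reduced range, so a single assignment replaces the loop:

fence.posts <- reduce(reads)
o <- findOverlaps(reads, fence.posts)   # one hit per read
ranges(reads)[queryHits(o)] <- ranges(fence.posts)[subjectHits(o)]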
>>
>> I'm doing:
>>
>> covr <- coverage(reads)[[1]]
>> fences <- slice(covr, 1)
>
> or
>
>   s = covr >= 1
>   breaks = cumsum(runLength(s))[!runValue(s)]
>   post = cut(start(reads), c(breaks, Inf), labels=FALSE)
>   ranges(reads) = ranges(fence.posts)[post]
>
> ? but the method above keeps track of sequences...
>
> Martin
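
For readers new to Rle, an annotated restatement of the cut-based mapping
(assuming covr and fence.posts from the snippets above, and that the coverage
vector starts with an uncovered stretch ahead of the first island, as it does
for the toy data):

s <- covr >= 1                                   # TRUE at covered positions
breaks <- cumsum(runLength(s))[!runValue(s)]     # last position of each gap run
post <- cut(start(reads), c(breaks, Inf),        # island index for each read,
            labels=FALSE)                        #   from where its start falls
ranges(reads) <- ranges(fence.posts)[post]       # even out starts/ends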
>
>> f.repeat <- viewMaxs(fences)
>> repacked <- IRanges(start=rep(start(fences), f.repeat),
>>                     end=rep(end(fences), f.repeat))
>> if (length(repacked) != length(reads)) {
>>   stop("Length of repacked reads is not the same as the original")
>> }
>> reads <- GRanges(seqnames=seqnames(reads), ranges=repacked,
>>                  strand=strand(reads))
>>
>> I'm assuming that all the reads come from the same chromosome (so I
>> take the single element of the coverage() result), and I'm only
>> sending reads of the same strand into this part of the function.
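
A sketch of how that per-strand step might be driven over a mixed-strand
object; even_out() is a hypothetical stand-in for the single-strand
evening-out (written here with the overlap-based mapping, but the
coverage-based version above could be dropped in instead):

even_out <- function(gr) {
  fence.posts <- reduce(gr)
  hit <- findOverlaps(gr, fence.posts, select="first")
  ranges(gr) <- ranges(fence.posts)[hit]
  gr
}

## split by strand, even out each piece, and reassemble
evened <- unlist(endoapply(split(reads, strand(reads)), even_out),
                 use.names=FALSE)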
>>
>> Anyway, running this newer version on 1000 tags takes 0.181 seconds,
>> whereas the older version took 20 seconds.
>>
>> Thought some might find it interesting; otherwise, sorry to spam.
>>
>> -steve
>>
>> -- 
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>


