[BioC] IRanges:::coverage() speedup/enchancement

Tue Dec 1 21:31:38 CET 2009

Chuck,
Thanks for the URL and paper reference. I'll take a look at them.

Also thanks for the code snippets. I always find code to be useful user 
feedback. One thing I wanted to point out is that I have tried to 
provide methods for the most commonly used operations in R's base and 
stats packages for Rle objects so users can avoid coercion to dense 
vector representations of their data. In particular, there are split() 
and mean() methods for Rle objects so to get the means for the insertion 
and match types you could use the code

lapply(split(as.vector(jurkat.wider.cover),
             jurkat[jurkat.RL$orig.rows,'type']),
       mean)

The split operation will create an RleList and the lapply(..., mean) 
operation will produce the summary results you had before. There is a 
table method for Rle objects as well, but it currently only support 
univariate summaries:

table(as.vector(jurkat.wider.cover))

I'll add support for bivariate table summaries to my TODO list as well.

Cheers,
Patrick

Charles C. Berry wrote:
> On Tue, 1 Dec 2009, Patrick Aboyoun wrote:
>
>> Chuck,
>> Thanks for the speedup to coverage for large width IRanges objects. I 
>> checked in changes to IRanges 1.5.13 in BioC 2.6 (for use with 
>> R-devel) based on the code you submitted. I'll back port this code to 
>> BioC 2.5 (R 2.10) in the next few days as well.
>
> Thank you so much!
>
>> Just out of curiosity, what is the source of these long width 
>> intervals, where do the weights for these intervals come from, and 
>> what operations do you perform on the resulting coverage vectors?
>
> Re the source of the intervals:
>
> I collaborate with Rick Bushman's group at Upenn 
> (http://microb230.med.upenn.edu/), which studies 'the transfer of 
> genetic information between cells and organisms'.
>
> It is of particular interest to know where a mobile DNA element is 
> preferentially sited in a genome. (And not just for scientific 
> reasons, gene therapy constructs can do serious damage if they land in 
> the wrong place, so designing good gene therapy vectors is aided by 
> being able to predict where a vector tends to land.)
>
> In our papers (for example Berry et al. Selection of Target Sites for 
> Mobile DNA Integration in the Human Genome. PLoS Comput Biol 2(11): 
> e157. doi:10.1371/journal.pcbi.0020157), it emerged that features like 
> 'gene density' (number of genes per base in a window of a specified 
> width), CpG island site density, and DNASE I site density have 
> substantial effects on integration preference. The length scale for 
> 'width' affects the conclusions, and widths greater than a megabase 
> and perhaps to tens of megabases seem to matter.
>
> ---------------
>
> Re the weights:
>
> In performing the calculations of 'expression density' (counts per 
> base of genes expressed above some threshhold value) using Affy 
> arrays, I had to deal with varying numbers of probesets for different 
> genes. It seemed to me that counting all probesets with equal weight 
> would introduce unwanted variation, so I used an inverse weighting 
> scheme so that the sum of weights for each gene would equal one.
>
> Michael's workaround seems reasonable.
>
> ----------------
>
> Re the operations performed on coverage:
>
> You can browse the paper mentioned above to get the idea, but the 
> operations are just a lookup of the coverage values (with some 
> adjustments made for gaps and chromosome ends) followed by data 
> analysis using those values.
>
> In the idiom of the IRanges package (and simplified a bit for easier 
> reading), something like this:
>
> ## jurkat is a data.frame describing HIV integration site locations ## 
> (aka 'insertions') and those of in silico controls ('match')
>
>> xtabs( ~type, jurkat )
> type
> insertion     match
>     39237    116161
>
>
> all.chr <- levels(jurkat$Chr)
>
> ### locations as a RangedData instance:
>
> jurkat.RL <- RangedData(IRanges(start=jurkat$Position,width=1),
>     space=jurkat$Chr, strand=jurkat$Ort,orig.rows=rownames(jurkat))
>
> ### get a 1MBase gene density mapping for geneHuman():
>
> library(GenomicFeatures.Hsapiens.UCSC.hg18)
> knownGenes <- geneHuman()
> known.RL <- with( subset(knownGenes, chrom%in%all.chr ),
>           RangedData( IRanges(start=txStart,en=txEnd),
>                               space=factor(chrom, all.chr ),
>                               strand=factor(strand,levels(jurkat$Ort))))
> wider.RL <- known.RL
> start( wider.RL ) <- start( wider.RL ) - 5e5
> end( wider.RL ) <-  end( wider.RL ) + 5e5
> wider.cover <- coverage( wider.RL, width = seqlengths(Hsapiens)[ 
> names( wider.RL ) ] )
>
> ### now lookup the coverage (aka density) values for the insertions 
> and ### matches:
>
> jurkat.wider.cover <- wider.cover[ ranges(jurkat.RL) ]
>
> ## now match the coverage values back to the parent data.frame and do 
> some ## statistical analysis:
>
> ## simple examples:
>
>> table( as.vector( as.vector( jurkat.wider.cover ) ),
> +    jurkat[ jurkat.RL$orig.rows, 'type' ] )
>
>
>       insertion match
>   0         993 14881
>   1         568  7490
>   2         843  6453
>   3        1039  7292
>   4        1044  6104
>   5        1044  5590
>   6        1045  5086
> [snip]
>
>>  lapply(split(as.vector(as.vector(jurkat.wider.cover)),
> +      jurkat[jurkat.RL$orig.rows,'type']),mean)
>
> $insertion
> [1] 52.6881
>
> $match
> [1] 22.71713
>
>>
>
> Chuck
>
>> Echoing Michael's comments, we haven't supported double precision 
>> weights in coverage calculations in the past because we hadn't 
>> encountered any common use cases for them and there is the workaround 
>> Michael mentioned if the need arose. Providing some context for the 
>> enhancement request would help motivate us to make a change. :)
>>
>>
>> Cheers,
>> Patrick
>>
>>
>>
>> Michael Lawrence wrote:
>>>  On Mon, Nov 30, 2009 at 11:10 AM, Charles C. Berry
>>>  <cberry at tajo.ucsd.edu>wrote:
>>>
>>>
>>> >  The semantics of the IRanges package and especially the 
>>> RangedData class
>>> >  are very apprpriate for some of the applications I deal with.
>>> > >  Unfortunately, coverage() is too slow to be useful to me.
>>> > >  I wonder if the Biocore Team would consider retooling it to 
>>> make it
>>> >  faster? Below I provide a link to a revised coverage.c that might 
>>> >  suffice.
>>> > >  The kind of case I need to handle has width values in 10kbase 
>>> to 10Mbase
>>> >  range. As a toy example, being able to run stuff like
>>> > >       tmp <- coverage( IRanges( start=seq(1,by=1000,length=10000),
>>> >                         width=1e7 ) )
>>> > >  quickly is needed.
>>> > >  A revised version of coverage.c is available at
>>> > >  
>>> http://cabig2.ucsd.edu:8080/Plone/Members/ccberry/software/coverage.c/view 
>>>
>>> > >  It will handle the case above almost instantly (while the 
>>> existing >  version
>>> >  needs about 8 minutes on my machine) and seems about equal to the
>>> >  existing version for cases with width=30.  In the cases I've 
>>> looked at
>>> >  gc() reports the same memory usage.
>>> > >  ---
>>> > >  Also, I wonder if the Biocore Team would entertain allowing the 
>>> 'weight'
>>> >  argument of coverage to be of type double? This would help in 
>>> cases in
>>> >  which downweighting of counts of some genomic features is desired.
>>> > > >
>>>  In many use cases, it's probably sufficient to simply round 
>>> floating point
>>>  numbers to integers after multiplying by a power of 10. That only 
>>> goes so
>>>  far though, so supporting double-precision seems reasonable. The 
>>> type of
>>>  the
>>>  output will simply depend on the type of the weights.
>>>
>>>
>>>
>>>
>>> >  Thanks,
>>> > >  Chuck
>>> > >  --
>>> >  Charles C. Berry                            (858) 534-2098
>>> >                                             Dept of Family/Preventive
>>> >  Medicine
>>> >  E mailto:cberry at tajo.ucsd.edu               UC San Diego
>>> >  http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 
>>> >  92093-0901
>>> > >  _______________________________________________
>>> >  Bioconductor mailing list
>>> >  Bioconductor at stat.math.ethz.ch
>>> >  https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> >  Search the archives:
>>> >  http://news.gmane.org/gmane.science.biology.informatics.conductor
>>> > >
>>>   [[alternative HTML version deleted]]
>>>
>>>  _______________________________________________
>>>  Bioconductor mailing list
>>>  Bioconductor at stat.math.ethz.ch
>>>  https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>  Search the archives:
>>>  http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
>>
>
> Charles C. Berry                            (858) 534-2098
>                                             Dept of Family/Preventive 
> Medicine
> E mailto:cberry at tajo.ucsd.edu                UC San Diego
> http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 
> 92093-0901
>
>