[BioC] subdivideGRanges() to a given number of ranges

Martin Morgan mtmorgan at fhcrc.org
Thu Sep 6 22:33:40 CEST 2012


On 09/06/2012 11:33 AM, d r wrote:
> Thank you Mike and Valarie,
>
> Just to be sure I understood you clearly: I understand there is no
> possibilty to use  subdivideGRanges() to split a GRanges to a given number
> of elements.
> If I understood you correctly, it is also not possible to read a vector
> into the subsize argument. is taht true or is ther some possibilty to do
> that?


Not sure if this is the best way, but for a GRanges instance 'gr' 
dividing it in to 'n' bins, you might start by making sure there won't 
be any zero-width or necessarily overlapping bins


     if (any(width(gr) < n))
         stop("all 'width(gr)' must be >= 'n'")

then figure out the width, in numeric values, of each bin

     d <- width(gr) / n

To make a vector of offsets from the start of each original bin, you 
want to do this in a 'vectorized' fashion, rather than a loop, so I 
replicated each element of d n times, calculated the cumulative sum, and 
then subtracted for each of the d x n elements the value of the 
cumulative sum at the start of each replicate, a little obscure I guess...

     dd <- cumsum(rep(d, each=n))
     mask <- logical(n); mask[1] <- TRUE  # every n'th is base for range
     dd <- dd - rep(dd[mask], each=n)

To get the new starts, replicate the original starts and add the 
offsets, using round() to approximate to the nearest integer value

     starts <- round(rep(start(gr), each=n) + dd)

the ends will be the starts of the next range, minus 1 (the 'L' says to 
represent 0 and 1 as integer values, rather than numeric, and avoids 
unnecessary coercion from integer to numeric and back)

     ends <- c(starts[-1] - 1L, 0L)

except for each of the n'th (including last) elements, which will be the 
end of the original range

     ends[rev(mask)] <- end(gr)

we can then replicate our original data and update the ranges

     gr <- gr[rep(seq_along(gr), each=n)]
     ranges(gr) <- IRanges(starts, ends)

getting the result

     gr

Here's the above as a function

intoNbins <-
     function(gr, n = 10)
{
     if (any(width(gr) < n))
         stop("all 'width(gr)' must be >= 'n'")
     d <- width(gr) / n
     dd <- cumsum(rep(d, each=n))
     mask <- logical(n); mask[1] <- TRUE
     dd <- dd - rep(dd[mask], each=n)

     starts <- round(rep(start(gr), each=n) + dd)
     ends <- c(starts[-1], 0) - 1L
     ends[rev(mask)] <- end(gr)

     gr <- gr[rep(seq_along(gr), each=n)]
     ranges(gr) <- IRanges(starts, ends)
     gr
}

and in action

 > gr <- GRanges("A", IRanges(c(1, 21, 21), width=c(10, 20, 30)))
 > table(width(intoNbins(gr, 10)))

  1  2  3
10 10 10

It might make sense to split the GRanges into a GRangesList, one element 
for each CpG island, replacing the last line of intoNbins with

     split(gr, rep(seq_along(d), each=n))

Martin

>
> thanks again
> Dolev
>
> On Thu, Sep 6, 2012 at 5:17 PM, Michael Love <love at molgen.mpg.de> wrote:
>
>> Hi Dolev,
>>
>> The subdivideGRanges() function cannot accomplish this task.
>>
>> from the manual:
>>
>> "Takes an input GRanges object and, splits each range into multiple ranges
>> of nearly equal width. For an input range of width w and subdividing size
>> s, it will subdivide the range into max(1,floor(w/s)) nearly equal width
>> ranges."
>>
>> So the function tries to keep the widths of the subdivided ranges around s
>> basepairs, rather than keeping the number of subdivided ranges constant.
>>
>> I will include a more helpful error message if a vector is provided for
>> the subsize argument.
>>
>> cheers,
>>
>> Mike
>>
>>
>>
>> On 09/06/12 16:02, Valerie Obenchain wrote:
>>
>>> Hi Dolev,
>>>
>>> I am copying the maintainer of exomeCopy which is the package
>>> subdivideGRanges() comes from. Please provide a small reproducable example
>>> of the problem you are having (i.e., a few lines of code that someone else
>>> can run to get the same error). This makes it easier for others to help.
>>>
>>> Valerie
>>>
>>> On 09/06/12 04:39, d r wrote:
>>>
>>>> Hello
>>>>
>>>> I have a GRanges object representing CpG islands, that I want to
>>>> subdivide
>>>> so that each island will be represented by 10 ranges of equal width.
>>>>
>>>> To do this, I created a vector containing the widths of the islands
>>>> divided
>>>> by ten:
>>>>
>>>> swidth<-as.integer(width(**islandshg19)/10)
>>>>
>>>>
>>>>
>>>> and then called subdivideGRanges():
>>>>
>>>>
>>>> islandshg19<-subdivideGRanges(**islandshg19,
>>>> subsize=as.integer(width/10))
>>>>
>>>>
>>>> and got a multitude of warnings.
>>>>
>>>>
>>>> This approach did work when I tried it on a GRanges object containing a
>>>> single range, and a width vector of length 1. When I tried to extend the
>>>> sample to include two objetcs, I got these warning message:
>>>>
>>>> 1: In if (widths[i]<  2 * subsize) { ... :
>>>>
>>>>     the condition has length>  1 and only the first element will be used
>>>>
>>>> 2: In if (width<  2 * subsize) { ... :
>>>>
>>>>     the condition has length>  1 and only the first element will be used
>>>>
>>>> 3: In 0:(nchunks - 1) : numerical expression has 2 elements: only the
>>>> first used
>>>>
>>>>
>>>>
>>>> Which makes me think that the problem may be related to assigning the
>>>> right
>>>> subszie to the the rigth range.
>>>>
>>>> Does anyone have an idea how to do that?
>>>>
>>>> Thanks in advance
>>>>
>>>> Dolev Rahat
>>>>
>>>>
>>>>
>>>> sessionInfo(if required):
>>>>
>>>> R version 2.15.1 (2012-06-22)
>>>>
>>>> Platform: x86_64-pc-mingw32/x64 (64-bit)
>>>>
>>>>
>>>>
>>>> locale:
>>>>
>>>> [1] LC_COLLATE=English_United States.1252 LC_CTYPE=English_United
>>>> States.1252    LC_MONETARY=English_United States.1252
>>>>
>>>> [4] LC_NUMERIC=C LC_TIME=English_United
>>>> States.1252
>>>>
>>>>
>>>>
>>>> attached base packages:
>>>>
>>>> [1] stats     graphics  grDevices utils     datasets  methods base
>>>>
>>>>
>>>>
>>>> other attached packages:
>>>>
>>>>    [1] exomeCopy_1.2.0      Rsamtools_1.8.6 Biostrings_2.24.1
>>>> rtracklayer_1.16.3   taRifx_1.0.4         reshape2_1.2.1
>>>> plyr_1.7.1
>>>>
>>>>    [8] XML_3.9-4.1          GenomicRanges_1.9.54 IRanges_1.15.40
>>>> BiocGenerics_0.3.1   BiocInstaller_1.4.7
>>>>
>>>>
>>>>
>>>> loaded via a namespace (and not attached):
>>>>
>>>> [1] bitops_1.0-4.1  BSgenome_1.25.7 RCurl_1.91-1.1 stats4_2.15.1
>>>> stringr_0.6.1   tools_2.15.1    zlibbioc_1.2.0
>>>>
>>>>      [[alternative HTML version deleted]]
>>>>
>>>> ______________________________**_________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/**listinfo/bioconductor<https://stat.ethz.ch/mailman/listinfo/bioconductor>
>>>> Search the archives: http://news.gmane.org/gmane.**
>>>> science.biology.informatics.**conductor<http://news.gmane.org/gmane.science.biology.informatics.conductor>
>>>>
>>>
>>>
>>
>
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list