[BioC] fastest way to keep score when reducing GRanges data

Hervé Pagès hpages at fhcrc.org
Wed Feb 26 20:58:38 CET 2014


Hi Jianhong,

Thanks for the new code. It still didn't work out of the box for me, but I
managed to run it after some minor edits. Please understand that before
we can show you the "highly efficient code", we need to understand exactly
what you're trying to do. And since you didn't give many details, the
"slow code" is the only thing we have to go on.

So what you're trying to do looks like a common use case to me: you
have a numeric variable defined along the genome (the score in your
case), and you want to summarize it on tiling windows (often fixed-size
windows, but they don't need to be). It's a classic problem in digital
signal processing, called resampling. Note that you cannot just "keep"
the score: you need to summarize it in one way or another. In your case,
you chose to sum the scores associated with the ranges that overlap
a given tiling window.
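
To make the snippets below self-contained, here is a small version of your
input, reconstructed from the code you posted (the window starts are my own
guess, since the 'scale[1]-1' in your code doesn't run as posted):

   library(GenomicRanges)

   ## Width-2 ranges, each carrying a score, plus fixed-size tiling windows.
   size <- 50
   windowSize <- 10
   .dat <- GRanges("chr1", IRanges(start=1:size, width=2), strand="+",
                   score=sample(1:size, size))
   ## The window starts below are an assumption (the original code errors out).
   GRwin <- GRanges("chr1",
                    IRanges(start=(0:(size %/% windowSize - 1)) * windowSize + 1,
                            width=windowSize),
                    strand="+")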

This topic is covered in the examples section of the man page for the
tileGenome() function in the GenomicRanges package. Note that the approach
described there uses the "weighted coverage" of the variable to solve the
problem; no findOverlaps() or reduce() is needed:

   score <- coverage(.dat, weight="score")

This puts the score in an RleList object, so now there is a score
associated with each genomic position. Note that when the input
object has overlapping ranges (which is the case for your input
object '.dat'), the score associated with a given genomic position is
the sum of the scores of the original ranges that cover that position.
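
A tiny throwaway example of that last point (two overlapping width-2 ranges;
the position covered by both gets the sum of the two scores):

   gr <- GRanges("chr1", IRanges(start=c(1, 2), width=2), score=c(10, 5))
   coverage(gr, weight="score")
   ## chr1: position 1 -> 10, position 2 -> 10 + 5 = 15, position 3 -> 5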

Then, to summarize 'score' on your fixed-size tiling windows, you need
a summarizing function like the binnedAverage() function shown in
?tileGenome. binnedAverage() computes the average over each window, but
it's easy to write a summarizing function that computes the sum instead:

      binnedSum <- function(bins, numvar, mcolname)
      {
          stopifnot(is(bins, "GRanges"))
          stopifnot(is(numvar, "RleList"))
          stopifnot(identical(seqlevels(bins), names(numvar)))
          ## Group the bins by chromosome.
          bins_per_chrom <- split(ranges(bins), seqnames(bins))
          ## For each chromosome, set Views on the per-position scores and
          ## sum them within each bin.
          sums_list <- lapply(names(numvar),
              function(seqname) {
                  views <- Views(numvar[[seqname]],
                                 bins_per_chrom[[seqname]])
                  viewSums(views)
              })
          ## Bring the per-bin sums back to the original order of 'bins'
          ## and store them as a new metadata column.
          new_mcol <- unsplit(sums_list, as.factor(seqnames(bins)))
          mcols(bins)[[mcolname]] <- new_mcol
          bins
      }

Then:

      GRwin2 <- binnedSum(GRwin, score, "binned_score")
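
Spelled out on the small '.dat' from above, with the bins generated by
tileGenome() as in its man page (just a sketch; the seqlength is made up
here so the genome can be tiled, with real data you'd use the real ones):

      seqlengths(.dat) <- c(chr1=60)   # made-up seqlength, for illustration
      bins <- tileGenome(seqinfo(.dat), tilewidth=windowSize,
                         cut.last.tile.in.chrom=TRUE)
      score <- coverage(.dat, weight="score")
      binnedSum(bins, score, "binned_score")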

This will not give you exactly the same result as your "slow code" below,
for two reasons: you deal differently with original ranges that span more
than one tiling window (you arbitrarily assign the whole range to a single
window), and here the contribution of each original score value to the
final "binned score" is weighted by the width of the original range (but
since all the ranges in your '.dat' object have width=2, this is easy to
adjust for; see the sketch below).
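
If you want to remove that width weighting, a simple adjustment (a sketch,
assuming the objects from the previous snippet) is to spread each score over
the width of its range before computing the coverage:

      ## Each range now contributes its full score (instead of width * score)
      ## to the bins. A range straddling a bin boundary still has its score
      ## split between the two bins, in proportion to the overlap, unlike in
      ## your code, which assigns it entirely to a single bin.
      score2 <- coverage(.dat, weight=.dat$score / width(.dat))
      binnedSum(bins, score2, "binned_score")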

Hope this helps,
H.


On 02/26/2014 08:40 AM, Ou, Jianhong wrote:
> Hi Herve,
>
> I didn't mean that I already have the code. As I asked, I want the highly
> efficient one. What I am doing now is something like this (much faster
> than the previous code):
>
> size <- 50000
> FUN <- sum
>
> .dat <- GRanges("chr1", IRanges(start=1:size, width=2), strand="+",
>                 score=sample(1:size, size))
> windowSize <- 10
> GRwin <- GRanges("chr1",
>                  IRanges(start=(0:(size/windowSize))*windowSize+scale[1]-1,
>                          width=windowSize),
>                  strand="+")
> ## new codes, rename seqnames and reduce
> system.time({
>     ## split by strand NO NEED
>     ol <- as.data.frame(findOverlaps(.dat, GRwin))
>     ol <- ol[!duplicated(ol[,1]), 2]
>     ## change the seqnames by windows
>     new.seqname <- paste(seqnames(.dat), "__gp", ol, sep="")
>     .newData <- GRanges(new.seqname,
>                         IRanges(start(.dat), end(.dat), names=names(.dat)),
>                         strand=strand(.dat))
>     mcols(.newData) <- mcols(.dat)
>     .datR <- reduce(.newData, with.mapping=TRUE)
>     .datR$score <- sapply(.datR$mapping, function(.idx) FUN(.dat$score[.idx]))
>     .datReduceWithScore <- GRanges(gsub("__gp\\d+$", "", seqnames(.datR)),
>                                    IRanges(start(.datR), end(.datR)),
>                                    strand=strand(.datR),
>                                    score=score(.datR))
> })
>
> ##    user  system elapsed
> ##   1.469   0.022   1.492
>
> ## old codes, split and reduce
> system.time({
>     ol <- findOverlaps(.dat, GRwin)
>     ol <- as.data.frame(ol)
>     ol <- ol[!duplicated(ol[,1]),]
>     .datS <- split(.dat, ol[,2])
>     reduceValue <- function(.datReduce){
>         .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
>         wid <- width(.datReduce)
>         .datReduceScore <- .datReduce$score
>         .datReduceM$score <- sapply(.datReduceM$mapping, function(.idx){
>             FUN(.datReduceScore[.idx])
>         })
>         .datReduceM$mapping <- NULL
>         .datReduceM
>     }
>     .datReduceWithScore2 <- lapply(.datS, reduceValue)
>     .datReduceWithScore2 <- unlist(GRangesList(.datReduceWithScore2))
> })
>
> ##    user  system elapsed
> ## 300.591   3.833 310.751
>
> .datReduceWithScore <-
>     .datReduceWithScore[order(as.character(seqnames(.datReduceWithScore)),
>                               start(.datReduceWithScore))]
> names(.datReduceWithScore2) <- NULL
> identical(.datReduceWithScore, .datReduceWithScore2)
> ## TRUE
>
> Yours Sincerely,
>
> Jianhong Ou
>
> LRB 670A
> Program in Gene Function and Expression
> 364 Plantation Street Worcester,
> MA 01605
>
>
>
>
> On 2/25/14 8:21 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:
>
>> Hi Jianhong,
>>
>> It would help enormously if you could send code that we can actually
>> run. Thanks!
>>
>> H.
>>
>>
>> On 02/24/2014 07:53 AM, Ou, Jianhong wrote:
>>> Hi ALL,
>>>
>>> I want to reduce GRanges data by a fixed window size and keep the scores
>>> after the reduce. My code is:
>>>
>>> .dat <- GRanges("chr1", Iranges(start=1:50, width=2), strand="+",
>>> score=Sample(1:50, 50))
>>> windowSize <- 10
>>> Grwin <- GRanges("chr1", IRanges(start=(0:5)*windowSize+scale[1]-1,
>>>                                             width=windowSize),
>>> strand="+")
>>> ol <- findOverlaps(.dat, GRwin)
>>> ol <- as.data.frame(ol)
>>> ol <- ol[!duplicated(ol[,1]),]
>>> .dat <- split(.dat, ol[,2])
>>> reduceValue <- function(.datReduce){
>>>                   .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
>>>                   wid <- width(.datReduce)
>>>                   .datReduceScore <- .datReduce$value
>>>                   .datReduceM$score <- sapply(.datReduceM$mapping,
>>>                       function(.idx){
>>>                           round(sum(.datReduceScore[.idx]*wid[.idx])/sum(wid[.idx]))
>>>                       })
>>>                   .datReduceM$mapping <- NULL
>>>                   .datReduceM
>>>               }
>>> .dat <- lapply(.dat, reduceValue)
>>> .dat <- unlist(GRangesList(.dat))
>>>
>>> But the efficiency is very low. What is the best way to keep scores
>>> when reducing GRanges data by a fixed window size? Thanks for your help.
>>>
>>> Yours sincerely,
>>>
>>> Jianhong Ou
>>>
>>> LRB 670A
>>> Program in Gene Function and Expression
>>> 364 Plantation Street Worcester,
>>> MA 01605
>>>
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


