[BioC] fastest way to keep score when reduce Granges data

Ou, Jianhong Jianhong.Ou at umassmed.edu
Wed Feb 26 17:40:02 CET 2014


Hi Herve,

I am not mean I have the code. As I asked, I want the high efficient one.
What I am doing now is that something like this (much faster than last
codes)

size <- 50000
FUN <- sum

.dat <- GRanges("chr1", IRanges(start=1:size, width=2), strand="+",
score=sample(1:size, size))
windowSize <- 10
GRwin <- GRanges("chr1",
IRanges(start=(0:(size/windowSize))*windowSize+scale[1]-1,
                                            width=windowSize), strand="+")
##new codes, rename seqnames and reduce
system.time({
##split by strand NO NEED
ol <- as.data.frame(findOverlaps(.dat, GRwin))
ol <- ol[!duplicated(ol[,1]), 2]
##change the seqnames by windows.
new.seqname <- paste(seqnames(.dat), "__gp", ol, sep="")
.newData <- GRanges(new.seqname, IRanges(start(.dat), end(.dat),
names=names(.dat)), strand=strand(.dat))
mcols(.newData) <- mcols(.dat)
.datR <- reduce(.newData, with.mapping=TRUE)
.datR$score <- sapply(.datR$mapping, function(.idx) FUN(.dat$score[.idx]))
.datReduceWithScore <- GRanges(gsub("__gp\\d+$", "", seqnames(.datR)),
IRanges(start(.datR), end(.datR)), strand=strand(.datR),
score=score(.datR))
})

##user  system elapsed
##1.469   0.022   1.492
##old codes, split and reduce
system.time({
ol <- findOverlaps(.dat, GRwin)
ol <- as.data.frame(ol)
ol <- ol[!duplicated(ol[,1]),]
.datS <- split(.dat, ol[,2])
reduceValue <- function(.datReduce){
                  .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
                  wid <- width(.datReduce)
                  .datReduceScore <- .datReduce$score
                  .datReduceM$score <- sapply(.datReduceM$mapping,
function(.idx){
FUN(.datReduceScore[.idx])
                  })
                  .datReduceM$mapping <- NULL
                  .datReduceM
              }
.datReduceWithScore2 <- lapply(.datS, reduceValue)
.datReduceWithScore2 <- unlist(GRangesList(.datReduceWithScore2))
})

##   user  system elapsed
##300.591   3.833 310.751
.datReduceWithScore <-
.datReduceWithScore[order(as.character(seqnames(.datReduceWithScore)),
start(.datReduceWithScore))]
names(.datReduceWithScore2) <- NULL
identical(.datReduceWithScore, .datReduceWithScore2)
##TRUE

Yours Sincerely,

Jianhong Ou

LRB 670A
Program in Gene Function and Expression
364 Plantation Street Worcester,
MA 01605




On 2/25/14 8:21 PM, "Hervé Pagès" <hpages at fhcrc.org> wrote:

>Hi Jianhong,
>
>It would help enormously if you could send code that we can actually
>run. Thanks!
>
>H.
>
>
>On 02/24/2014 07:53 AM, Ou, Jianhong wrote:
>> Hi ALL,
>>
>> I want to reduce a GRanges data by fixed window size and keep scores
>>after reduce. My code is
>>
>> .dat <- GRanges("chr1", Iranges(start=1:50, width=2), strand="+",
>>score=Sample(1:50, 50))
>> windowSize <- 10
>> Grwin <- GRanges("chr1", IRanges(start=(0:5)*windowSize+scale[1]-1,
>>                                            width=windowSize),
>>strand="+")
>> ol <- findOverlaps(.dat, GRwin)
>> ol <- as.data.frame(ol)
>> ol <- ol[!duplicated(ol[,1]),]
>> .dat <- split(.dat, ol[,2])
>> reduceValue <- function(.datReduce){
>>                  .datReduceM <- reduce(.datReduce, with.mapping=TRUE)
>>                  wid <- width(.datReduce)
>>                  .datReduceScore <- .datReduce$value
>>                  .datReduceM$score <- sapply(.datReduceM$mapping,
>>function(.idx){
>>                 
>>round(sum(.datReduceScore[.idx]*wid[.idx])/sum(wid[.idx]))
>>                  })
>>                  .datReduceM$mapping <- NULL
>>                  .datReduceM
>>              }
>> .dat <- lapply(.dat, reduceValue)
>> .dat <- unlist(GRangesList(.dat))
>>
>> But the efficiency is very low. What is the best way to keep scores
>>when reduce GRanges data by fixed window size? Thanks for your help.
>>
>> Yours sincerely,
>>
>> Jianhong Ou
>>
>> LRB 670A
>> Program in Gene Function and Expression
>> 364 Plantation Street Worcester,
>> MA 01605
>>
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>>http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
>-- 
>Hervé Pagès
>
>Program in Computational Biology
>Division of Public Health Sciences
>Fred Hutchinson Cancer Research Center
>1100 Fairview Ave. N, M1-B514
>P.O. Box 19024
>Seattle, WA 98109-1024
>
>E-mail: hpages at fhcrc.org
>Phone:  (206) 667-5791
>Fax:    (206) 667-1319



More information about the Bioconductor mailing list