[BioC] Re(surrecting): [Bioc-sig-seq] Rsamtools countBam labeling

Fri Apr 1 05:27:21 CEST 2011

On 03/31/2011 12:49 PM, Cook, Malcolm wrote:
> Martin,
>
> Hmmm, Thanks .... I think I'm getting it....
>
> Following your lead, I can directly re-order cnt in the OriginalOrder as:
>
>> cnt[sort(unlist(split(values(which2), seqnames(which2)))$OriginalOrder,index.return=TRUE)$ix,]
>    space start  end width    file records nucleotides
> 2  seq2   100 1000   901 ex1.bam    1169       41235
> 1  seq1  1000 2000  1001 ex1.bam     612       21549
> 3  seq2  1000 2000  1001 ex1.bam     642       22640
>
>
> ... which can usefully(?) be abstracted as
>
> countBamWhich<- function (file,which,index=file,...) {
> ### wrapper for countBam that reorders its results to aggree with
> ### 'which', a required ScanBamParam.  ... params are additional
> ### ScanBamParam options.
> ###
> ### ASSUMES: internal implementation detail of ScanBamParam. c.f.
> ### http://permalink.gmane.org/gmane.science.biology.informatics.conductor/34208)
>    param<- ScanBamParam(which=which,...)
>    values(which)[['OriginalOrder']]<- 1:length(which)
>    CBW = countBam(file,index,param=ScanBamParam(which=which))
>    CBW[sort(unlist(split(values(which),
>                          seqnames(which)))$OriginalOrder,index.return=TRUE)$ix,]
> }

Hi Malcolm -- maybe the return value is a GRanges with counts? It's 
probably better to split a simple vector and use that to impose order on 
the result, rather than splitting values(). The return value could then 
be cbind'ed as needed...

countBam_GRanges <-
     function(file, index=file, ..., param=GRanges())
{
     if (0L == length(param))
         stop("'param' must have length() != 0")
     ## FIXME: not sure about 'drop' argument to split?
     splt <- split(seq_len(length(param)), as.factor(seqnames(param)))
     o <- order(unlist(splt, use.names=FALSE))
     cnt <- countBam(file, index, ..., param=ScanBamParam(which=param))
     df <- as(cnt[o, c("records", "nucleotides")], "DataFrame")
     values(param) <- df
     param
}

counts <- countBam_GRanges(fl, param=which1)
values(which1) <- cbind(values(which1), values(counts))

Martin

>
> allowing me to write
>
>> countBamWhich(fl, which2)
>    space start  end width    file records nucleotides
> 2  seq2   100 1000   901 ex1.bam    1169       41235
> 1  seq1  1000 2000  1001 ex1.bam     612       21549
> 3  seq2  1000 2000  1001 ex1.bam     642       22640
>
> All in favor?
>
> ~ Malcolm
>
>
>
>
>> -----Original Message-----
>> From: Martin Morgan [mailto:mtmorgan at fhcrc.org]
>> Sent: Thursday, March 31, 2011 11:41 AM
>> To: Cook, Malcolm
>> Cc: 'myoung at wehi.EDU.AU'; 'bioconductor at r-project.org';
>> 'Bioc-sig-sequencing at r-project.org'
>> Subject: Re: [BioC] Re(surrecting): [Bioc-sig-seq] Rsamtools
>> countBam labeling
>>
>> On 03/31/2011 08:32 AM, Cook, Malcolm wrote:
>>> Matt et. al.,
>>>
>>> I wonder if a satisfactory resolution to the issue of "the order
>>> changes between the GRanges object and the countBam data.frame
>>>
>>>
>> http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg01144
>>> .html
>>>
>>>   I am presented with the same issue and poised to tackle it
>> but wondr
>>> if a generic solution emerged from you inquiries&   efforts.
>>
>> Hi Malcolm --
>>
>> For a reproducible example,
>>
>>     library(Rsamtools)
>>     example(countBam)
>>     which1<- as(which, "GRanges")
>>     ## which2 might be where your data actually starts
>>     which2<- which1[c(2,1,3)]
>>     values(which2)[["OriginalOrder"]]<- 1:3
>>     param<- ScanBamParam(which=which2)
>>     cnt<- countBam(fl, param=param)
>>
>> What happens is that ScanBamParam converts its argument to an
>> IRangesList, using split(ranges(which2), seqnames(which2)).
>> So do the same for the values and unlist
>>
>>     cntVals<- unlist(split(values(which2), seqnames(which2)))
>>
>> then cbind coerced values
>>
>>     cbind(cnt, as.data.frame(cntVals))
>>
>> with
>>
>>   >  which2
>> GRanges with 3 ranges and 1 elementMetadata value
>>       seqnames       ranges strand | OriginalOrder
>>          <Rle>     <IRanges>   <Rle>  |<integer>
>> [1]     seq2 [ 100, 1000]      * |             1
>> [2]     seq1 [1000, 2000]      * |             2
>> [3]     seq2 [1000, 2000]      * |             3
>>
>> seqlengths
>>    seq1 seq2
>>      NA   NA
>>   >  cbind(cnt, as.data.frame(cntVals))
>>     space start  end width    file records nucleotides OriginalOrder
>> 1  seq1  1000 2000  1001 ex1.bam     612       21549             2
>> 2  seq2   100 1000   901 ex1.bam    1169       41235             1
>> 3  seq2  1000 2000  1001 ex1.bam     642       22640             3
>>
>> Martin
>>
>>>
>>> Thanks,
>>>
>>> Malcolm Cook Stowers Institute for Medical Research -
>> Bioinformatics
>>> Kansas City, Missouri  USA
>>>
>>>
>>> _______________________________________________
>> Bioconductor mailing
>>> list Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor Search the
>>> archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Computational Biology
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>>
>> Location: M1-B861
>> Telephone: 206 667-2793

-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793