[BioC] loop over IRanges spaces

Patrick Aboyoun paboyoun at fhcrc.org
Mon Apr 26 21:08:43 CEST 2010


Yvan,
It appears to me that you are trying to perform two conflicting activities:

1) Calculate the running sum of a metric over an annotated sequence (as 
evidenced by your aggregate function call)
2) Find the sum for a metric across specified intervals on the annotated 
sequence (as evidenced by your desire to assign the aggregated sums into 
an existing RangedData object)

Taking a step back, I am guessing that you are trying to transform 
something akin to a UCSC bed file into something else that is UCSC bed 
file like. If you are using rtracklayer, this means your initial data 
are stored in a RangedData object. To create a RangedData object 
containing the running sum of a values column from an initial RangedData 
object, I recommend:

1) Creating an RleList object from the RangedData object using the 
coverage function. Make sure to specify the metric of interest in the 
weight argument to coverage.
2) Using the runsum function on the RleList object to calculate your 
running sums.
3) Creating a RangedData object from the RleList object in step 2 using 
as(<<obj>>, "RangedData")

Here is an example:

 > # Step 1: create an RleList representation of the metric
 > rd <- RangedData(IRanges(start = c(5, 10, 15, 2, 4, 8), end = c(7, 
14, 21, 3, 6, 9)),
                    score = 1:6, space = rep(c("A", "B"), each = 3))
 > scoreRleList <- coverage(rd, weight = "score", width = list(A = 30, B 
= 10))
 > scoreRleList
SimpleRleList of length 2
$A
'integer' Rle of length 30 with 6 runs
   Lengths: 4 3 2 5 7 9
   Values : 0 1 0 2 3 0

$B
'integer' Rle of length 10 with 6 runs
   Lengths: 1 2 3 1 2 1
   Values : 0 4 5 0 6 0

 > # Step 2: calculate the running sums
 > scoreRunsum <- runsum(scoreRleList, k = 3, endrule = "constant")
 > scoreRunsum
SimpleRleList of length 2
$A
'integer' Rle of length 30 with 15 runs
   Lengths: 3 1 1 1 1 1 1 1 3 1 1 5 1 1 8
   Values : 0 1 2 3 2 1 2 4 6 7 8 9 6 3 0

$B
'integer' Rle of length 10 with 7 runs
   Lengths:  2  1  1  1  1  1  3
   Values :  8 13 14 15 10 11 12


 > # Step 3: Create a RangedData representation of the running sums
 > rdRunsum <- as(scoreRunsum, "RangedData")
 > rdRunsum
RangedData with 22 rows and 1 value column across 2 spaces
           space    ranges   |     score
<character> <IRanges>   | <integer>
1             A  [ 1,  3]   |         0
2             A  [ 4,  4]   |         1
3             A  [ 5,  5]   |         2
4             A  [ 6,  6]   |         3
5             A  [ 7,  7]   |         2
6             A  [ 8,  8]   |         1
7             A  [ 9,  9]   |         2
8             A  [10, 10]   |         4
9             A  [11, 13]   |         6
...         ...       ... ...       ...
14            A  [22, 22]   |         3
15            A  [23, 30]   |         0
16            B  [ 1,  2]   |         8
17            B  [ 3,  3]   |        13
18            B  [ 4,  4]   |        14
19            B  [ 5,  5]   |        15
20            B  [ 6,  6]   |        10
21            B  [ 7,  7]   |        11
22            B  [ 8, 10]   |        12

 > sessionInfo()
R version 2.11.0 Patched (2010-04-24 r51820)
i386-apple-darwin9.8.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] IRanges_1.6.1

loaded via a namespace (and not attached):
[1] tools_2.11.0



On 4/23/10 6:26 AM, Michael Lawrence wrote:
> Also note that it's not really necessary to loop here, as is often the case
> with IRanges:
>
> rd$windo<- unlist(runmean(RleList(values(rd)[,"score"])))
>
> On Fri, Apr 23, 2010 at 6:13 AM, Michael Lawrence<michafla at gene.com>  wrote:
>
>    
>>
>> On Fri, Apr 23, 2010 at 1:49 AM, Yvan<yvan.strahm at uni.no>  wrote:
>>
>>      
>>>   Hello,
>>> Thank you both of you.
>>>
>>> I could could calculate the sliding window, but not as a Rle object, could
>>> not append values for the last w-1 position in the Rle object in order to
>>> take care of the size problem.
>>>
>>>
>>>        
>> Why not? Rle supports all the normal vector operations. And runsum or
>> runmean will output a vector of the same size as the input, using a choice
>> of two endrules. If you want 0's at the end, try something like:
>>
>> rle[(nrow(rd)-w+1):nrow(rd)]<- 0
>>
>>
>>
>>      
>>> So I did it like that:
>>>
>>> params<-RDApplyParams(rd,function(rd)
>>> append((diff(c(0,cumsum(rd$score)),lag=w)/w),rep(0,each=w-1),after=(length(rd$score)-w+1)))
>>>
>>> But when I try to add the new values to the rangedData object I got these
>>> error.
>>>
>>>        
>>>> values(rd)[,"windo"]<-rdapply(params)
>>>>          
>>> Error in `[<-`(`*tmp*`, , j, value =<S4 object of class "DataFrame">) :
>>>    ncol(x[j]) != ncol(value)
>>> In addition: Warning messages:
>>> 1: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>    longer argument not a multiple of length of shorter
>>> 2: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>    longer argument not a multiple of length of shorter
>>>
>>> But when I check the size, they are the same, here for one space
>>>
>>>        
>>>> x<-rdapply(params)
>>>> length(x$SCAFFOLD_100) == length(rd["SCAFFOLD_100"]$windo)
>>>>          
>>> [1] TRUE
>>>
>>>
>>>        
>> It may be that that type of insertion is unsupported. Why not just do
>> something like:
>>
>> rd$window<- unlist(x)
>>
>>
>>      
>>>   Maybe params miss a parameter or the way I try to update the rd object is
>>> wrong. Anyway form the rdapply output a vector could be created and so a new
>>> rd object with the new value column.
>>>
>>> yvan
>>>
>>>
>>> On 22/04/10 15:51, Michael Lawrence wrote:
>>>
>>>
>>>
>>> On Thu, Apr 22, 2010 at 5:49 AM, Michael Dondrup<Michael.Dondrup at uni.no>wrote:
>>>
>>>        
>>>> Hi,
>>>> how about funtion rdapply (not lapply) which is for that?
>>>>
>>>>
>>>>          
>>> lapply() should apply per-space as well, basically providing a short-cut
>>> for the more complicated rdapply().
>>>
>>>        
>>>> lapply(rd, function(x) sum(x$score))
>>>>          
>>> $chr1
>>> [1] 3
>>>
>>> $chr2
>>> [1] 0
>>>
>>> sapply() also works:
>>>        
>>>> sapply(rd, function(x) sum(x$score))
>>>>          
>>> chr1 chr2
>>>     3    0
>>>
>>> Another choice is tapply:
>>>        
>>>> tapply(rd$score, space(rd), sum)
>>>>          
>>> chr1 chr2
>>>     3    0
>>>
>>> Michael
>>>
>>>   The code below computes the sum score for each space in the RangedData:
>>>        
>>>> # taken from the examples mostly:
>>>>          
>>>>> ranges<- IRanges(c(1,2,3),c(4,5,6))
>>>>>    score<- c(2L, 0L, 1L)
>>>>>    rd<- RangedData(ranges, score, space = c("chr1","chr2","chr1"))
>>>>> rd
>>>>>            
>>>> RangedData with 3 rows and 1 value column across 2 spaces
>>>>         space    ranges |     score
>>>>   <character>  <IRanges>  |<integer>
>>>> 1        chr1    [1, 4] |         2
>>>> 2        chr1    [3, 6] |         1
>>>> 3        chr2    [2, 5] |         0
>>>>          
>>>>> params<- RDApplyParams(rd, function(rd) sum(score(rd)))
>>>>> rdapply(params)
>>>>>            
>>>> $chr1
>>>> [1] 3
>>>>
>>>> $chr2
>>>> [1] 0
>>>>
>>>>
>>>> Cheers
>>>> Michael
>>>>
>>>> Am Apr 22, 2010 um 1:57 PM schrieb Yvan:
>>>>
>>>>          
>>>>> On 21/04/10 18:43, Michael Lawrence wrote:
>>>>>            
>>>>>>
>>>>>> On Wed, Apr 21, 2010 at 6:07 AM, Yvan<yvan.strahm at uni.no
>>>>>> <mailto:yvan.strahm at uni.no>>  wrote:
>>>>>>
>>>>>>     Hello List,
>>>>>>
>>>>>>     I am confused about  how to loop over a rangedData object.
>>>>>>     I have this rangedData
>>>>>>
>>>>>>     RangedData with 61 rows and 1 value column across 3 spaces
>>>>>>             space     ranges |     score
>>>>>>     <character>  <IRanges>  |<numeric>
>>>>>>     1   SCAFFOLD_1 [  8,   8] |  -0.09405
>>>>>>
>>>>>>     and the spaces are
>>>>>>
>>>>>>     "SCAFFOLD_1"   "SCAFFOLD_10"  "SCAFFOLD_100"
>>>>>>
>>>>>>     using aggregate it is possible to apply a function to one of the
>>>>>>              
>>>> space
>>>>          
>>>>>>     aggregate(rd["SCAFFOLD_1"]$score, start =
>>>>>>     1:(length(rd["SCAFFOLD_1"]$score)-w+1), width = w, FUN = sum)
>>>>>>
>>>>>>     but how can I apply the aggregate to all space without a for loop ?
>>>>>>
>>>>>>
>>>>>> It looks like you're attempting a running window sum of the score
>>>>>> vector. There are more efficient ways of doing this besides
>>>>>> aggregate(). If you convert the score into an Rle, you can use
>>>>>>              
>>>> runsum().
>>>>          
>>>>>> Anyway, to do this over each space individually, use lapply().
>>>>>>
>>>>>> This would come out to something like:
>>>>>>
>>>>>> values(rd)[,"smoothScore"]<- lapply(rd, function(x)
>>>>>> runsum(Rle(x$score), w))
>>>>>>
>>>>>> Probably not exactly right, but it gets you in the right direction...
>>>>>>
>>>>>> Michael
>>>>>>
>>>>>>
>>>>>>              
>>>>> Hello Michael,
>>>>>
>>>>> Thanks for the answer and the tip about runsum!
>>>>> I try with lapply but could not get it working right, the main problem
>>>>> is that the runsum is calculated on all values and not for a each
>>>>> specific spaces.
>>>>> Sorry, I should have been more precise in the problem description.
>>>>> The runsum should be calculated in a space specific manner, let say w=2
>>>>>
>>>>> space        score    cumsum
>>>>> 1 space1    1            3
>>>>> 2 space1    2            4
>>>>> 3 space1    2            NA
>>>>> 4 space2    10          21
>>>>> 5 space2    11          22
>>>>> 6 space2    11          NA
>>>>>
>>>>> Is it possible to do it with lapply?
>>>>> Thanks again for your help
>>>>> cheers,
>>>>> yvan
>>>>>
>>>>>        [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>>            
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at stat.math.ethz.ch
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>>          
>>>
>>>
>>>        
>>      
> 	[[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list