[BioC] loop over IRanges spaces

Yvan yvan.strahm at uni.no
Thu Apr 29 13:51:27 CEST 2010


Hello Patrick,

Sorry for the late answer and thank you very much for these precisions.

Maybe I should try to explain what I am trying to do. I am starting with 
a gff file produced by Nimblegen's NimbleScan software for a CGH high 
density array. Out of the nine columns, I am interested mainly in the 
first (seqname) and the sixth (score) even if all columns are loaded 
into a gff object/matrix(correct term?).  The main purpose of my project 
is to filtrate out some region of each spaces based on sliding window 
over the score values. The same methodology used to filtrate out the 
regions will be used in a second pass to detect the region of interest. 
I was hoping using IRanges to perform the first sliding window on the 
different spaces, and then create some views for each spaces based on 
the sliding window results and finally recalculate the sliding window on 
all the views. After these two passes, a track for the ucsc genome 
browser should be created and also a gff file in order to view the 
results in the SignalMap software from Nimblegen.

Hope I was clear enough and do you think that IRanges is the correct tool?
Thank you again for your time and help.
Cheers,
yvan


On 26/04/10 21:08, Patrick Aboyoun wrote:
> Yvan,
> It appears to me that you are trying to perform two conflicting 
> activities:
>
> 1) Calculate the running sum of a metric over an annotated sequence 
> (as evidenced by your aggregate function call)
> 2) Find the sum for a metric across specified intervals on the 
> annotated sequence (as evidenced by your desire to assign the 
> aggregated sums into an existing RangedData object)
>
> Taking a step back, I am guessing that you are trying to transform 
> something akin to a UCSC bed file into something else that is UCSC bed 
> file like. If you are using rtracklayer, this means your initial data 
> are stored in a RangedData object. To create a RangedData object 
> containing the running sum of a values column from an initial 
> RangedData object, I recommend:
>
> 1) Creating an RleList object from the RangedData object using the 
> coverage function. Make sure to specify the metric of interest in the 
> weight argument to coverage.
> 2) Using the runsum function on the RleList object to calculate your 
> running sums.
> 3) Creating a RangedData object from the RleList object in step 2 
> using as(<<obj>>, "RangedData")
>
> Here is an example:
>
> > # Step 1: create an RleList representation of the metric
> > rd <- RangedData(IRanges(start = c(5, 10, 15, 2, 4, 8), end = c(7, 
> 14, 21, 3, 6, 9)),
>                    score = 1:6, space = rep(c("A", "B"), each = 3))
> > scoreRleList <- coverage(rd, weight = "score", width = list(A = 30, 
> B = 10))
> > scoreRleList
> SimpleRleList of length 2
> $A
> 'integer' Rle of length 30 with 6 runs
>   Lengths: 4 3 2 5 7 9
>   Values : 0 1 0 2 3 0
>
> $B
> 'integer' Rle of length 10 with 6 runs
>   Lengths: 1 2 3 1 2 1
>   Values : 0 4 5 0 6 0
>
> > # Step 2: calculate the running sums
> > scoreRunsum <- runsum(scoreRleList, k = 3, endrule = "constant")
> > scoreRunsum
> SimpleRleList of length 2
> $A
> 'integer' Rle of length 30 with 15 runs
>   Lengths: 3 1 1 1 1 1 1 1 3 1 1 5 1 1 8
>   Values : 0 1 2 3 2 1 2 4 6 7 8 9 6 3 0
>
> $B
> 'integer' Rle of length 10 with 7 runs
>   Lengths:  2  1  1  1  1  1  3
>   Values :  8 13 14 15 10 11 12
>
>
> > # Step 3: Create a RangedData representation of the running sums
> > rdRunsum <- as(scoreRunsum, "RangedData")
> > rdRunsum
> RangedData with 22 rows and 1 value column across 2 spaces
>           space    ranges   |     score
> <character> <IRanges>   | <integer>
> 1             A  [ 1,  3]   |         0
> 2             A  [ 4,  4]   |         1
> 3             A  [ 5,  5]   |         2
> 4             A  [ 6,  6]   |         3
> 5             A  [ 7,  7]   |         2
> 6             A  [ 8,  8]   |         1
> 7             A  [ 9,  9]   |         2
> 8             A  [10, 10]   |         4
> 9             A  [11, 13]   |         6
> ...         ...       ... ...       ...
> 14            A  [22, 22]   |         3
> 15            A  [23, 30]   |         0
> 16            B  [ 1,  2]   |         8
> 17            B  [ 3,  3]   |        13
> 18            B  [ 4,  4]   |        14
> 19            B  [ 5,  5]   |        15
> 20            B  [ 6,  6]   |        10
> 21            B  [ 7,  7]   |        11
> 22            B  [ 8, 10]   |        12
>
> > sessionInfo()
> R version 2.11.0 Patched (2010-04-24 r51820)
> i386-apple-darwin9.8.0
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] IRanges_1.6.1
>
> loaded via a namespace (and not attached):
> [1] tools_2.11.0
>
>
>
> On 4/23/10 6:26 AM, Michael Lawrence wrote:
>> Also note that it's not really necessary to loop here, as is often 
>> the case
>> with IRanges:
>>
>> rd$windo<- unlist(runmean(RleList(values(rd)[,"score"])))
>>
>> On Fri, Apr 23, 2010 at 6:13 AM, Michael Lawrence<michafla at gene.com>  
>> wrote:
>>
>>>
>>> On Fri, Apr 23, 2010 at 1:49 AM, Yvan<yvan.strahm at uni.no>  wrote:
>>>
>>>>   Hello,
>>>> Thank you both of you.
>>>>
>>>> I could could calculate the sliding window, but not as a Rle 
>>>> object, could
>>>> not append values for the last w-1 position in the Rle object in 
>>>> order to
>>>> take care of the size problem.
>>>>
>>>>
>>> Why not? Rle supports all the normal vector operations. And runsum or
>>> runmean will output a vector of the same size as the input, using a 
>>> choice
>>> of two endrules. If you want 0's at the end, try something like:
>>>
>>> rle[(nrow(rd)-w+1):nrow(rd)]<- 0
>>>
>>>
>>>
>>>> So I did it like that:
>>>>
>>>> params<-RDApplyParams(rd,function(rd)
>>>> append((diff(c(0,cumsum(rd$score)),lag=w)/w),rep(0,each=w-1),after=(length(rd$score)-w+1))) 
>>>>
>>>>
>>>> But when I try to add the new values to the rangedData object I got 
>>>> these
>>>> error.
>>>>
>>>>> values(rd)[,"windo"]<-rdapply(params)
>>>> Error in `[<-`(`*tmp*`, , j, value =<S4 object of class 
>>>> "DataFrame">) :
>>>>    ncol(x[j]) != ncol(value)
>>>> In addition: Warning messages:
>>>> 1: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>    longer argument not a multiple of length of shorter
>>>> 2: In mapply(f, ..., SIMPLIFY = FALSE) :
>>>>    longer argument not a multiple of length of shorter
>>>>
>>>> But when I check the size, they are the same, here for one space
>>>>
>>>>> x<-rdapply(params)
>>>>> length(x$SCAFFOLD_100) == length(rd["SCAFFOLD_100"]$windo)
>>>> [1] TRUE
>>>>
>>>>
>>> It may be that that type of insertion is unsupported. Why not just do
>>> something like:
>>>
>>> rd$window<- unlist(x)
>>>
>>>
>>>>   Maybe params miss a parameter or the way I try to update the rd 
>>>> object is
>>>> wrong. Anyway form the rdapply output a vector could be created and 
>>>> so a new
>>>> rd object with the new value column.
>>>>
>>>> yvan
>>>>
>>>>
>>>> On 22/04/10 15:51, Michael Lawrence wrote:
>>>>
>>>>
>>>>
>>>> On Thu, Apr 22, 2010 at 5:49 AM, Michael 
>>>> Dondrup<Michael.Dondrup at uni.no>wrote:
>>>>
>>>>> Hi,
>>>>> how about funtion rdapply (not lapply) which is for that?
>>>>>
>>>>>
>>>> lapply() should apply per-space as well, basically providing a 
>>>> short-cut
>>>> for the more complicated rdapply().
>>>>
>>>>> lapply(rd, function(x) sum(x$score))
>>>> $chr1
>>>> [1] 3
>>>>
>>>> $chr2
>>>> [1] 0
>>>>
>>>> sapply() also works:
>>>>> sapply(rd, function(x) sum(x$score))
>>>> chr1 chr2
>>>>     3    0
>>>>
>>>> Another choice is tapply:
>>>>> tapply(rd$score, space(rd), sum)
>>>> chr1 chr2
>>>>     3    0
>>>>
>>>> Michael
>>>>
>>>>   The code below computes the sum score for each space in the 
>>>> RangedData:
>>>>> # taken from the examples mostly:
>>>>>> ranges<- IRanges(c(1,2,3),c(4,5,6))
>>>>>>    score<- c(2L, 0L, 1L)
>>>>>>    rd<- RangedData(ranges, score, space = c("chr1","chr2","chr1"))
>>>>>> rd
>>>>> RangedData with 3 rows and 1 value column across 2 spaces
>>>>>         space    ranges |     score
>>>>> <character> <IRanges>  |<integer>
>>>>> 1        chr1    [1, 4] |         2
>>>>> 2        chr1    [3, 6] |         1
>>>>> 3        chr2    [2, 5] |         0
>>>>>> params<- RDApplyParams(rd, function(rd) sum(score(rd)))
>>>>>> rdapply(params)
>>>>> $chr1
>>>>> [1] 3
>>>>>
>>>>> $chr2
>>>>> [1] 0
>>>>>
>>>>>
>>>>> Cheers
>>>>> Michael
>>>>>
>>>>> Am Apr 22, 2010 um 1:57 PM schrieb Yvan:
>>>>>
>>>>>> On 21/04/10 18:43, Michael Lawrence wrote:
>>>>>>>
>>>>>>> On Wed, Apr 21, 2010 at 6:07 AM, Yvan<yvan.strahm at uni.no
>>>>>>> <mailto:yvan.strahm at uni.no>>  wrote:
>>>>>>>
>>>>>>>     Hello List,
>>>>>>>
>>>>>>>     I am confused about  how to loop over a rangedData object.
>>>>>>>     I have this rangedData
>>>>>>>
>>>>>>>     RangedData with 61 rows and 1 value column across 3 spaces
>>>>>>>             space     ranges |     score
>>>>>>> <character> <IRanges>  |<numeric>
>>>>>>>     1   SCAFFOLD_1 [  8,   8] |  -0.09405
>>>>>>>
>>>>>>>     and the spaces are
>>>>>>>
>>>>>>>     "SCAFFOLD_1"   "SCAFFOLD_10"  "SCAFFOLD_100"
>>>>>>>
>>>>>>>     using aggregate it is possible to apply a function to one of 
>>>>>>> the
>>>>> space
>>>>>>>     aggregate(rd["SCAFFOLD_1"]$score, start =
>>>>>>>     1:(length(rd["SCAFFOLD_1"]$score)-w+1), width = w, FUN = sum)
>>>>>>>
>>>>>>>     but how can I apply the aggregate to all space without a for 
>>>>>>> loop ?
>>>>>>>
>>>>>>>
>>>>>>> It looks like you're attempting a running window sum of the score
>>>>>>> vector. There are more efficient ways of doing this besides
>>>>>>> aggregate(). If you convert the score into an Rle, you can use
>>>>> runsum().
>>>>>>> Anyway, to do this over each space individually, use lapply().
>>>>>>>
>>>>>>> This would come out to something like:
>>>>>>>
>>>>>>> values(rd)[,"smoothScore"]<- lapply(rd, function(x)
>>>>>>> runsum(Rle(x$score), w))
>>>>>>>
>>>>>>> Probably not exactly right, but it gets you in the right 
>>>>>>> direction...
>>>>>>>
>>>>>>> Michael
>>>>>>>
>>>>>>>
>>>>>> Hello Michael,
>>>>>>
>>>>>> Thanks for the answer and the tip about runsum!
>>>>>> I try with lapply but could not get it working right, the main 
>>>>>> problem
>>>>>> is that the runsum is calculated on all values and not for a each
>>>>>> specific spaces.
>>>>>> Sorry, I should have been more precise in the problem description.
>>>>>> The runsum should be calculated in a space specific manner, let 
>>>>>> say w=2
>>>>>>
>>>>>> space        score    cumsum
>>>>>> 1 space1    1            3
>>>>>> 2 space1    2            4
>>>>>> 3 space1    2            NA
>>>>>> 4 space2    10          21
>>>>>> 5 space2    11          22
>>>>>> 6 space2    11          NA
>>>>>>
>>>>>> Is it possible to do it with lapply?
>>>>>> Thanks again for your help
>>>>>> cheers,
>>>>>> yvan
>>>>>>
>>>>>>        [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at stat.math.ethz.ch
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>>
>>     [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>



More information about the Bioconductor mailing list