[BioC] reducing RangedData

Patrick Aboyoun paboyoun at fhcrc.org
Fri Oct 23 00:14:49 CEST 2009


Robert,
To address your case of data extraction from the original RangedData 
object, I would recommend using "[" with i = <<RangesList>>, e.g. 
origRD[ranges(reducedRd),].

The situation becomes more involved when, for example, someone wants to 
sum up the "score" column values for all the ranges that overlapped. 
These data summary operations are open ended and don't lend themselves 
to simple interfaces, although we can provide targeted functions, or add 
arguments to the reduce function, that support specific data summary 
operations on ancillary columns.


Patrick


Robert Castelo wrote:
> hi Patrick,
>
> thanks a lot, the concise code snippet below is what i was looking for,
> i'd had written at least twice as much to do the same job :)
>
> regarding whether it would be nice to have a reduction operation with a
> signature that includes other stuff i think that for the particular case
> of the strand it is definitely useful to project (reduce in RangedData
> terminology) genomic coordinates on genomic space taking it into account
> because many of the functional elements in the genome are stranded and
> projecting doesn't imply always that one is not interested in the strand
> of the projected intervals.
>
> with respect to whether other information contained in additional
> columns should somehow be "reduced" i think it would suffice to enable
> the reduction of the "names" of a RangedData (taking, for instance,
> arbitrarily one name between two overlapping ranges) because those names
> would allow to go back to the original (un-reduced) data and pull
> whatever we're interested in (actually, including the strand).
>
> my two-cents about this, and thanks again for your quick help.
>
> robert.
>
> On Thu, 2009-10-22 at 00:58 -0700, Patrick Aboyoun wrote:
>   
>> Robert,
>> I wrote you some code that you can use below. I've seen other reduction 
>> operations where the user wanted to reduce a RangedData type object by 
>> more than one column (e.g. strand and score), so perhaps an appropriate 
>> signature for a reduce method for RangedData would be
>>
>> reduce(x, by, ...)
>>
>> If this method where written, the remaining question would be should it 
>> support reducing transformations of the remaining values columns or 
>> should non "by" columns be dropped from the output. I'm inclined towards 
>> the latter, since a reduce(x, by, valuesFUN) method wouldn't be much 
>> simpler than the rdapply method given below. Would a straight-forward 
>> reduce(x = rd, by = "strand") approach for RangedData meet your needs?
>>
>>
>>  > suppressMessages(library(IRanges))
>>  > sta <- c(1, 2, 5, 6, 2, 3, 4, 5)
>>  > end <- c(3, 4, 5, 7, 2, 4, 6, 7)
>>  > str <- rep(c("+", "+", "-", "-"), 2)
>>  > chr <- rep(c("chr1", "chr2"), each = 4)
>>  > rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>  > rd
>> RangedData: 8 ranges by 1 columns
>> columns(1): strand
>> sequences(2): chr1 chr2
>>  >
>>  > # Interesting bit starts here
>>  > reduceStranded <- function(x) {
>> +     strand <- x$strand
>> +     rngs <- ranges(x)[[1]]
>> +     ans <- IRangesList("+" = reduce(rngs[strand == "+"]),
>> +                        "-" = reduce(rngs[strand == "-"]))
>> +     RangedData(unlist(ans, use.names = FALSE),
>> +                strand = rep(c("+", "-"), elementLengths(ans)),
>>                  space = names(x))
>> + } 
>>  > params <- RDApplyParams(rd, reduceStranded, reducerFun = function(x) 
>> do.call(c, unname(x)))
>>  > rdapply(params)
>> RangedData: 4 ranges by 1 columns
>> columns(1): strand
>> sequences(2): chr1 chr2
>>  > as.data.frame(rdapply(params))
>>   space start end width strand
>> 1  chr1     1   4     4      +
>> 2  chr1     5   7     3      -
>> 3  chr2     2   4     3      +
>> 4  chr2     4   7     4      -
>>  > sessionInfo()
>> R version 2.9.2 Patched (2009-08-24 r49420)
>> i386-apple-darwin9.8.0
>>
>> locale:
>> C/en_US.UTF-8/C/C/C/C
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base    
>>
>> other attached packages:
>> [1] IRanges_1.2.3
>>
>>
>>
>>
>> Robert Castelo wrote:
>>     
>>> dear list,
>>>
>>> i'd like to project a set of exons on genomic space but keeping
>>> the strand information. let's assume for simplicity that no exons
>>> overlap in opposite strands so no conflicts need to be resolved
>>> regarding that case. in principle this is easy without keeping the
>>> strand using a RangeList and the reduce method. however i've been
>>> struggling for a while without success to figure out how can i
>>> "reduce" my coordinates without loosing the strand information.
>>>
>>> my guess is that to carry out the strand information i need to
>>> use a RangedData object:
>>>
>>> sta <- c(1, 1, 1)
>>> end <- c(3, 4, 5)
>>> str <- c("+", "+", "-")
>>> rd <- RangedData(IRanges(start=sta, end=end), strand=str, space=chr)
>>> rd
>>> RangedData: 3 ranges by 1 columns
>>> columns(1): strand
>>> sequences(2): chr1 chr2
>>>
>>> and then use rdapply to reduce on each of the chromosomes but i
>>> either don't know how to directly apply reduce:
>>>
>>> params <- RDApplyParams(rd, reduce)
>>> rdapply(params)
>>> Error in function (classes, fdef, mtable)  :
>>>   unable to find an inherited method for function "reduce", for 
>>> signature "RangedData"
>>>
>>>
>>> or i loose the strand information:
>>>
>>> params <- RDApplyParams(rd, function(rd) reduce(ranges(rd)))
>>> rdapply(params)
>>> $chr1
>>> RangesList: 1 range
>>> 1: 1:4
>>>
>>> $chr2
>>> RangesList: 1 range
>>> 1: 1:5
>>>
>>> thanks for your help!
>>> robert.
>>>
>>>
>>>
>>> sessionInfo()
>>> R version 2.9.1 (2009-06-26)
>>> i386-apple-darwin8.11.1
>>>
>>> locale:
>>> en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8
>>>
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>>
>>> other attached packages:
>>>  [1] rtracklayer_1.4.1                     RCurl_0.98-1
>>>  [3] bitops_1.0-4.1                        GOstats_2.10.0
>>>  [5] graph_1.22.2                          Category_2.10.1
>>>  [7] geneplotter_1.22.0                    lattice_0.17-25
>>>  [9] limma_2.18.2                          bgSpJncArrayDm1.db_1.0.1
>>> [11] org.Dm.eg.db_2.2.11                   RSQLite_0.7-1
>>> [13] DBI_0.2-4                             
>>> BSgenome.Dmelanogaster.UCSC.dm1_1.0.0
>>> [15] BSgenome_1.12.3                       RColorBrewer_1.0-2
>>> [17] Biostrings_2.12.8                     IRanges_1.2.3
>>> [19] annotate_1.22.0                       AnnotationDbi_1.6.1
>>> [21] Biobase_2.4.1                         xtable_1.5-5
>>>
>>> loaded via a namespace (and not attached):
>>> [1] genefilter_1.24.2 GO.db_2.2.11      grid_2.9.1        GSEABase_1.6.1
>>> [5] RBGL_1.20.0       splines_2.9.1     survival_2.35-4   tools_2.9.1
>>> [9] XML_2.5-3
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: 
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>       
>>     
>
>



More information about the Bioconductor mailing list