[BioC] subsetting SummarizedExperiments - a proposal (or a hack?)

Thu Feb 6 06:24:27 CET 2014

Martin,

Excellent!   I have been warned.  Dragons lurking.

Suppose I replied...

... wait for it....

"Enter defmacro - stage right"

                                                 (greek chorus chanting "oh, no, harken the days of yore, LISP resounds")

Erhm... it _is_ a little late.

But, seriously.  Thanks for the subset clues on SO!  

And, seriously, I _may_ try a defmacro rewrite.   Unless...?

Steve, I'm just getting going with SEs and expect them to be on my road map for a while.

https://gist.github.com/malcook/8836871 holds a better formatted version of my proposed subset.SummarizedExperiments - sorry for the earlier copy/paste madness.  Martin's subset admonitions notwithstanding, I _will_ include a copy of my cast(Wide|Long) functions for you enjoyment.  Or my ridicule.  Time will tell.  Always learning....

Martin, I am coming to agree with your remark "Ranges required? Not really, but a bit of a hack" appearing in
http://www.bioconductor.org/help/course-materials/2012/BiocEurope2012/SummarizedExperiment.pdf
Indeed, lets imagine that rowData were _not_ a GRanges... rather, if the dimensional meta-data were of uniform mode (I'm thinking data.table), then symmetry would easily allow for recursion on the number of dimensions.  Instead of `colData(se)` and `rowData(se)` we would have `dimData(se)[[i]]` with dimnames<-function(x) {lapply(dimData(x),names)}.

Hmmm.   I'm liking this.  Are you?

What would lack is a means to add some sort of useful genomic index to a data.table.  Something like BLASTgres'  BioIndexing on genomic `location` (a custom data type) defined on any set of columns from a table - http://www.cs.ucla.edu/~stott/bioinf/BLASTgres.pdf  - In postgres the implementation depends upon user defined database types and indexes, and indexing schemes.   Do you see profit here?  Or is this the stuff of dreams?

G'night,

~ malcolm_cook at stowers.org

________________________________________
From: Martin Morgan [mtmorgan at fhcrc.org]
Sent: Wednesday, February 05, 2014 7:54 PM
To: Cook, Malcolm
Cc: bioconductor at r-project.org
Subject: Re: subsetting SummarizedExperiments - a proposal (or a hack?)

Hi Malcom --

On 02/05/2014 02:11 PM, Cook, Malcolm wrote:
> Martin,
>
> Google says: 'Your search - "subset.SummarizedExperiment" - did not match any documents.'

If it had existed, it would have been an S4 method revealed by

  library(GenomicAlignments)   # library(GenomicRanges) in release
  showMethods("subset")
  method?"subset,SummarizedExperiment"
  ?"subset,SummarizedExperiment-method"

with tab completion available on the last two after getting through "subset". Google would have found "subset,SummarizedExperiment-method" (if it existed) but Google and tab completion with ? would have failed (though the latter in perhaps a suggestive way by showing alternate tab completions) when a relevant method is defined on a contained class. But yes, a method does not exist.

>
> So I offer the below for your consideration, along with a few examples and a quick benchmark comparison with using the more explicit `[` extraction operator.
>
>
>
> subset.SummarizedExperiment<-function(x
>                                        ,rowSubset=TRUE
>                                        ,colSubset=TRUE
>                                        ,assaySubset=TRUE
>                                        ,drop=FALSE
>                                        ,provenanceTrack=FALSE
>                                        ,...) {
>      ## PURPOSE: implement subsetting of SummarizedExperiments,
>      ## allowing expressions in terms of:
>      ##
>      ##  + the GRanges (and its mcols meta-data) held in rowData
>      ##  + the experimental meta-data held in the colData DataFrame
>      ##  + the names of the assays
>      x<-x[
>           ##eval(as.expression(substitute(row)),mcols(rowData(x)),.GlobalEnv)
>           eval(as.expression(substitute(rowSubset)),as.data.frame(rowData(x)),.GlobalEnv)
>           ,eval(as.expression(substitute(colSubset)),colData(x),.GlobalEnv)
>           ,drop=drop,...]
>      if (! identical(TRUE,assaySubset)) assays(x)<-assays(x)[assaySubset]
>      if(provenanceTrack) {
>          exptData(x)$rowSubset<-c(exptData(x)$rowSubset,as.character(substitute(rowSubset)))
>          exptData(x)$colSubset<-c(exptData(x)$colSubset,as.character(substitute(colSubset)))
>      }
>      x
> }
> attr(subset,'ex')<-function() {
>      example(SummarizedExperiment)
>      assays(se1)$a2<-assays(se1)$counts*2
>      assays(se1)$a3<-assays(se1)$counts*3
>      benchmark(replications=100
>                ,se1.ss1<-se1[start(rowData(se1))==344757,se1$Treatment=='ChIP']
>                ,se1.ss2<-subset(se1,start==344757,Treatment=='ChIP')
>                )
>      stopifnot(identical(assays(se1.ss1),assays(se1.ss2)))
>      se1.ss3<-subset(se1,strand=='+',Treatment=='ChIP',c('a2','a3'))
>      stopifnot(identical(c('a2','a3'),names(assays(se1.ss3))))
> }
>
>
> The rbenchmarks shows the cost of overhead in this contrived minimal example.  YMMV:
>
>                                                                      test replications elapsed relative user.self sys.self user.child sys.child
>   3                                                          c("a2", "a3")          100   0.001        1     0.001    0.000          0         0
>   1 se1.ss1 <- se1[start(rowData(se1)) == 344757, se1$Treatment == "ChIP"]          100   3.113     3113     3.111    0.001          0         0
>   2           se1.ss2 <- subset(se1, start == 344757, Treatment == "ChIP")          100   3.486     3486     3.486    0.000          0         0
>
> Is all this in the spirit of things as you see it?   It sure makes my use of SE "scan" (erhm, like poetry;)

My own problem with subset() is that the context of the call is hard to get correct. So your subset with this

f <- function(x) {
    id <- 697568
    subset(x, start==id)
}

fails

> f(se1)
Error in x[eval(as.expression(substitute(rowSubset)), as.data.frame(rowData(x)), :
  error in evaluating the argument 'i' in selecting a method for function '[': Error in eval(expr, envir, enclos) : object 'id' not found

and this fails (does not select the value of id set in f()) silently

> id <- 387456
> start(f(se1))
[1] 387456

Other R idioms are similarly dangerous even for data.frame using base::subset.data.frame as illustrated here

  http://stackoverflow.com/questions/9860090/in-r-why-is-better-than-subset

in the high-voted answer.

But maybe I shouldn't stand between users, guns, and feet?

A couple of small SummarizedExperiment things in your code ...

   start(se1) == start(rowData(se1))

and likewise for other functions in the GRanges 'API'. Also it turns out that it is expensive in the (current) implementation of SummarizedExperiment to associated dimnames with returned assay() or assays(), so it is much faster to

    assays(x) <- assays(x, withDimnames=FALSE)[assaySubset]

(dimnames get stored on the rowData and colData rather than redundantly on (each of the) assays, although this is probably a false (space) economy). The combination of GRanges API and expensive assay() / assays() can help save one from unintended inefficiencies, e.g., dimnames(se1) rather than dimnames(assay(se1)).

Martin

>
> if you like, I also have castLong and castWide functions  to reshape a (subsetted) SE as a data.table for use in ggplot and friends.
>
> You like?
>
> Cheers,
>
> ~Malcolm
>

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793