[BioC] some random comments on GRanges

Mon Sep 13 23:11:29 CEST 2010

Hi Kasper -- Mostly in agreement with what you say, with some caveats
below...

On 09/13/2010 11:57 AM, Michael Lawrence wrote:
> On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
> 
>> What follows is some random comments on the use of GRanges for storing
>> data in the elementMetadata slot.  My use case is a lot of very small
>> ranges (>10M ranges of width 1), which may be a special border case
>> (but very important to me :).  I am particular interested in fast
>> subsetting like subsetByOverlaps.
>>
>> Comments:
>>
>> * I find column subsetting on the elementMetadata to be really slow,
>> slower than it need to be I would say (I suspect some validity
>> checking, see comment below on row subsetting).
>> * There is no easy shortcut to get a specific metadata column (gr[,
>> "score"] returns a GRange).  I would find it natural that gr$score
>> would do that.  This is important in order to get at the data stores
>> in the GRange

To be sure we're on the same page, values(gr)$score get's what you want,
right?

In an earlier iteration on this theme Michael mentioned eSet. eSet
didn't strongly shape the discussion of GRanges, but since both pheno-
and feature-Data can have annotations I would rather that it had become
phenoData(eset)$x, featureData(eset)$y for symmetry and consistent
interface. Not sure that this is really relevant to GRanges.

>
>> * while there is a as.data.frame, as(from, "data.frame") does not work
>> and I would also appreciate an as(from, "GRanges") method with from
>> being a data.frame.  I have a quick function at the bottom of my
>> email, for doing this.

Not so sure about this though, for reasons Michael mentions.

>> * Since start, end, width are reserved column names for GRanges, it
>> would be nice if the GRanges constructor was extended to allow
>>  GRanges(seqnames = "chr1", start = 1, end = 10)
>> instead of now where we need
>>  GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10))
>> (ok, this is pretty minor, but it is convenient)

Not too keen on this, either, as it (or supporting both, anyway) makes
for a complicated interface; glad it's a pretty minor issue ;)

Martin

>> * There is no sort method implemented for GRanges; it returns an error
>>
>> * is.unsorted(xx) seems to always return FALSE with a warning when xx
>> is an IRange or a GRange
>> *It seems to me that row subsetting, specifically subsetByOverlaps is
>> a bit too slow.  The find overlaps part is blazingly fast (when ranges
>> is a single IRange); it seems like the initialization and subsetting
>> part takes "more time than necessary".  This is based on comparing a
>> (long) GRanges with around 8 elementMetadata columns to another
>> approach where instead of using elementMetadata I use two elements of
>> a list: a GRanges without elementMetadata and where the
>> elementMetadata is stored in a separate data.frame - I then use
>> findOverlaps to get me the indices (when I compare using a GRanges to
>> using a GRanges and separate matrix, I use the "position" GRanges to
>> get my indexes, which I use to subset the matrix with).
>>
>> Specifically here gr is a GRanges with length 25M (no elementMetadata)
>> and mat is a 25M x 8 matrix
>>
>> grIn <- gr
>> elementMetadata(grIn) <- mat
>>
>> ## The "classic" approach:
>> system.time({
>> grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges =
>> IRanges(start = 1, end = 10^7)))
>> })  # 4.7 secs
>>
>> ## Keeping gr and mat separate:
>> system.time({
>>    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
>> ranges = IRanges(start = 1, end = 10^7))))
>>    grOut2 <- list(gr[qh],  mat = mat[qh,])
>> }) #  0.9 secs
>>
>> ## Doing the above, but forming a GRange after the subsetting adds
>> just a small overhead
>> system.time({
>>    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
>> ranges = IRanges(start = 1, end = 10^7))))
>>    grOut3 = gr[qh]
>>    elementMetadata(grOut3) <- mat[qh,]
>> }) # 1.1 secs
>>
>>
>> * For really large query GRanges timing results (not shown) seems to
>> indicate that the line
>>    query[!is.na(findOverlaps(query, subject, maxgap = maxgap,
>>        minoverlap = minoverlap, type = type, select = "first"))]
>> in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be
>> made faster by changing it to
>>    query[queryHits(findOverlaps(query, subject, maxgap = maxgap,
>>        minoverlap = minoverlap, type = type))]
>> (my timing is for about a query length of 30M and a subject length of 1)
>> * I find it counter-intuitive that the following code doesn't work:
>> R> example(GRanges)
>> R> seqnames(gr) = "chr1"
>> or perhaps more intuitive
>> R> gr1 = gr[seqnames(gr) == "Chrom1"]
>> R> seqnames(gr1) = "chr1"
>> In general I don't like that seqnames is a factor.  This is fine when
>> I just consider a single GRange, but when I put together multiple
>> GRanges from different sources, I sometimes get into problems.
>> Specifically, in human, it is not always true that the various contigs
>> (like "chr10_random") are part of the data sources.  I would have
>> preferred it to be a character and then have seqlengths essentially be
>> NA in case a character does not appear in the names(seqlengths).  I
>> know that I have complained about this a long time ago and it is
>> probably too late to change, but still :)
>>
>>
> Just saw this.
> 
> Conceptually, a factor seems appropriate for sequence names. It does cause
> major headaches though. For example pintersect() requires the sequences
> names to be setequal(). But trying to fix the levels in seqnames is
> difficult, due to the validity constraint that the seqlengths need the same
> entries as seqnames levels. If there was someway to fix that without having
> to hack things using low-level slot accessors, it would be great.
> 
> Michael
> 
> 
> 
>>
>> Kasper
>>
>> data.frame2GRanges <- function(object, keepColumns = TRUE) {
>>    stopifnot(class(object) == "data.frame")
>>    stopifnot(all(c("chr", "start", "end") %in% names(object)))
>>    if("strand" %in% names(object)) {
>>        if(is.numeric(object$strand)) {
>>            strand <- ifelse(object$strand == 1, "+", "*")
>>            strand[object$strand == -1] <- "-"
>>            object$strand <- strand
>>        }
>>        gr <- GRanges(seqnames = object$chr,
>>                      ranges = IRanges(start = object$start, end =
>> object$end),
>>                      strand = object$strand)
>>    } else {
>>        gr <- GRanges(seqnames = object$chr,
>>                      ranges = IRanges(start = object$start, end =
>> object$end))
>>    }
>>    if(keepColumns) {
>>        dt <- as(object[, setdiff(names(object), c("chr", "start",
>> "end", "strand"))],
>>                     "DataFrame")
>>        elementMetadata(gr) <- dt
>>    }
>>    names(gr) <- rownames(object)
>>    gr
>> }
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 
> 	[[alternative HTML version deleted]]
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor