[BioC] some random comments on GRanges

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Mon Sep 13 15:16:48 CEST 2010


What follows is some random comments on the use of GRanges for storing
data in the elementMetadata slot.  My use case is a lot of very small
ranges (>10M ranges of width 1), which may be a special border case
(but very important to me :).  I am particular interested in fast
subsetting like subsetByOverlaps.

Comments:

* I find column subsetting on the elementMetadata to be really slow,
slower than it need to be I would say (I suspect some validity
checking, see comment below on row subsetting).

* There is no easy shortcut to get a specific metadata column (gr[,
"score"] returns a GRange).  I would find it natural that gr$score
would do that.  This is important in order to get at the data stores
in the GRange

* while there is a as.data.frame, as(from, "data.frame") does not work
and I would also appreciate an as(from, "GRanges") method with from
being a data.frame.  I have a quick function at the bottom of my
email, for doing this.

* Since start, end, width are reserved column names for GRanges, it
would be nice if the GRanges constructor was extended to allow
  GRanges(seqnames = "chr1", start = 1, end = 10)
instead of now where we need
  GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10))
(ok, this is pretty minor, but it is convenient)

* There is no sort method implemented for GRanges; it returns an error

* is.unsorted(xx) seems to always return FALSE with a warning when xx
is an IRange or a GRange

*It seems to me that row subsetting, specifically subsetByOverlaps is
a bit too slow.  The find overlaps part is blazingly fast (when ranges
is a single IRange); it seems like the initialization and subsetting
part takes "more time than necessary".  This is based on comparing a
(long) GRanges with around 8 elementMetadata columns to another
approach where instead of using elementMetadata I use two elements of
a list: a GRanges without elementMetadata and where the
elementMetadata is stored in a separate data.frame - I then use
findOverlaps to get me the indices (when I compare using a GRanges to
using a GRanges and separate matrix, I use the "position" GRanges to
get my indexes, which I use to subset the matrix with).

Specifically here gr is a GRanges with length 25M (no elementMetadata)
and mat is a 25M x 8 matrix

grIn <- gr
elementMetadata(grIn) <- mat

## The "classic" approach:
system.time({
grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges =
IRanges(start = 1, end = 10^7)))
})  # 4.7 secs

## Keeping gr and mat separate:
system.time({
    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
ranges = IRanges(start = 1, end = 10^7))))
    grOut2 <- list(gr[qh],  mat = mat[qh,])
}) #  0.9 secs

## Doing the above, but forming a GRange after the subsetting adds
just a small overhead
system.time({
    qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
ranges = IRanges(start = 1, end = 10^7))))
    grOut3 = gr[qh]
    elementMetadata(grOut3) <- mat[qh,]
}) # 1.1 secs


* For really large query GRanges timing results (not shown) seems to
indicate that the line
    query[!is.na(findOverlaps(query, subject, maxgap = maxgap,
        minoverlap = minoverlap, type = type, select = "first"))]
in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be
made faster by changing it to
    query[queryHits(findOverlaps(query, subject, maxgap = maxgap,
        minoverlap = minoverlap, type = type))]
(my timing is for about a query length of 30M and a subject length of 1)

* I find it counter-intuitive that the following code doesn't work:
R> example(GRanges)
R> seqnames(gr) = "chr1"
or perhaps more intuitive
R> gr1 = gr[seqnames(gr) == "Chrom1"]
R> seqnames(gr1) = "chr1"
In general I don't like that seqnames is a factor.  This is fine when
I just consider a single GRange, but when I put together multiple
GRanges from different sources, I sometimes get into problems.
Specifically, in human, it is not always true that the various contigs
(like "chr10_random") are part of the data sources.  I would have
preferred it to be a character and then have seqlengths essentially be
NA in case a character does not appear in the names(seqlengths).  I
know that I have complained about this a long time ago and it is
probably too late to change, but still :)


Kasper

data.frame2GRanges <- function(object, keepColumns = TRUE) {
    stopifnot(class(object) == "data.frame")
    stopifnot(all(c("chr", "start", "end") %in% names(object)))
    if("strand" %in% names(object)) {
        if(is.numeric(object$strand)) {
            strand <- ifelse(object$strand == 1, "+", "*")
            strand[object$strand == -1] <- "-"
            object$strand <- strand
        }
        gr <- GRanges(seqnames = object$chr,
                      ranges = IRanges(start = object$start, end = object$end),
                      strand = object$strand)
    } else {
        gr <- GRanges(seqnames = object$chr,
                      ranges = IRanges(start = object$start, end = object$end))
    }
    if(keepColumns) {
        dt <- as(object[, setdiff(names(object), c("chr", "start",
"end", "strand"))],
                     "DataFrame")
        elementMetadata(gr) <- dt
    }
    names(gr) <- rownames(object)
    gr
}



More information about the Bioconductor mailing list