[BioC] some random comments on GRanges
Martin Morgan
mtmorgan at fhcrc.org
Mon Sep 13 23:11:29 CEST 2010
Hi Kasper -- Mostly in agreement with what you say, with some caveats
below...
On 09/13/2010 11:57 AM, Michael Lawrence wrote:
> On Mon, Sep 13, 2010 at 6:16 AM, Kasper Daniel Hansen <
> kasperdanielhansen at gmail.com> wrote:
>
>> What follows is some random comments on the use of GRanges for storing
>> data in the elementMetadata slot. My use case is a lot of very small
>> ranges (>10M ranges of width 1), which may be a special border case
>> (but very important to me :). I am particular interested in fast
>> subsetting like subsetByOverlaps.
>>
>> Comments:
>>
>> * I find column subsetting on the elementMetadata to be really slow,
>> slower than it need to be I would say (I suspect some validity
>> checking, see comment below on row subsetting).
>> * There is no easy shortcut to get a specific metadata column (gr[,
>> "score"] returns a GRange). I would find it natural that gr$score
>> would do that. This is important in order to get at the data stores
>> in the GRange
To be sure we're on the same page, values(gr)$score get's what you want,
right?
In an earlier iteration on this theme Michael mentioned eSet. eSet
didn't strongly shape the discussion of GRanges, but since both pheno-
and feature-Data can have annotations I would rather that it had become
phenoData(eset)$x, featureData(eset)$y for symmetry and consistent
interface. Not sure that this is really relevant to GRanges.
>
>> * while there is a as.data.frame, as(from, "data.frame") does not work
>> and I would also appreciate an as(from, "GRanges") method with from
>> being a data.frame. I have a quick function at the bottom of my
>> email, for doing this.
Not so sure about this though, for reasons Michael mentions.
>> * Since start, end, width are reserved column names for GRanges, it
>> would be nice if the GRanges constructor was extended to allow
>> GRanges(seqnames = "chr1", start = 1, end = 10)
>> instead of now where we need
>> GRanges(seqnames = "chr1", ranges = IRanges(start = 1, end = 10))
>> (ok, this is pretty minor, but it is convenient)
Not too keen on this, either, as it (or supporting both, anyway) makes
for a complicated interface; glad it's a pretty minor issue ;)
Martin
>> * There is no sort method implemented for GRanges; it returns an error
>>
>> * is.unsorted(xx) seems to always return FALSE with a warning when xx
>> is an IRange or a GRange
>> *It seems to me that row subsetting, specifically subsetByOverlaps is
>> a bit too slow. The find overlaps part is blazingly fast (when ranges
>> is a single IRange); it seems like the initialization and subsetting
>> part takes "more time than necessary". This is based on comparing a
>> (long) GRanges with around 8 elementMetadata columns to another
>> approach where instead of using elementMetadata I use two elements of
>> a list: a GRanges without elementMetadata and where the
>> elementMetadata is stored in a separate data.frame - I then use
>> findOverlaps to get me the indices (when I compare using a GRanges to
>> using a GRanges and separate matrix, I use the "position" GRanges to
>> get my indexes, which I use to subset the matrix with).
>>
>> Specifically here gr is a GRanges with length 25M (no elementMetadata)
>> and mat is a 25M x 8 matrix
>>
>> grIn <- gr
>> elementMetadata(grIn) <- mat
>>
>> ## The "classic" approach:
>> system.time({
>> grOut1 <- subsetByOverlaps(grIn, GRanges(seqnames = "chr1", ranges =
>> IRanges(start = 1, end = 10^7)))
>> }) # 4.7 secs
>>
>> ## Keeping gr and mat separate:
>> system.time({
>> qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
>> ranges = IRanges(start = 1, end = 10^7))))
>> grOut2 <- list(gr[qh], mat = mat[qh,])
>> }) # 0.9 secs
>>
>> ## Doing the above, but forming a GRange after the subsetting adds
>> just a small overhead
>> system.time({
>> qh <- queryHits(findOverlaps(grIn, GRanges(seqnames = "chr1",
>> ranges = IRanges(start = 1, end = 10^7))))
>> grOut3 = gr[qh]
>> elementMetadata(grOut3) <- mat[qh,]
>> }) # 1.1 secs
>>
>>
>> * For really large query GRanges timing results (not shown) seems to
>> indicate that the line
>> query[!is.na(findOverlaps(query, subject, maxgap = maxgap,
>> minoverlap = minoverlap, type = type, select = "first"))]
>> in subsetByOverlaps(query = "GRanges", subject = "GRanges") could be
>> made faster by changing it to
>> query[queryHits(findOverlaps(query, subject, maxgap = maxgap,
>> minoverlap = minoverlap, type = type))]
>> (my timing is for about a query length of 30M and a subject length of 1)
>> * I find it counter-intuitive that the following code doesn't work:
>> R> example(GRanges)
>> R> seqnames(gr) = "chr1"
>> or perhaps more intuitive
>> R> gr1 = gr[seqnames(gr) == "Chrom1"]
>> R> seqnames(gr1) = "chr1"
>> In general I don't like that seqnames is a factor. This is fine when
>> I just consider a single GRange, but when I put together multiple
>> GRanges from different sources, I sometimes get into problems.
>> Specifically, in human, it is not always true that the various contigs
>> (like "chr10_random") are part of the data sources. I would have
>> preferred it to be a character and then have seqlengths essentially be
>> NA in case a character does not appear in the names(seqlengths). I
>> know that I have complained about this a long time ago and it is
>> probably too late to change, but still :)
>>
>>
> Just saw this.
>
> Conceptually, a factor seems appropriate for sequence names. It does cause
> major headaches though. For example pintersect() requires the sequences
> names to be setequal(). But trying to fix the levels in seqnames is
> difficult, due to the validity constraint that the seqlengths need the same
> entries as seqnames levels. If there was someway to fix that without having
> to hack things using low-level slot accessors, it would be great.
>
> Michael
>
>
>
>>
>> Kasper
>>
>> data.frame2GRanges <- function(object, keepColumns = TRUE) {
>> stopifnot(class(object) == "data.frame")
>> stopifnot(all(c("chr", "start", "end") %in% names(object)))
>> if("strand" %in% names(object)) {
>> if(is.numeric(object$strand)) {
>> strand <- ifelse(object$strand == 1, "+", "*")
>> strand[object$strand == -1] <- "-"
>> object$strand <- strand
>> }
>> gr <- GRanges(seqnames = object$chr,
>> ranges = IRanges(start = object$start, end =
>> object$end),
>> strand = object$strand)
>> } else {
>> gr <- GRanges(seqnames = object$chr,
>> ranges = IRanges(start = object$start, end =
>> object$end))
>> }
>> if(keepColumns) {
>> dt <- as(object[, setdiff(names(object), c("chr", "start",
>> "end", "strand"))],
>> "DataFrame")
>> elementMetadata(gr) <- dt
>> }
>> names(gr) <- rownames(object)
>> gr
>> }
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list