[BioC] function precede() not working with GRanges

Martin Morgan mtmorgan at fhcrc.org
Fri Jan 14 03:25:47 CET 2011


Thanks all for the input on this. We'll add a flag to the relevant
functions. It'll happen before the next release. And Kasper if you want
to initiate

> Furthermore, someone (and I am happy to do so), should explicitly
> write down how strand comparisons ought to be interpreted in a section
> in the vignette so we all have it for future reference.  This includes

that would be a really valuable addition

Martin

On 01/13/2011 05:30 AM, Kasper Daniel Hansen wrote:
> I seem to be a bit late in joining the discussion (been away), but I
> want to lend my support to Michael's suggestion (which I have been
> advocating in older emails).
> 
> If we differentiate between the 'true' signal and what comes out of a
> specific assay, it is clear that true signals can be stranded (say, a
> gene) or unstranded (nucleosome positions, histone modifications) and
> it is also possible to have an unstranded assay for a stranded true
> signal (the standard Illumina RNA-Seq prep).  On top of this, mapped
> reads are always stranded (at least, I don't know of an unstranded
> mapper) which may or may represent the signal.  So doing comparisons
> between stranded and unstranded data/annotation is a fundamental
> issue.
> 
> Now, it has been suggested in this thread and in the past that one
> could just change the strand of the objects we compare (and then do
> the comparison).  However, that is a destructive modification and will
> almost alway mean that we will have two objects lying around (the
> original one and the strand-modified one).  I don't like that both
> from a memory and a clutter perspective.  For example, while I could
> set my RNA-Seq reads to all have strand *, this precludes me from
> obtaining the original nucleotide sequence for the read if all I have
> is chr, start, end and no strand.  So now I need two sets of RNA-Seq
> reads....  Furthermore, I often have quite a few objects lying around
> representing different kinds of signals and data and some of these are
> strand specific, others are not.  So this really becomes an issue when
> you deal with an integrative analysis.
> 
> While it introduces an additional argument to many functions dealing
> with GRanges, I find that solution to be better, from a user
> perspective (it is clearly more work for a developer).
> 
> Furthermore, someone (and I am happy to do so), should explicitly
> write down how strand comparisons ought to be interpreted in a section
> in the vignette so we all have it for future reference.  This includes
> how to treat objects that have a mix of *, + and -.  An explicit
> reference will help us all make sure that all functions use the same
> underlying logic (comments in this thread seems to indicate that this
> is not the case).  Even if an additional ignore.strand argument is not
> implemented, it will be a nice section for future reference and bug
> hunting.  I believe it is important to be very precise about this.
> 
> In Genominator, we do have a ignoreStrand argument to all relevant
> functions and we also interpret a strand of * as "present on both
> strands" so that a 1-4,* overlaps 1-4,+.  (It is my understanding that
> right now, GenomicRanges essentially treat * as a separate strand).
> So my comments above about preferring an ignoreStrand argument from a
> user perspective is based on actual experience.
> 
> Kasper
> 
> 2011/1/12 Michael Lawrence <lawrence.michael at gene.com>:
>> 2011/1/12 Hervé Pagès <hpages at fhcrc.org>
>>
>>> Hi Michael,
>>>
>>> On 01/12/2011 05:45 AM, Michael Lawrence wrote:
>>>
>>>>
>>>>
>>>> 2011/1/12 Hervé Pagès <hpages at fhcrc.org <mailto:hpages at fhcrc.org>>
>>>>
>>>>
>>>>    Hi Steve,
>>>>
>>>>
>>>>    On 01/11/2011 07:20 PM, Steve Lianoglou wrote:
>>>>
>>>>        Hi,
>>>>
>>>>        I'm not Michael, but:
>>>>
>>>>            How general is the "peaks do not have a direction" statement?
>>>>            In my experience (RNA-seq experiments), peaks are "stranded
>>>>            features"
>>>>            i.e. they belong to a particular strand.
>>>>
>>>>
>>>>        I'm actually a bit surprised by that. It was my impression that
>>>> most
>>>>        of the rna-seq data to date have been done with a protocol that
>>>>        doesn't honor the strand of the reads.
>>>>
>>>>
>>>>    Just to clarify, I was not talking about protocols that honor or not
>>>>    the original strand of the reads. I was talking about the strand to
>>>>    which each read aligns. This information is stored in the SAM/BAM file
>>>>    and propagates to the GRanges object you get when loading this file
>>>>    (with something like 'as(read.GappedAlignments(myfile), "GRanges")'.
>>>>    The strand information is here whatever type of *-seq experiment it
>>>>    is (RNA-seq, CHIP-seq, ...)
>>>>
>>>>
>>>> Right, but if you're getting reads from both strands of a transcript,
>>>> you certainly don't want to ignore the reads coming from the strand
>>>> opposite transcription.
>>>>
>>>
>>> Point taken. I somehow overlooked the fact that with a non-stranded
>>> RNA-seq protocol, whatever strand a transcript is coming from, it will
>>> generate reads that align to both strands. Thanks for the reminder!
>>>
>>>
>>>  Steve is right about stranded protocols becoming
>>>> more popular, but in many cases their advantages might not outweigh the
>>>> increased cost/complexity. Time will tell, I guess. Anyway, let's not
>>>> make GenomicRanges the RNA-seq package. Suffice it to say that we deal
>>>> RNA-seq, ChIP-seq, other DNA-seqs and up till now we always need to
>>>> remove strand before looking for overlaps.
>>>>
>>>
>>> Why only before looking for overlaps? I mean, if the strand is
>>> meaningless, why not just remove the strand from the very beginning
>>> i.e. when you make your GRanges object from your SAM/BAM file (like
>>> Steve does)? So you only need to do strand(gr) <- "*" once. Isn't
>>> that easier/safer than having to remember to set an additional
>>> 'ignore.strand' parameter to FALSE every time you need to call a
>>> strand-aware function like findOverlaps?
>>>
>>>
>> For RNA-seq, that's reasonable. There are cases where the strand of the
>> reads is important. Fragment length estimation algorithms, mostly designed
>> for enrichment experiments, use the strand (see chipseq
>> estimate.mean.fraglen). For ChIP-seq, we detect peaks before doing any
>> overlaps, but for MEDIP-seq, we deal with reads. Anyway, the point is that
>> there are a variety of use cases. I'll admit that an extra parameter is not
>> entirely justified at this point. Just brought this up for discussion.
>>
>> Thanks,
>> Michael
>>
>>
>>
>>> Adding an 'ignore.strand' argument to findOverlaps/countOverlaps (and
>>> probably to other strand-aware GRanges methods if we start to go that
>>> route) is not a big deal (just 2 lines of code in each method), but
>>> still, that means adding yet another argument (findOverlaps already
>>> has 9), and documenting it for all these methods. I'm just wondering
>>> if it's worth it...
>>>
>>> H.
>>>
>>>
>>>>
>>>>
>>>>        Recently, I've come across several protocols where strandedness is
>>>>        maintained and I'm sure that these will be increasingly the norm as
>>>>        time goes on.
>>>>
>>>>        I've been working on/with different variants of some (rna)
>>>>        tag-sequencing protocols, and these do typically keep strand
>>>>        info, but
>>>>        I think the "old" Illumina-driven transcriptome-wide rna-seq
>>>>        data that
>>>>        "most" people think of as RNA-seq is unstranded.
>>>>
>>>>        Still. I like that GenomicRanges can honor strandedness in
>>>>        overlap/etc. functions when the strand is set.
>>>>
>>>>                One thing that often confuses me with GenomicRanges is
>>>>                the dual meaning of
>>>>                strand. It seems to indicate both a direction and a
>>>>                coordinate. It makes
>>>>                sense to me to have resize() and precede() consider
>>>>                strand as a direction.
>>>>                But findOverlaps() treats strands as coordinates such
>>>>                that items on
>>>>                different strands do not overlap. This makes less sense.
>>>>
>>>>
>>>>            Why? IMO a positive read (i.e. a read that was aligned to
>>>>            the plus
>>>>            strand) shouldn't hit features (genes/transcripts/exons)
>>>>            that are
>>>>            located on the minus strand. If for whatever reason this is
>>>>            not what
>>>>            the user wants, then s/he can always set the strand of the
>>>>            query and
>>>>            subject to '*' but I wouldn't say this is the typical usecase.
>>>>
>>>>                Another good
>>>>                example is gaps() which will yield sequence-long ranges
>>>>                on the strands not
>>>>                represented in the original object. This is very often
>>>>                undesirable. But of
>>>>                course my perspective is limited.
>>>>
>>>>
>>>>            I agree that the behaviour of gaps() on GRanges is awkward.
>>>>            You not only
>>>>            get those sequence-long ranges on the strands not
>>>>            represented in the
>>>>            original object but you also get them for sequences not
>>>>            represented at
>>>>            all (as long as they are listed in levels(seqnames(x))). But
>>>>            if gaps()
>>>>            wasn't doing this, gaps() wouldn't be reversible anymore (i.e.
>>>>            'gaps(gaps(x))' wouldn't be indentical to 'x') which is kind
>>>>            of an
>>>>            important property for gaps(). In particular, gaps(x) would
>>>> give
>>>>            the same result whether 'x' as a sequence-long range on one
>>>>            strand
>>>>            or not (very unlikely to happen with real data though).
>>>>
>>>>            Maybe an extra arg to gaps() would help control this?
>>>>            Suggestions are
>>>>            welcome. I'll make the change. Thanks!
>>>>
>>>>
>>>>        I'd be in favor of adding a param to GRanges::gaps() that would
>>>>        allow
>>>>        me to explicitly turn off this "feature" ;-)
>>>>
>>>>
>>>>    OK, will see what I can do. Thanks for your feedback!
>>>>
>>>>    H.
>>>>
>>>>
>>>>        -steve
>>>>
>>>>
>>>>
>>>>    --
>>>>    Hervé Pagès
>>>>
>>>>    Program in Computational Biology
>>>>    Division of Public Health Sciences
>>>>    Fred Hutchinson Cancer Research Center
>>>>    1100 Fairview Ave. N, M2-B876
>>>>    P.O. Box 19024
>>>>    Seattle, WA 98109-1024
>>>>
>>>>    E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
>>>>
>>>>    Phone:  (206) 667-5791
>>>>    Fax:    (206) 667-1319
>>>>
>>>>
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M2-B876
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone:  (206) 667-5791
>>> Fax:    (206) 667-1319
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list