[BioC] questions about Overlap encodings

Thu May 10 01:54:42 CEST 2012

Hi Henry,

On 05/08/2012 09:18 AM, Henry Paik wrote:
> Hi List and Hervé,
>
> First of all, thank you very much for all your hard work on Bioconductor.
>
> I have some questions about OverlapEncodings

You are more than welcome to have a look at the "overlap encodings"
tools that I've started to introduce in BioC 2.10 (in IRanges 1.14
and GenomicRanges 1.8). However please note that what is in BioC 2.10
is incomplete and very rough around the edges. I'm still actively
working on the "overlap encodings" stuff but the improvements are
only making it to the devel versions of IRanges/GenomicRanges.
If you are still interested in trying those tools, please consider
using BioC 2.11 (see ?useDevel in the BiocInstaller package for how
to do this). In particular please make sure that you always use the
latest version of the GenomicRanges package (currently 1.9.14),
and expect frequent updates.

>
> 1. OverlapEncodings object looks like below.
>
>  > ovenc1
> OverlapEncodings of length 2
> Loffset Roffset encoding
> [1] 4 0 2:jm:af:
> [2] 4 0 2:jm:af:
>
> Under encoding, 2 is the number of exons in this junction. What do jm
> and af (or i, am, gm,...) mean?

Encodings are a little bit tricky to interpret. I've tried to do this in
the OverlapEncodings man page (in IRanges) but this is the typical
situation where a drawing will do better than a long explanation.

Encoding 2:jm:af: is pretty common and is explained (with a drawing) in
the "Overlap encodings" vignette (in the GenomicRanges package, I highly
recommend you have a look at this vignette). It corresponds to this
situation:

   - read (1 gap):             ooooo---ooo
   - transcript:     ...   >>>>>>>>>   >>>>>>>>>   ...

As you can see, the gap in the read is "compatible" with the splicing of
the transcript. The isCompatibleWithSplicing() utility can be used to
detect compatible overlaps (see the vignette for the details).

So even though here the 2 at the beginning of the encoding can be seen
as the number of exons in the junction, more generally speaking this
number is just the number of "chunks" of the aligned read i.e. the
number of chunks that the aligner broke the read into in order to align
it. In other words, this number is always the number of gaps + 1. Note
that it's also the number of 1-letter codes between 2 consecutive
colons (:) in the encoding.

>
> 2. How could I know which exons are involved to a junction? or how could
> I know the exon ranks from OverlapEncodings?
> extractSpannedExonRanks function is not implemented?

extractSpannedExonRanks() is implemented in GenomicRanges 1.9 (and its
use is covered in the vignette found in this package). Note that for
overlaps that are "compatible" with the splicing of the transcript, yes
it does return the ranks of the exons involved in the overlap. Also note
that a "compatible" overlap can span more than 1 junction i.e. more
than 2 exons, e.g. the 3:jmm:agm:aaf: encoding:

   - read (2 gaps):              oo---ooooo---o
   - transcript:     ...   >>>>>>>>   >>>>>   >>>>>>>   ...

In that case "compatible" means that all the gaps in the alignment are
compatible with consecutive junctions in the transcript. See the
vignette for more details.

>
> When I did
> ?extractSpannedExonRanks
> I got
> No documentation for ‘extractSpannedExonRanks’

That should work with GenomicRanges 1.9. But the man page doesn't
contain much yet, sorry. The vignette has a lot more material and is
my first priority at the moment. I want to cover and describe some
typical use cases in the vignette first and add/adjust whatever
low-level tools are needed for that in the GenomicRanges/IRanges
infrastructure. The exact set of low-level tools is still subject
to frequent changes (at least potentially) so I'm waiting things
to stabilize before I write detailed man pages for them.

>
> 3. How could I know the number of unique reads on unique junctions?

It all depends what having "a read on a junction" means exactly.

If having "a read on a junction" means that the aligned read has a
gap (i.e. N in the cigar) that matches exactly a splice junction,
then note you are not looking at the entire overlap between the read
and the transcript and you will count a hit even if the full read is
not "compatible" with the exon structure of the transcript. If that's
what you want to do, then you don't really need overlap encodings.

One way to do this:

   (1) Extract the unique junctions from 'tx' (the GRangesList object
       containing the transcripts i.e. the 'subject' you passed to
       findOverlaps):

         tx_ranges <- range(tx)
         table(elementLengths(tx_ranges))  # sanity check (a single entry)
         tx_junctions <- psetdiff(unlist(tx_ranges, use.names=FALSE), tx)

         stopifnot(identical(elementLengths(tx_junctions) + 1L,
                             elementLengths(tx)))

       At this point, the junctions are still grouped by transcript so
       junctions shared by several transcripts are repeated. If you
       really want unique junctions:

         unq_tx_junctions <- unique(unlist(tx_junctions, use.names=FALSE))

   (2) Extract the unique gaps from the 'reads' (the GRangesList object
       containing the reads i.e. the 'query' you passed to findOverlaps).
       Make sure the names on 'reads' are set to the "Query template
       names" (this should be the case if you made it with 'grglist(gal,
       order.as.in.query=TRUE)' and if 'gal' itself was made with
       'readGappedAlignments( ... , use.names=TRUE)'):

         reads_ranges <- range(reads)
         table(elementLengths(reads_ranges))  # sanity check (a single 
entry)
         reads_gaps <- psetdiff(unlist(reads_ranges, use.names=FALSE), 
reads)

         stopifnot(identical(elementLengths(reads_gaps) + 1L,
                             elementLengths(reads)))

         all_reads_gaps <- unlist(reads_gaps, use.names=FALSE)
         names(all_reads_gaps) <- rep.int(names(reads_gaps), 
elementLengths(reads_gaps))

         ## Gaps with the same name belong to the same read. If 2
         ## elements in 'all_reads_gaps' have the same name (i.e.
         ## belong to the same read) and represent the same genomic
         ## range, then they are duplicates. Let's get rid of those
         ## duplicates:
         is_dup <- duplicated(
                     data.frame(
                       a=names(all_reads_gaps),
                       b=as.factor(seqnames(all_reads_gaps)),
                       c=as.factor(strand(all_reads_gaps)),
                       d=start(all_reads_gaps),
                       e=width(all_reads_gaps)))

         all_reads_gaps <- all_reads_gaps[!is_dup]

     (3) Count the number of unique reads per unique junction:

         countOverlaps(unq_tx_junctions, all_reads_gaps, type="equal")

But if you also want to take into account "compatibility" between the
read and the transcript for doing this kind of counting, then counting
"the number of unique reads on *unique* junctions" becomes a more
complicated problem. This is because you need to keep each junction in
the context of its transcript in order to tell whether there is
compatibility or not and at the same time you loose that context when
you tabulate on "unique junctions". Something I could work on in the
vignette though...

> I can filter to get the unique match for each read before bioC analysis.
> But to quantify the reads on the junctions, I think I need the count of
> the unique read_id on unique junction?
>
> I am learning all these from "Overlap encodings" vignette. Are there any
> other documentation to learn more about this?

That's the only place to look at for the moment. Again, make sure you
look at the vignette that is in the *devel* version of the GenomicRanges
package. Please don't hesitate to send me your feedback on this.

Thanks,
H.

>
> Thanks much!
>
> - Henry
>
>  > sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] Rsamtools_1.8.4 Biostrings_2.24.1 GenomicFeatures_1.8.1
> [4] AnnotationDbi_1.18.0 Biobase_2.16.0 GenomicRanges_1.8.5
> [7] IRanges_1.14.2 BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.12.0 bitops_1.0-4.1 BSgenome_1.24.0 compiler_2.15.0
> [5] DBI_0.2-5 RCurl_1.91-1 RSQLite_0.11.1 rtracklayer_1.16.1
> [9] stats4_2.15.0 tools_2.15.0 XML_3.9-4 zlibbioc_1.2.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319