[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Tue Sep 28 18:28:24 CEST 2010

On 09/28/2010 08:09 AM, Paul Geeleher wrote:
> I guess my next question is if there is a convenient way of filtering
> out the part of the gene
> that overlaps another one using GenomicRanges?

The ingredients are the 'type' argument to countOverlaps and the ranged
based operations, nicely summarized on slide 14 and later of

http://bioconductor.org/help/course-materials/2009/SeattleNov09/IRanges/

Two examples (I would not call myself an IRanges expert) that do not
directly address your question but might be helpful anyway:

library(GenomicRanges)

##
## Example 1
##

## two 40mer reads, at 1070 and 2070
query <-
    GRanges("chr1", IRanges(c(1070, 2070) , width=40))

## three genes spanning 280 nt at 1000 (2 exons), 2000, 3000
subject <- GRangesList(
    GRanges("chr1", IRanges(c(1000, 1200), width=80)),
    GRanges("chr1", IRanges(2000, 2279)),
    GRanges("chr1", IRanges(3000, 3279)))

## simple count
o <- findOverlaps(query, subject)
(n0 <- tabulate(subjectHits(o), length(subject)))

## reads aligning to gaps don't count
gap <- endoapply(subject, function(x) gaps(x, min(start(x))))
o <- findOverlaps(query, gap)
(n1 <- n0 - tabulate(subjectHits(o), length(subject)))

##
## Example 2
##

## two 40mer reads, at 1070 and 2070
query <-
    GRanges("chr1", IRanges(c(1070, 2070) , width=40))

## two overlapping genes, spanning 280 nt at 1000 (2 exons), 1080
subject <- GRangesList(
    GRanges("chr1", IRanges(c(1000, 1200), width=80), "+"),
    GRanges("chr1", IRanges(1080, width=280), "-"))

## simple count -- read counted twice (probably not a good idea)
o <- findOverlaps(query, subject)
(n2 <- tabulate(subjectHits(o), length(subject)))

## simple count -- read divided (equally) between matches
o <- findOverlaps(query, subject)
nqhits <- tabulate(queryHits(o), length(query))
wt <- queryHits(o) / nqhits[queryHits(o)]
wtHits <- lapply(split(wt, subjectHits(o)), sum)
(n3 <- local({
    x <- numeric(length(subject))
    x[as.integer(names(wtHits))] <- as.numeric(wtHits)
    x
}))

Martin

> 
> It'd be a shame to have to rewrite all my code using Python or Genominator.
> 
> Paul.
> 
> On Mon, Sep 27, 2010 at 6:59 PM, Kasper Daniel Hansen
> <kasperdanielhansen at gmail.com> wrote:
>> You can also do all of this using Genominator, including dealing with
>> overlapping genes.  There should be ample documentation in the
>> vignettes, but you want to focus on
>>  summaryByAnnotation with the groupBy argument for aggregating from
>> exon to gene level
>>  makeGeneRepresentation in order to filter out the part of the gene
>> that overlaps another one
>>
>> Kasper
>>
>> On Mon, Sep 27, 2010 at 10:28 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>> Running it on a high end cluster, I estimated 41 hours based on how
>>> long it took to do the first 10 genes. It's not a majorly big deal as
>>> I can presumably run it in parallel for the different samples.
>>>
>>> P
>>>
>>> On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>> Hi Paul,
>>>>
>>>> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>>> Okay great, thanks Steve, this works nicely. It does seem to take a
>>>>> long (2 days!) time though but I guess this is to be expected! (I'm
>>>>> assuming this is why they used the simple version in the tutorial...
>>>>
>>>> Wow ... that's ... a long time.
>>>>
>>>> What type of machine are you running this on, by the by?
>>>>
>>>> You know, I suspect it's so slow due to the slow speed involved in
>>>> lapply-ing over GRangesList objects, which has been brought up before:
>>>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/30564
>>>>
>>>> If you do this sort of thing often, you might consider converting your
>>>> GRangesList into a "normal" list of GRanges (or an appropriate
>>>> IRangesList -- but you'd have to handle chromosome/strand info by
>>>> yourself). I'll leave that as an exercise for the reader :-)
>>>>
>>>> -steve
>>>>
>>>>
>>>>>
>>>>> Paul
>>>>>
>>>>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
>>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>>>>> Hi,
>>>>>>>
>>>>>>> I've been going through this RNA-seq use case
>>>>>>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>>>>>>> with some data I have and I'm wondering about section 2.4 where they
>>>>>>> calculate gene expression by counting the number of reads that alight
>>>>>>> to within the boundaries of a genes, then normalize these based on the
>>>>>>> length of the gene. Some of the code is as follows:
>>>>>>>
>>>>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>>>>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>>>>>>> levels(seqnames(alnRanges))]
>>>>>>> head(dmGeneBounds, 3)
>>>>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>>>>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>>>>>>>
>>>>>>> My question is, is this actually correct, could you publish using this
>>>>>>> method or is this just meant as a simple example?
>>>>>>>
>>>>>>> I'm interested in the ranks of the genes in the samples for a
>>>>>>> subsequent analysis, but I would have assumed that you'd have to count
>>>>>>> the number of reads that map to the EXONS of each gene and normalize
>>>>>>> by the length of the EXONS, rather then the gene itself?
>>>>>>>
>>>>>>> If this is the case I wonder if there a tutorial that shows how to do that...
>>>>>>
>>>>>> If you get a bit comfortable with the GenomicFeatures package, you can
>>>>>> do it pretty easily. Assume you have your reads stored in some GRanges
>>>>>> object:
>>>>>>
>>>>>> txdb <- loadFeatures('/path/to/your/txdb')
>>>>>> e <- exonsBy(txdb, 'gene')
>>>>>> normedCounts <- lapply(e, function(x) {
>>>>>>  countOverlaps(reads, x) / sum(width(x))
>>>>>> })
>>>>>>
>>>>>> normedCounts now has the number of reads falling in each exon per gene
>>>>>> divided by the total exon-length of the gene.
>>>>>>
>>>>>> You can run with that example to calculate a "proper" RPKM ...
>>>>>>
>>>>>> -steve
>>>>>>
>>>>>> --
>>>>>> Steve Lianoglou
>>>>>> Graduate Student: Computational Systems Biology
>>>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>>>  | Weill Medical College of Cornell University
>>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> Paul Geeleher
>>>>> School of Mathematics, Statistics and Applied Mathematics
>>>>> National University of Ireland
>>>>> Galway
>>>>> Ireland
>>>>> --
>>>>> www.bioinformaticstutorials.com
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>  | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>
>>>
>>>
>>> --
>>> Paul Geeleher
>>> School of Mathematics, Statistics and Applied Mathematics
>>> National University of Ireland
>>> Galway
>>> Ireland
>>> --
>>> www.bioinformaticstutorials.com
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at stat.math.ethz.ch
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>
>>
> 
> 
>