[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Mon Sep 27 16:24:37 CEST 2010

Hi Paul,

On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
> Okay great, thanks Steve, this works nicely. It does seem to take a
> long (2 days!) time though but I guess this is to be expected! (I'm
> assuming this is why they used the simple version in the tutorial...

Wow ... that's ... a long time.

What type of machine are you running this on, by the by?

You know, I suspect it's so slow due to the slow speed involved in
lapply-ing over GRangesList objects, which has been brought up before:
http://thread.gmane.org/gmane.science.biology.informatics.conductor/30564

If you do this sort of thing often, you might consider converting your
GRangesList into a "normal" list of GRanges (or an appropriate
IRangesList -- but you'd have to handle chromosome/strand info by
yourself). I'll leave that as an exercise for the reader :-)

-steve

>
> Paul
>
> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi,
>>
>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>> Hi,
>>>
>>> I've been going through this RNA-seq use case
>>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>>> with some data I have and I'm wondering about section 2.4 where they
>>> calculate gene expression by counting the number of reads that alight
>>> to within the boundaries of a genes, then normalize these based on the
>>> length of the gene. Some of the code is as follows:
>>>
>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>>> levels(seqnames(alnRanges))]
>>> head(dmGeneBounds, 3)
>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>>>
>>> My question is, is this actually correct, could you publish using this
>>> method or is this just meant as a simple example?
>>>
>>> I'm interested in the ranks of the genes in the samples for a
>>> subsequent analysis, but I would have assumed that you'd have to count
>>> the number of reads that map to the EXONS of each gene and normalize
>>> by the length of the EXONS, rather then the gene itself?
>>>
>>> If this is the case I wonder if there a tutorial that shows how to do that...
>>
>> If you get a bit comfortable with the GenomicFeatures package, you can
>> do it pretty easily. Assume you have your reads stored in some GRanges
>> object:
>>
>> txdb <- loadFeatures('/path/to/your/txdb')
>> e <- exonsBy(txdb, 'gene')
>> normedCounts <- lapply(e, function(x) {
>>  countOverlaps(reads, x) / sum(width(x))
>> })
>>
>> normedCounts now has the number of reads falling in each exon per gene
>> divided by the total exon-length of the gene.
>>
>> You can run with that example to calculate a "proper" RPKM ...
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
>
>
>
> --
> Paul Geeleher
> School of Mathematics, Statistics and Applied Mathematics
> National University of Ireland
> Galway
> Ireland
> --
> www.bioinformaticstutorials.com
>

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact