[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Mon Sep 27 19:31:34 CEST 2010

On 09/27/2010 07:28 AM, Paul Geeleher wrote:
> Running it on a high end cluster, I estimated 41 hours based on how
> long it took to do the first 10 genes. It's not a majorly big deal as
> I can presumably run it in parallel for the different samples.

> 
> P
> 
> On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou
> <mailinglist.honeypot at gmail.com> wrote:
>> Hi Paul,
>>
>> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>> Okay great, thanks Steve, this works nicely. It does seem to take a
>>> long (2 days!) time though but I guess this is to be expected! (I'm
>>> assuming this is why they used the simple version in the tutorial...
>>
>> Wow ... that's ... a long time.
>>
>> What type of machine are you running this on, by the by?
>>
>> You know, I suspect it's so slow due to the slow speed involved in
>> lapply-ing over GRangesList objects, which has been brought up before:
>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/30564

'Slow' might be measured on the minutes rather than days time frame,
unless you're running out of memory. Clusters often have relatively
small nodes and so the approach is to work with smaller chunks on each
node. If you've got lots of memory and it's still taking days, then
posting (simplified but still slow) code might suggest some insights.

We'd like to address the GRangesList performance issue in time for the
forthcoming release.

Martin

>>
>> If you do this sort of thing often, you might consider converting your
>> GRangesList into a "normal" list of GRanges (or an appropriate
>> IRangesList -- but you'd have to handle chromosome/strand info by
>> yourself). I'll leave that as an exercise for the reader :-)
>>
>> -steve
>>
>>
>>>
>>> Paul
>>>
>>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
>>> <mailinglist.honeypot at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> I've been going through this RNA-seq use case
>>>>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>>>>> with some data I have and I'm wondering about section 2.4 where they
>>>>> calculate gene expression by counting the number of reads that alight
>>>>> to within the boundaries of a genes, then normalize these based on the
>>>>> length of the gene. Some of the code is as follows:
>>>>>
>>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>>>>> levels(seqnames(alnRanges))]
>>>>> head(dmGeneBounds, 3)
>>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>>>>>
>>>>> My question is, is this actually correct, could you publish using this
>>>>> method or is this just meant as a simple example?
>>>>>
>>>>> I'm interested in the ranks of the genes in the samples for a
>>>>> subsequent analysis, but I would have assumed that you'd have to count
>>>>> the number of reads that map to the EXONS of each gene and normalize
>>>>> by the length of the EXONS, rather then the gene itself?
>>>>>
>>>>> If this is the case I wonder if there a tutorial that shows how to do that...
>>>>
>>>> If you get a bit comfortable with the GenomicFeatures package, you can
>>>> do it pretty easily. Assume you have your reads stored in some GRanges
>>>> object:
>>>>
>>>> txdb <- loadFeatures('/path/to/your/txdb')
>>>> e <- exonsBy(txdb, 'gene')
>>>> normedCounts <- lapply(e, function(x) {
>>>>  countOverlaps(reads, x) / sum(width(x))
>>>> })
>>>>
>>>> normedCounts now has the number of reads falling in each exon per gene
>>>> divided by the total exon-length of the gene.
>>>>
>>>> You can run with that example to calculate a "proper" RPKM ...
>>>>
>>>> -steve
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>  | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>
>>>
>>>
>>>
>>> --
>>> Paul Geeleher
>>> School of Mathematics, Statistics and Applied Mathematics
>>> National University of Ireland
>>> Galway
>>> Ireland
>>> --
>>> www.bioinformaticstutorials.com
>>>
>>
>>
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>>  | Memorial Sloan-Kettering Cancer Center
>>  | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>
> 
> 
>