[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Paul Geeleher paulgeeleher at gmail.com
Mon Sep 27 16:28:03 CEST 2010


Running it on a high end cluster, I estimated 41 hours based on how
long it took to do the first 10 genes. It's not a majorly big deal as
I can presumably run it in parallel for the different samples.

P

On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi Paul,
>
> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>> Okay great, thanks Steve, this works nicely. It does seem to take a
>> long (2 days!) time though but I guess this is to be expected! (I'm
>> assuming this is why they used the simple version in the tutorial...
>
> Wow ... that's ... a long time.
>
> What type of machine are you running this on, by the by?
>
> You know, I suspect it's so slow due to the slow speed involved in
> lapply-ing over GRangesList objects, which has been brought up before:
> http://thread.gmane.org/gmane.science.biology.informatics.conductor/30564
>
> If you do this sort of thing often, you might consider converting your
> GRangesList into a "normal" list of GRanges (or an appropriate
> IRangesList -- but you'd have to handle chromosome/strand info by
> yourself). I'll leave that as an exercise for the reader :-)
>
> -steve
>
>
>>
>> Paul
>>
>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>> Hi,
>>>
>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I've been going through this RNA-seq use case
>>>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>>>> with some data I have and I'm wondering about section 2.4 where they
>>>> calculate gene expression by counting the number of reads that alight
>>>> to within the boundaries of a genes, then normalize these based on the
>>>> length of the gene. Some of the code is as follows:
>>>>
>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>>>> levels(seqnames(alnRanges))]
>>>> head(dmGeneBounds, 3)
>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>>>>
>>>> My question is, is this actually correct, could you publish using this
>>>> method or is this just meant as a simple example?
>>>>
>>>> I'm interested in the ranks of the genes in the samples for a
>>>> subsequent analysis, but I would have assumed that you'd have to count
>>>> the number of reads that map to the EXONS of each gene and normalize
>>>> by the length of the EXONS, rather then the gene itself?
>>>>
>>>> If this is the case I wonder if there a tutorial that shows how to do that...
>>>
>>> If you get a bit comfortable with the GenomicFeatures package, you can
>>> do it pretty easily. Assume you have your reads stored in some GRanges
>>> object:
>>>
>>> txdb <- loadFeatures('/path/to/your/txdb')
>>> e <- exonsBy(txdb, 'gene')
>>> normedCounts <- lapply(e, function(x) {
>>>  countOverlaps(reads, x) / sum(width(x))
>>> })
>>>
>>> normedCounts now has the number of reads falling in each exon per gene
>>> divided by the total exon-length of the gene.
>>>
>>> You can run with that example to calculate a "proper" RPKM ...
>>>
>>> -steve
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>
>>
>>
>> --
>> Paul Geeleher
>> School of Mathematics, Statistics and Applied Mathematics
>> National University of Ireland
>> Galway
>> Ireland
>> --
>> www.bioinformaticstutorials.com
>>
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact
>



-- 
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com



More information about the Bioconductor mailing list