[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Paul Geeleher paulgeeleher at gmail.com
Mon Sep 27 16:08:49 CEST 2010

Okay great, thanks Steve, this works nicely. It does seem to take a
long (2 days!) time though but I guess this is to be expected! (I'm
assuming this is why they used the simple version in the tutorial...


On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
<mailinglist.honeypot at gmail.com> wrote:
> Hi,
> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>> Hi,
>> I've been going through this RNA-seq use case
>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>> with some data I have and I'm wondering about section 2.4 where they
>> calculate gene expression by counting the number of reads that alight
>> to within the boundaries of a genes, then normalize these based on the
>> length of the gene. Some of the code is as follows:
>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>> levels(seqnames(alnRanges))]
>> head(dmGeneBounds, 3)
>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>> My question is, is this actually correct, could you publish using this
>> method or is this just meant as a simple example?
>> I'm interested in the ranks of the genes in the samples for a
>> subsequent analysis, but I would have assumed that you'd have to count
>> the number of reads that map to the EXONS of each gene and normalize
>> by the length of the EXONS, rather then the gene itself?
>> If this is the case I wonder if there a tutorial that shows how to do that...
> If you get a bit comfortable with the GenomicFeatures package, you can
> do it pretty easily. Assume you have your reads stored in some GRanges
> object:
> txdb <- loadFeatures('/path/to/your/txdb')
> e <- exonsBy(txdb, 'gene')
> normedCounts <- lapply(e, function(x) {
>  countOverlaps(reads, x) / sum(width(x))
> })
> normedCounts now has the number of reads falling in each exon per gene
> divided by the total exon-length of the gene.
> You can run with that example to calculate a "proper" RPKM ...
> -steve
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact

Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland

More information about the Bioconductor mailing list