[BioC] Question about CSAMA10 "Lab-8-RNAseqUseCase.pdf" tutorial on bioconductor website.

Paul Geeleher paulgeeleher at gmail.com
Tue Sep 28 17:09:23 CEST 2010


I guess my next question is if there is a convenient way of filtering
out the part of the gene
that overlaps another one using GenomicRanges?

It'd be a shame to have to rewrite all my code using Python or Genominator.

Paul.

On Mon, Sep 27, 2010 at 6:59 PM, Kasper Daniel Hansen
<kasperdanielhansen at gmail.com> wrote:
> You can also do all of this using Genominator, including dealing with
> overlapping genes.  There should be ample documentation in the
> vignettes, but you want to focus on
>  summaryByAnnotation with the groupBy argument for aggregating from
> exon to gene level
>  makeGeneRepresentation in order to filter out the part of the gene
> that overlaps another one
>
> Kasper
>
> On Mon, Sep 27, 2010 at 10:28 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>> Running it on a high end cluster, I estimated 41 hours based on how
>> long it took to do the first 10 genes. It's not a majorly big deal as
>> I can presumably run it in parallel for the different samples.
>>
>> P
>>
>> On Mon, Sep 27, 2010 at 3:24 PM, Steve Lianoglou
>> <mailinglist.honeypot at gmail.com> wrote:
>>> Hi Paul,
>>>
>>> On Mon, Sep 27, 2010 at 10:08 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>> Okay great, thanks Steve, this works nicely. It does seem to take a
>>>> long (2 days!) time though but I guess this is to be expected! (I'm
>>>> assuming this is why they used the simple version in the tutorial...
>>>
>>> Wow ... that's ... a long time.
>>>
>>> What type of machine are you running this on, by the by?
>>>
>>> You know, I suspect it's so slow due to the slow speed involved in
>>> lapply-ing over GRangesList objects, which has been brought up before:
>>> http://thread.gmane.org/gmane.science.biology.informatics.conductor/30564
>>>
>>> If you do this sort of thing often, you might consider converting your
>>> GRangesList into a "normal" list of GRanges (or an appropriate
>>> IRangesList -- but you'd have to handle chromosome/strand info by
>>> yourself). I'll leave that as an exercise for the reader :-)
>>>
>>> -steve
>>>
>>>
>>>>
>>>> Paul
>>>>
>>>> On Thu, Sep 23, 2010 at 4:57 PM, Steve Lianoglou
>>>> <mailinglist.honeypot at gmail.com> wrote:
>>>>> Hi,
>>>>>
>>>>> On Thu, Sep 23, 2010 at 11:06 AM, Paul Geeleher <paulgeeleher at gmail.com> wrote:
>>>>>> Hi,
>>>>>>
>>>>>> I've been going through this RNA-seq use case
>>>>>> (http://bioconductor.org/help/course-materials/2010/CSAMA10/Lab-8-RNAseqUseCase.pdf)
>>>>>> with some data I have and I'm wondering about section 2.4 where they
>>>>>> calculate gene expression by counting the number of reads that alight
>>>>>> to within the boundaries of a genes, then normalize these based on the
>>>>>> length of the gene. Some of the code is as follows:
>>>>>>
>>>>>> dmGeneBounds <- CSAMA10::geneBounds(dmTxDb)
>>>>>> dmGeneBounds <- dmGeneBounds[seqnames(dmGeneBounds) %in%
>>>>>> levels(seqnames(alnRanges))]
>>>>>> head(dmGeneBounds, 3)
>>>>>> dmGeneCounts <- countOverlaps(dmGeneBounds, alnRanges)
>>>>>> dmRPKM <- CSAMA10::rpkm(dmGeneCounts, dmGeneBounds)
>>>>>>
>>>>>> My question is, is this actually correct, could you publish using this
>>>>>> method or is this just meant as a simple example?
>>>>>>
>>>>>> I'm interested in the ranks of the genes in the samples for a
>>>>>> subsequent analysis, but I would have assumed that you'd have to count
>>>>>> the number of reads that map to the EXONS of each gene and normalize
>>>>>> by the length of the EXONS, rather then the gene itself?
>>>>>>
>>>>>> If this is the case I wonder if there a tutorial that shows how to do that...
>>>>>
>>>>> If you get a bit comfortable with the GenomicFeatures package, you can
>>>>> do it pretty easily. Assume you have your reads stored in some GRanges
>>>>> object:
>>>>>
>>>>> txdb <- loadFeatures('/path/to/your/txdb')
>>>>> e <- exonsBy(txdb, 'gene')
>>>>> normedCounts <- lapply(e, function(x) {
>>>>>  countOverlaps(reads, x) / sum(width(x))
>>>>> })
>>>>>
>>>>> normedCounts now has the number of reads falling in each exon per gene
>>>>> divided by the total exon-length of the gene.
>>>>>
>>>>> You can run with that example to calculate a "proper" RPKM ...
>>>>>
>>>>> -steve
>>>>>
>>>>> --
>>>>> Steve Lianoglou
>>>>> Graduate Student: Computational Systems Biology
>>>>>  | Memorial Sloan-Kettering Cancer Center
>>>>>  | Weill Medical College of Cornell University
>>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Paul Geeleher
>>>> School of Mathematics, Statistics and Applied Mathematics
>>>> National University of Ireland
>>>> Galway
>>>> Ireland
>>>> --
>>>> www.bioinformaticstutorials.com
>>>>
>>>
>>>
>>>
>>> --
>>> Steve Lianoglou
>>> Graduate Student: Computational Systems Biology
>>>  | Memorial Sloan-Kettering Cancer Center
>>>  | Weill Medical College of Cornell University
>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>
>>
>>
>>
>> --
>> Paul Geeleher
>> School of Mathematics, Statistics and Applied Mathematics
>> National University of Ireland
>> Galway
>> Ireland
>> --
>> www.bioinformaticstutorials.com
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>



-- 
Paul Geeleher
School of Mathematics, Statistics and Applied Mathematics
National University of Ireland
Galway
Ireland
--
www.bioinformaticstutorials.com



More information about the Bioconductor mailing list