[BioC] how to calculate gene length to be used in rpkm() in edgeR

Gordon K Smyth smyth at wehi.EDU.AU
Sun May 4 06:15:59 CEST 2014


Hi Ryan and Shirley,

The appropriate gene length should match the method and annotation that 
was used to count the reads.

I'm assuming that the counting method and annotation used for the new data 
A might differ from those used for data B, so the appropriate gene lengths 
might not be the same.

The software used to count the reads should also return the appropriate 
gene length.  For example, here is a case study showing how gene lengths 
are returned by the featureCounts function and used to compute rpkm in 
edgeR:

  http://bioinf.wehi.edu.au/RNAseqCaseStudy
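
For reference, the calculation itself is simple once counts and matching gene lengths are in hand. The following is a minimal Python sketch of the textbook RPKM definition, not edgeR's code. The function name and the example numbers are made up for illustration, and it assumes raw library sizes with no normalization factors applied.

```python
def rpkm(counts, gene_lengths, lib_size=None):
    """Reads per kilobase of gene per million mapped reads, for one sample.

    counts:       read counts per gene
    gene_lengths: gene lengths in bases, in the same order as counts
    lib_size:     total reads in the library; defaults to sum(counts)
    """
    if len(counts) != len(gene_lengths):
        raise ValueError("counts and gene_lengths must have the same length")
    if lib_size is None:
        lib_size = sum(counts)
    return [
        c / (length / 1e3) / (lib_size / 1e6)
        for c, length in zip(counts, gene_lengths)
    ]


# Hypothetical sample with three genes and one million mapped reads.
counts = [500, 1000, 250]
gene_lengths = [2000, 4000, 500]
values = rpkm(counts, gene_lengths, lib_size=1_000_000)
```

The first two genes get the same RPKM even though one has twice the count, because it is also twice as long. This is why the length vector has to match the way the reads were counted.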

In the latest version of edgeR, rpkm() will even find the gene lengths 
automatically in the DGEList object.  In this case study, the gene length 
is defined to be the total length of all exons in the gene, including the 
3'UTR, because featureCounts counts all reads that overlap any exon.
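
To make that definition concrete, here is a rough Python sketch of computing "total length of all exons" from GTF-style records, counting each base once even when exons from different isoforms overlap. It is an illustration under stated assumptions, not what featureCounts runs internally: the function name and example records are hypothetical, exons are grouped by the gene_id attribute only, and each gene is assumed to sit on a single chromosome.

```python
import re
from collections import defaultdict


def gene_lengths_from_gtf(lines):
    """Return {gene_id: number of bases covered by at least one exon}."""
    exons = defaultdict(list)
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 9 or fields[2] != "exon":
            continue
        match = re.search(r'gene_id "([^"]+)"', fields[8])
        if match is None:
            continue
        # GTF coordinates are 1-based and inclusive.
        exons[match.group(1)].append((int(fields[3]), int(fields[4])))

    lengths = {}
    for gene, intervals in exons.items():
        intervals.sort()
        total = 0
        cur_start, cur_end = intervals[0]
        for start, end in intervals[1:]:
            if start <= cur_end + 1:
                # Overlapping or adjacent exon: extend the current block.
                cur_end = max(cur_end, end)
            else:
                total += cur_end - cur_start + 1
                cur_start, cur_end = start, end
        total += cur_end - cur_start + 1
        lengths[gene] = total
    return lengths


# Hypothetical records: two isoforms of geneA share part of an exon.
example = [
    'chr1\tsrc\texon\t100\t199\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr1\tsrc\texon\t150\t249\t.\t+\t.\tgene_id "geneA"; transcript_id "tx2";',
    'chr1\tsrc\texon\t400\t499\t.\t+\t.\tgene_id "geneA"; transcript_id "tx1";',
    'chr1\tsrc\texon\t1000\t1099\t.\t+\t.\tgene_id "geneB"; transcript_id "tx3";',
]
lengths = gene_lengths_from_gtf(example)
```

Introns never enter the total, and merging overlapping exons gives one length per gene regardless of how many isoforms are annotated.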

Best wishes
Gordon


> Date: Fri, 02 May 2014 15:15:07 -0700
> From: Ryan <rct at thompsonclan.org>
> To: shirley zhang <shirley0818 at gmail.com>
> Cc: "bioconductor at r-project.org" <bioconductor at r-project.org>
> Subject: Re: [BioC] how to calculate gene length to be used in rpkm() in edgeR
>
> Hi Shirley,
>
> The appropriate gene length to use is whatever gene length was used to
> compute RPKM values for data set B. If you don't have that information,
> then I don't see how you can compute comparable RPKM values for your
> data.
>
> -Ryan
>
> On Fri May  2 15:01:32 2014, shirley zhang wrote:
>> Dear List,
>>
>> I've been using edgeR for differential expression analysis for data
>> generated from the same tissue, but different conditions.
>>
>> Now I have an RNA-seq data set A (n=20), and would like to compare it with
>> another RNA-seq data set B (n=1,000 across different tissues). Since data B
>> consists of normalized and batch-effect adjusted RPKM values, I need to
>> generate RPKM values for my own data A.
>>
>> I already have a count table, and would like to use rpkm() in edgeR, but
>> first I have to get a gene length vector. My question is how to calculate gene
>> length from an "Ensembl.gtf" file by taking into account the following:
>>
>> 1. Gene 1 is much longer than Gene 2 if both exons and introns are included.
>>      But Gene 1 has only 3 exons and Gene 2 has 10 exons, so at the
>>      transcript level Gene 2 > Gene 1.
>>
>> 2. For the same Gene, there are > 1 transcript isoforms.  In different
>> tissues, different transcript isoforms will be expressed.
>>
>> Many thanks,
>> Shirley
