[BioC] package to count read prior to DESeq?

Fri Feb 3 22:12:00 CET 2012

Hi Abhi,

> Hi Nico
> 
> Thanks for a quick and detailed reply. I am a tiny bit confused, may
> be my question was not clear. Here is how I understand it. Please
> correct me if I got it wrong.
> 
> What I would like to do is study differential gene expression at both
> gene and transcript level.  As an input what I have is transcriptome
> model in a gtf file and mapped bam files(unique hits only: I have
> removed reads mapping to multiple positions in the genome). The gtf
> file lists all the possible transcripts/exons for a gene.

Sorry, I misunderstood what you meant by your cufflink gtf file. Indeed, you should be able to use it with easyRNASeq.

> 
> What I need to get from this file two types of counts.
>        1)  counts of reads that fall in each gene bin (strand specific)
>        2)  count of reads that fall in each transcript (again strand sp)
> 

I haven't had my hands on strand specific data so far, so at yet it is not supported but I'd be glad to add that into easyRNASeq (was still on my TODO list, don't get enough time... why are the days so short ;-)). Could you provide me with a small excerpt of your data, i.e. a region where you have reads on both strands coming from different transcripts? I could certainly simulate those, but real data is always better.

> What I am not sure is the counting method of easyRNASeq. Given an
> annotation of gene with 2 transcripts (A, B) how does it do the
> counting. If it is counting at gene level then at some point in the
> calculation it should be merging the exons from transcripts A and B to
> gene level.  And if it is counting at transcript level then we should
> get two counts one for each transcript. I understand in the latter
> case there will be double counting due to overlapping exons between
> the transcripts.

You're correct. You can precise for which features you want to count using the "count" argument. It takes one of four possible values: features, exons, transcripts, genes.

features: Generic features can be defined, i.e. simple interval on the genome, which for example are useful to quantify reads present on enhancers (eRNA).
exons: counts are summarized per exon
transcripts: counts are summarized per transcript. This indeed would results in double counting exons common to different transcripts
genes: in that case, you need to precise an additional argument: summarization. It takes one of two values: geneModels and bestExons. The later one is a relic when we were looking at the coverage bias observed across transcript. The former one calculates synthetic exons, i.e. it separates the different exons of a gene so that they do not contain any overlap.

> 
> It will help a lot if you could clear this.
> 

As a consequence I would not use the transcripts summarization (as warned by the package and mentioned in the vignette). In your case, you could get the geneModels from easyRNASeq using your gtf and then use that information to deconvolute the effect of every transcript in a gene, using the synthetic exons. Again this would be far from trivial. 
I think the best approach to your problem is the DEXSeq package that look at the differential expression of exons. Using it you could identify which exons are differentially expressed between conditions and trace it back to the causative set of transcripts. To do so, you need to first obtain the synthetic exons from your gtf file. You can either 1) use easyRNASeq to first get the geneModels and then rerun easyRNASeq to get the exon expression, or 2) you directly transform your gtf file into a GRangesList with one GRanges per gene. Then per GRanges, you use the disjoin function to get the synthetic ranges. This object you can provide as annotation to easyRNASeq to get the exon counts. Finally once you've got your exon count, you apply DEXSeq.

> PS: I also did try to use htseq-count for doing the same but the
> counts I got dont quite match the avg number of reads we expect to see
> / gene or transcript. I guess I should start another thread on this
> with Simon. May be I am doing something wrong.

Well, Simon might chime in there for more details, but from what he told me htseq-count does the same as the easyRNASeq geneModels; it creates synthetic exons and summarize the reads per synthetic exons.

Using DEXSeq is probably how I would do in your situation (I'm adding to my TODO list the possibility to create a DEXSeq object from easyRNASeq). The sketch I've described above is far from complete, but if you find it makes sense, I'd be glad to help you with the bolts and nuts :-)

Cheers,

Nico

> 
> Thanks!
> -Abhi
> 
> 
> 
> 
> On Fri, Feb 3, 2012 at 11:48 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>> Dear Abhi,
>> 
>> Yes, we can certainly tweak it to do so, but I'm not sure how much sense that would make.
>> 
>> First, by doing so would in end effect result in counting read several times; i.e. reads involved in the same exons of different transcripts will be accounted for several time, so you would bias your data.
>> 
>> Second, easyRNASeq is meant to start with an unmodified bam file, i.e. after any quality pre-processing has been done but before any kind of normalization/summarization. What easyRNASeq does is to compute the number of reads per feature (exon, transcript, gene,...) and normalize these results (or not) into RPKM or by using the DESeq or edgeR packages. Both these packages require "raw" unmodified data.
>> 
>> Transforming transcript "counts" (or FPKM) into genes count is definitely not a straightforward process. You would need to be able to deconvolute every transcript effect, e.g. for a gene with two transcript A and B, you would need to establish the proportion of the different effect from the part unique to transcript A (the part not overlapping transcript B), that of transcript B and their common part. Knowing that Illumina has a significant (and somewhat hard to predict) read coverage bias along any transcript (luckily that bias is reproducible...), it makes it IMO hardly possible to achieve.
>> 
>> If you are interested in gene counts, I would suggest to use easyRNASeq on its own, starting from your unmodified bam files. Then depending on your experiment, e.g. if you're doing some kind of differential expression analysis, apply DESeq, baySeq or edgeR.
>> 
>> I hope this helps,
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> On 3 Feb 2012, at 19:51, Abhishek Pratap wrote:
>> 
>>> Hi Nico
>>> 
>>> With respect to easyRNAeq package I am wondering if I supply it a
>>> transcript level gtf file can it produce gene level counts. Basically
>>> the file is the output from cufflinks.
>>> 
>>> the last(attributes) column in the file does have the gene_id attribute.
>>> 
>>> 
>>> Best,
>>> -Abhi
>>> 
>>> 
>>> 
>>> On Fri, Feb 3, 2012 at 8:06 AM, Nicolas Delhomme <delhomme at embl.de> wrote:
>>>> Hi Nat,
>>>> 
>>>> That's what the easyRNASeq (Bioc 2.10) pacakge exactly is for. Let me know if you need any help.
>>>> 
>>>> Cheers,
>>>> 
>>>> Nico
>>>> 
>>>> ---------------------------------------------------------------
>>>> Nicolas Delhomme
>>>> 
>>>> Genome Biology Computational Support
>>>> 
>>>> European Molecular Biology Laboratory
>>>> 
>>>> Tel: +49 6221 387 8310
>>>> Email: nicolas.delhomme at embl.de
>>>> Meyerhofstrasse 1 - Postfach 10.2209
>>>> 69102 Heidelberg, Germany
>>>> ---------------------------------------------------------------
>>>> 
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On 3 Feb 2012, at 16:39, nathalie wrote:
>>>> 
>>>>> Hi,
>>>>> Is there a package to create a count table from bam files, which gives an output compatible with DESeq analysis?
>>>>> thanks
>>>>> Nat
>>>>> 
>>>>> 
>>>>> --
>>>>> The Wellcome Trust Sanger Institute is operated by Genome Research Limited, a charity registered in England with number 1021457 and a company registered in England with number 2742969, whose registered office is 215 Euston Road, London, NW1 2BE.
>>>>> 
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>> 
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>>