[BioC] get synthetic exon dataset with easyRNASeq

Fri Nov 9 10:24:07 CET 2012

Hi Meritxell,

Sorry, I forgot to answer that email. I've put the Bioc mailing list in Cc as it might be of interest to others. 

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------

On Oct 16, 2012, at 5:01 PM, Meritxell Oliva wrote:

> Hi Nicolas,
> 
> How does easyRNASeq handles the multiple mapping reads in the *.bam files? I observe that at least a trimming treatment is applied to  mapped reads  when needed (look at log below)

easyRNASeq does not handle multiple mapping reads per se. If you provide a file that contains multiple mapping, it will just extract the read alignment information, so de-facto taking advantage of multiple mapping. It's something I'm working on as well at the moment (i.e. using such tools to get more power out of my analyses), so take the above sentence with a grain of salt as the way multiple alignments are reported certainly depends on the aligner being used; so such information might be overseen for some tools at the moment. If you tell me which tool you are using and how multiple mapping reads are specified in the bam output I can tell you how easyRNASeq is going to behave. 

There's no trimming occurring in easyRNASeq, that message is indeed misleading. It simply means that your bam file contains reads of variable size, either because they have been trimmed prior to the alignment or the aligner software used returned gapped alignments, i.e. reads were split in 2 (or more) fragments.

> 
> > CEU.RNASeq.geneModel <- easyRNASeq(outputFormat="RNAseq",organism="HSapiens",count="genes",summarization="geneModels",pattern=".*\\.bam$",filesDirectory="../Montgomery2010/bam_hg19/",annotationFile="Homo_sapiens.GRCh37.68.gtf",annotationMethod="gtf");
> Checking arguments...
> Fetching annotations...
> Read 2156097 records
> Computing gene models...
> Summarizing counts...
> Processing E-MTAB-197.BAM.ERR009096.bam
> Updating the read length information.
> The reads have been trimmed.
> Minimum length of 36 bp.
> Maximum length of 38 bp.
> 
> so, I wonder which are the additional filters + treatments that are applied. 

There's no filter/treatment.

>  
> I do have some additional questions: what is the best way to assign reads to transcripts uniquely?

That's a good question. It's complex because of the presence of isoforms (without talking about paralogs / gene family). Using multiple mapping read is a start. Then if you are interested in differential isoform expression, I would rather get a set of gene models (disjoint to get unique and shared exonic regions as discussed below) and run something like DEXSeq to get differential exon usage. Then based on this information, go backwards and try to identify which isoform(s) that particular exon is a member of. In my opinion trying to summarize per isoforms is a much more complex approach and the likelyhood to get false positive much higher. But again that's just my personal opinion and a description of the approach I would use (provided I've the correct kind and amount of data).

> I've already assigned reads to genes uniquely [OBJECT.GENE=easyRNASeq(counts=gene,summarization=geneModels,etc...] Is it possible to use the RNASeq object containing the synthetic exons obtained by geneModel(OBJECT.GENE) for that purpose?

Yes, indeed.

> If so, which are the functions to be run?

geneModel() to get the geneModel out of the RNAseq object. Then using GenomicRanges/IRanges function to modify that object. I've been planning for a while now to add this as a use case in the vignette, but hardly find the time to do so.

> I've had a look at the manual:
> 
> create transcript specific synthetic exons, i.e. features that would be unique to a transcript. To offer this possibility, transcripts count can be summarized from "features", in which case the annotation object need to have both the "feature" and "transcript" fields defined.
> 
> , but it's not clear in my mind how to do it. Could you provide a further explanation?

As said, I've been wanting to implement a use case for that. If you're willing to share with me your RNAseq object, we could do that together. Just let me know.

> 
> Last but not least: the syntetic exon dataset seems not to be exactly what I am looking for: reads are assigned uniquely to exonic regions, which are created by merging overlapping exons. I think it would be interesting to disjoin the exons and have the reads assigned to unique exonic regions and shared exonic regions.

That's what happen, overlapping exons are disjointed into non-overlapping and overlapping synthetic exons. I'm currently working on that with the Bioconductor developer to come up with a single solution, based on the GenomicRanges summarizeOverlap function.

>  Is there a way/mode i to achieve that using asn easyRNASeq implemented model? I think it would be equivalent to what u achieve by using disjoin() function provided by IRange.

Yes, it is. And that's what we are trying to come to, a unique set of function for this.

Hope this helps shed some light on what's happening in easyRNASeq.

Cheers,

Nico 

> 
> Thanks
> 
> 
> On Oct 12, 2012, at 9:36 AM, Nicolas Delhomme wrote:
> 
>> Hi Meritxell,
>> 
>> I've Cced the Bioc Mailing list in case this is of interest to others.
>> 
>> On Oct 11, 2012, at 6:15 PM, Meritxell Oliva wrote:
>> 
>>> Hi Nicolas,
>>> 
>>> I am an easyRNASeq "newbie" user.
>>> 
>>> First of all, congratulations for the development of the pipeline: so far it's one of the best R libraries I have found to deal with RNASeq data, as it tries to tackle problematic issues such as unique read-exon count assignment and also wraps the normalization packages (DESeq, edgeR), so you get all you need in one go. Thanks!
>>> 
>> 
>> Thanks, that's nice to hear. Let me know as well whenever your encounter problems of think of new features!
>> 
>>> I do have a question: I would like to create a non-redundant, synthetic exon dataset, using the Ensembl68 gene models. From what I understand from the manual, when using easyRNASeq() if you summarize your counts by counts=gene,summarization=geneModels, this  synthetic exon dataset is generated in order to create unique read-exon correspondances. This is what I do, and I store the object as RNASeq object, to preserve the genomic annotation. However, the annotation that I get if I apply the function genomicAnnotation() to this object, is the original one from Ensembl, with redundant exons shared between transcripts. I would like to get the synthetic exon dataset, to select unique coding regions for each gene transcript. 
>>> 
>>> How can I get this dataset? My ultimate goal is to perform gene expression differential analysis at gene, transcript and exon level. First one is solved, and I want to find the best way to do perform the latter ones.
>>> 
>> 
>> At the moment it's still a dual step process, but I plan on making that easier. You first need to run easyRNASeq(counts=gene,summarization=geneModels,etc...) and asking to get an "RNAseq" outputFormat: rnaSeq <- easyRNASeq(counts=gene,summarization=geneModels,outputFormat="RNAseq",etc...). This will give you an object of the class RNAseq that contains the geneModel annotation accessible through geneModel(rnaSeq). That's a RangedData object containing the synthetic exon, although it is still redundant for genes located on opposite strands. So if you're not using stranded RNAseq data, you need to do some more filtering.
>> 
>>> Can you help me?
>>> 
>> 
>> Hope this did, let me know if not,
>> 
>> Nico
>> 
>>> Thanks
>>> 
>>> 
>>> Meritxell Oliva
>>> PhD student 
>>> IBB (Biotechnology and Biomedicine Institute)
>>> Comparative and Functional Genomics group
>>> Campus Universitari - 08193 Bellaterra Cerdanyola del Vallès - Barcelona
>>> 
>>> 
>>> 
>>> 
>> 
> 
> Meritxell Oliva
> PhD student 
> IBB (Biotechnology and Biomedicine Institute)
> Comparative and Functional Genomics group
> Campus Universitari - 08193 Bellaterra Cerdanyola del Vallès - Barcelona
> 
> 
> 
>