[BioC] easyRNASeq packages: overlapping exons warning unsolved (?) and use of paired end information

Tue Dec 3 14:01:53 CET 2013

Thanks Gordon!

---------------------------------------------------------------
Nicolas Delhomme

Nathaniel Street Lab
Department of Plant Physiology
Umeå Plant Science Center

Tel: +46 90 786 7989
Email: nicolas.delhomme at plantphys.umu.se
SLU - Umeå universitet
Umeå S-901 87 Sweden
---------------------------------------------------------------

On 02 Dec 2013, at 23:44, Gordon K Smyth <smyth at wehi.edu.au> wrote:

> I agree with Nicolas that counting reads is much more subtle than people
> think before they consider it in detail.  There is a detailed discussion
> of multiple counting issues and solutions in:
> 
>   http://www.ncbi.nlm.nih.gov/pubmed/24227677
> 
> The solutions are implemented in the RSubread package and support paired
> end reads and so on.
> 
> Best wishes
> Gordon
> 
>> Date: Mon, 2 Dec 2013 10:23:08 +0100
>> From: Nicolas Delhomme <nicolas.delhomme at umu.se>
>> To: Gabriele Zoppoli <zoppoli at gmail.com>
>> Cc: "bioconductor at r-project.org list" <bioconductor at r-project.org>
>> Subject: Re: [BioC] easyRNASeq packages: overlapping exons warning
>>      unsolved (?) and use of paired end information
>> 
>> Dear Gabriele,
>> 
>> Sorry I never came to see that post before. Not an excuse, but Bioc has such a huge traffic - I made a rule now to sort these emails.
>> 
>> On 06 Jun 2013, at 16:52, Gabriele Zoppoli <zoppoli at gmail.com> wrote:
>> 
>>> Dear Bioconductor mailing list,
>>> 
>>> I am trying to load and summarize ~ 60 paired-read BAM files from human
>>> cancer RNA-seq experiments using Illumina 2x50 protocol, for downstream use
>>> with edgeR and DESeq.
>>> 
>>> First question: I noticed several posts have been issued on the same topic,
>>> i.e. the way to solve the warning: "There are [any number] synthetic exons
>>> as determined from your annotation that overlap! This implies that some
>>> reads will be counted more than once! Is that really what you want?" when
>>> using easyRNASeq.
>>> 
>>> So far, I haven't seen any answer that doesn't pass through the use of
>>> GenomicRanges, or even any answer at all for some decently written posts.
>> 
>> Sadly that happens when some threads go off-list. I may miss times when this happen, as I receive numerous inquiries off-list. Many have to do with topics that have already been discussed on the list or are trivial so I do not consider it worth to bring these back to the list. Too much information kill the information.
>> 
>>> The easyRNASeq vignette is not entirely clear on that point. I'm therefore
>>> wondering whether anybody has come up with a solution and posted it in a
>>> plain and reproducible fashion.
>> 
>> Point taken, I?ve been discussing this a lot and I agree the vignette is the best place for that sort of information. However, the plain and reproducible is going to be difficult, because there are many different annotation formats and many different organisms specificity. It is very frequently necessary to make custom annotations because users have different questions or different usage. I?m finalising some edition to the vignettes with regards to this. Should be in version 1.8.4 by the end of this week. Note that there?s already some information on that in the vignette, in particular section 7.1.
>> 
>> To elaborate, the point of this warning is to make sure that you - the user - understand what you are doing. Most people think that counting reads is easy, but it?s all the contrary. And if you do not not understand how counting happens, you are bound to make mistakes that will eventually flaw your analysis. As there are many ways of counting and as users might have different opinions or different usage for read counting, there is no single optimal solution that can be implemented - things are standardising though and I?m working on having solution(s) for this in a next version of easyRNASeq. Anyway, at the time I designed the package, reporting a warning and letting the user deal with it as its convenience was the optimal thing to do. Hence, the warning is not for me to solve but for you to weight how much this might affect the analysis you are planning.
>> 
>> Now, if you look at the vignette in the version 1.8.3 of the package, you?ll see that section 7.1 describes how you can get read counts per ?gene? minimising cases where ?multiple counting? might occur; what I call generating ?synthetic transcripts?. Even when doing this, there will still be cases where it occurs, i.e. if genes on different strands are overlapping, which is what the warning message you are seeing above is telling you. To decide whether this is critical or not depends on your organisms and on the cases where these occur; e.g. in drosophila, poplar, these are often due to overlapping UTRs and I deal with these simply by shortening the UTRs to avoid multiple counting. Hence what you need to do after you perform the steps in section 7.1 is to check for ?synthetic transcripts? overlaps. Here?s how you can get the list of those:
>> 
>> obj <- as(syntheticGeneModel,?RangedData)
>> ovl <- findOverlaps(obj,ignoreSelf=TRUE,ignoreRedundant=TRUE)
>> 
>> and then investigate the ?ovl? HitLists object. I?m adding some details about this as well in the vignette; should be out by the end of the week. Moreover, I have added a flag in easyRNASeq that reports which annotation entry has an overlap, e.g. to flag the genes for downstream analyses. This information is only accessible with the outputFormat=?RNAseq? and the details will be in the vignette.
>> 
>> Finally, a note on easyRNASeq counting. EasyRNASeq assumes that you know what you are doing, i.e. at the moment it does not check if you are providing multiple mapping reads, chimeric reads (non properly paired reads), etc. You need to check what your aligner of choice reports, my settings drop multiple mapping and chimeric reads but I?ve investigated what consequences this has on my reference organism so I know which genes are affected by this. Actually, this behaviour of easyRNASeq will change in the next development version; checks will be performed on the data to assess strand specificity, presence of multiple mapping, chimeric reads, etc., which should make its use more intuitive and less error-prone.
>> 
>>> Second question: does anybody know whether the aforementioned easyRNASeq
>>> package makes use of the "properly paired" reads for summarization? I
>>> really couldn't find anything on that either, even after a month of
>>> googling around.
>> 
>> There?s no reason not to email me directly, so do not hesitate doing so whenever you have questions; please mind the changed email address. I can feel your frustration through this email and I understand it if you?ve been looking for answers for so long before emailing. Once again, I?m very sorry that I did not see that email earlier. I very welcome questions, comments, criticisms because eventually that feedback benefits me and others.
>> Anyway, the reason why there is no answer out there, is that I believe this is the first time that question is asked as you state it. The short answer is ?not at the moment?. The long answer is that easyRNASeq use a coverage approach - i.e. looking at the coverage of every bp and dividing that by the fractions of the fragments overlapping a given feature (e.g. an exon). I?m currently re-implementing the whole package so as to better support pairing, strandedness, parallel processing, etc. Again to re-iterate the comments above, at the moment to use easyRNASeq in a sensible way, knowing the content of your BAM file - i.e. which reads are reported by your aligner - is essential for accurate counting. This should be significantly simplified, once I?m done with the easyRNASeq re-implementation.
>> 
>> Once again, sorry for not seeing that email before.
>> 
>> Hope it helps anyway,
>> 
>> Cheers,
>> 
>> Nico
>> 
>> 
>>> 
>>> That's what I've done so far:
>>> 
>>> countTable <- easyRNASeq(filesDirectory=getwd(),
>>>                        organism="Hsapiens",
>>>                        annotationMethod="rda",
>>>                        annotationFile="gAnnot.rda",
>>>                        gapped=TRUE, count="genes",
>>>                        summarization="geneModels",
>>>                        filesDirectory=getwd(),
>>>                        filenames=BAM_files,
>>>            outputFormat="RNAseq",
>>>                nBcores=4)
>>> Checking arguments...
>>> Fetching annotations...
>>> Computing gene models...
>>> Summarizing counts...
>>> Processing sample1.bam
>>> Updating the read length information.
>>> The alignments are gapped.
>>> Minimum length of 1 bp.
>>> Maximum length of 51 bp.
>>> [...]
>>> Preparing output
>>> 
>>> Warning messages:
>>> [...]
>>> 2: In easyRNASeq(filesDirectory = getwd(), organism = "Hsapiens",
>>> annotationMethod = "rda",  :
>>> There are 16816 synthetic exons as determined from your annotation that
>>> overlap! This implies that some reads will be counted more than once! Is
>>> that really what you want?
>>> [...]
>>> 
>>> ##rda file is derived from a previous iteration of the same command using
>>> annotationMethod="biomaRt" and then doing
>>> 
>>> gAnnot <- genomicAnnotation(count.genes)
>>> gAnnot <- gAnnot[space(gAnnot) %in%
>>> paste("chr",c(1:22,"X","Y","M"),sep=""),]
>>> save(gAnnot,file="gAnnot.rda")
>>> 
>>> as suggested by Nicholas Delhomme
>>> 
>>> Thanks in advance!
>>> 
>>> sessionInfo()
>>> R version 3.0.1 (2013-05-16)
>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>> 
>>> locale:
>>> [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>> [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>> [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>>> [7] LC_PAPER=C                 LC_NAME=C
>>> [9] LC_ADDRESS=C               LC_TELEPHONE=C
>>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>> 
>>> attached base packages:
>>> [1] parallel  stats     graphics  grDevices utils     datasets  methods
>>> [8] base
>>> 
>>> other attached packages:
>>> [1] easyRNASeq_1.6.0       ShortRead_1.18.0       latticeExtra_0.6-24
>>> [4] RColorBrewer_1.0-5     Rsamtools_1.12.3       DESeq_1.12.0
>>> [7] lattice_0.20-15        locfit_1.5-9.1         BSgenome_1.28.0
>>> [10] GenomicRanges_1.12.4   Biostrings_2.28.0      IRanges_1.18.1
>>> [13] edgeR_3.2.3            limma_3.16.5           biomaRt_2.16.0
>>> [16] Biobase_2.20.0         genomeIntervals_1.16.0 BiocGenerics_0.6.0
>>> [19] intervals_0.14.0
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] annotate_1.38.0      AnnotationDbi_1.22.6 bitops_1.0-5
>>> [4] DBI_0.2-7            genefilter_1.42.0    geneplotter_1.38.0
>>> [7] grid_3.0.1           hwriter_1.3          RCurl_1.95-4.1
>>> [10] RSQLite_0.11.4       splines_3.0.1        stats4_3.0.1
>>> [13] survival_2.37-4      XML_3.96-1.1         xtable_1.7-1
>>> [16] zlibbioc_1.6.0
>>> 
>>> 
>>> 
>>> 
>>> --
>>> Gabriele Zoppoli, MD
>>> Ph.D., Clinical and Experimental Oncology and Hematology
>>> Visiting Researcher, BCTRL J.C. Heuson, Institut J. Bordet, Bruxelles BE
>>> Internal Medicine Resident, DiMI, IRCCS AOU San Martino IST, Genova, IT
>>> Former Guest Researcher, LMP, CCR, NCI, NIH, Bethesda MD
>>> 
>>> 
>>> Tel: +39 010 353 7968
>>> Mobile 1: +32 478 337 942
>>> Mobile 2: +39 349 617 0129
>>> Email:           gabriele.zoppoli at unige.it
>>> Alt. Email:     zoppoli at gmail.com
>>> Alt. Email 2:  gzoppoli at libero.it
>>> Alt. Email 3:  gabriele.zoppoli at bordet.be
>>> ----------------------------------------------------------
> 
> ______________________________________________________________________
> The information in this email is confidential and intend...{{dropped:4}}
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor