[BioC] reads of different lengths (easyRNASeq)

Nicolas Delhomme delhomme at embl.de
Wed Jul 4 18:15:41 CEST 2012


Dear Mark,

I'm cc'ing the Bioconductor mailing list in case it helps others.

I've implemented the support for variable read length (version 1.3.7 of easyRNASeq, available in SVN now and in a couple of days from Bioc). Thanks for your reproducible use case. Below is the code I ran:

library(easyRNASeq)

counts <- easyRNASeq(organism="Dmelanogaster",
                     annotationMethod="gtf",
                     annotationFile="Drosophila_melanogaster.BDGP5.67.gtf",
                     gapped=TRUE, count="genes",
                     summarization="geneModels",
                     pattern=".subread_s.bam$", 
                     filesDirectory=".")

It took 4 minutes on my machine. Having variable read length does affect the performance but I'm working on that.

In the command line, note that I removed the argument 'format' as bam is now the default. Since you're using 'bam', the arguments chr.sizes is not necessary either as the chromosome lengths are extracted from the BAM file. Using the chr.map is not necessary for Dmelanogaster as easyRNASeq should correctly convert the chromosome names. To list which organism easyRNASeq support that conversion for, I have added a method: knownOrganisms that lists them. If your organism is not part of that list but the chromosome/scaffold names are identical between the BAM file and the annotation file, it will work smoothly too.

Thanks again,

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On Jun 13, 2012, at 2:21 PM, Mark Robinson wrote:

> Dear Nico,
> 
> I get this error (see below for full details):
> 
> Error in fetchCoverage(obj, format = format, filename = file, filter = filter,  : 
>  The file ./SRR031714.subread_s.bam contains reads of different sizes: 74, 70 . We cannot deal with such data at the moment. Please contact the authors to add this functionality.
> 
> With read trimming and so on, won't this be a common occurrence?  Your error message says to ask you to add this functionality, so I ask.
> 
> Best,
> Mark
> 
> 
>> # setup
>> gtf <- "Drosophila_melanogaster.BDGP5.67.gtf"
>> fn <- dir(,".subread_s.bam$") # file names
>> 
>> nm <- names( seqlengths(Dmelanogaster) )
>> chr.map <- data.frame(from=nm,to=gsub("chr","",nm))
>> #chr.map <- data.frame(from=nm,to=gsub("chr","",nm))
>> 
>> counts <- easyRNASeq(organism="Dmelanogaster", chr.sizes=as.list(seqlengths(Dmelanogaster)),
> +                      readLength=74L, annotationMethod="gtf", annotationFile=gtf, format="bam", gapped=TRUE,
> +                      count="exons", filenames=fn, filesDirectory=".", chr.map=chr.map)
> Checking arguments... 
> Fetching annotations... 
> Read 308180 records
> Summarizing counts... 
> Processing SRR031708.subread_s.bam 
> Processing SRR031709.subread_s.bam 
> Processing SRR031710.subread_s.bam 
> Processing SRR031711.subread_s.bam 
> Processing SRR031712.subread_s.bam 
> Processing SRR031713.subread_s.bam 
> Processing SRR031714.subread_s.bam 
> Error in fetchCoverage(obj, format = format, filename = file, filter = filter,  : 
>  The file ./SRR031714.subread_s.bam contains reads of different sizes: 74, 70 . We cannot deal with such data at the moment. Please contact the authors to add this functionality.
> In addition: There were 23 warnings (use warnings() to see them)
>> sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
> [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
> [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
> [7] LC_PAPER=C                 LC_NAME=C                 
> [9] LC_ADDRESS=C               LC_TELEPHONE=C            
> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods  
> [8] base     
> 
> other attached packages:
> [1] GenomicFeatures_1.8.1                 
> [2] AnnotationDbi_1.18.1                  
> [3] BSgenome.Dmelanogaster.UCSC.dm3_1.3.17
> [4] easyRNASeq_1.2.3                      
> [5] ShortRead_1.14.4                      
> [6] latticeExtra_0.6-19                   
> [7] RColorBrewer_1.0-5                    
> [8] lattice_0.20-6                        
> [9] Rsamtools_1.8.5                       
> [10] DESeq_1.8.3                           
> [11] locfit_1.5-8                          
> [12] BSgenome_1.24.0                       
> [13] GenomicRanges_1.8.6                   
> [14] Biostrings_2.24.1                     
> [15] IRanges_1.14.3                        
> [16] edgeR_2.6.7                           
> [17] limma_3.12.1                          
> [18] biomaRt_2.12.0                        
> [19] Biobase_2.16.0                        
> [20] genomeIntervals_1.12.0                
> [21] BiocGenerics_0.2.0                    
> [22] intervals_0.13.3                      
> 
> loaded via a namespace (and not attached):
> [1] annotate_1.34.0    bitops_1.0-4.1     DBI_0.2-5          genefilter_1.38.0 
> [5] geneplotter_1.34.0 grid_2.15.0        hwriter_1.3        RCurl_1.91-1      
> [9] RSQLite_0.11.1     rtracklayer_1.16.1 splines_2.15.0     stats4_2.15.0     
> [13] survival_2.36-14   tools_2.15.0       XML_3.9-4          xtable_1.7-0      
> [17] zlibbioc_1.2.0   
> 
> 
> 
> 
> ----------
> Prof. Dr. Mark Robinson
> Bioinformatics
> Institute of Molecular Life Sciences
> University of Zurich
> Winterthurerstrasse 190
> 8057 Zurich
> Switzerland
> 
> v: +41 44 635 4848
> f: +41 44 635 6898
> e: mark.robinson at imls.uzh.ch
> o: Y11-J-16
> w: http://tiny.cc/mrobin
> 



More information about the Bioconductor mailing list