[BioC] easyRNASeq - error reading single chromosome bam

Nicolas Delhomme delhomme at embl.de
Tue Feb 7 15:46:13 CET 2012


Hi Francesco,

The problem is that the name of the chromosome in the chr.size and annotation objects are not the same as in your bam file . Some warnings must say so, look in the 50+ you got by typing warnings().

Why is it important that the names are identical? 

1) between the annotation and the bam file: when assigning the reads to the gene, is it important that they both share the same reference, i.e. that they can be located on the same chromosomes. 

2) between the annotation, bam file and the chr.size: imagine the case where you have genes present in your annotation that are located at the end of the chromosomes and for which you get no reads. Without defining the chr.size, the coverage vector generated using the reads would stop before the end of the chromosomes. Trying to access the coverage information for these genes would then simply fail.

In your case, chromosomes are called GL000225.1 or 22 in your bam files whereas they are probably called chr1, chr2, etc... in your annotation and chr.sizes objects. In such case, I'm trying to guess the right names and emit warnings.  Can you please post the warnings so that I double-check this?

Finally, the fitType="local" arguments is one for DESeq, so you can drop it as well here, not that it would have any effect when not using DESeq normalization.

Btw, I've adapted easyRNASeq to the new DESeq version which object structure changed in the current development release,but it seems you've got the latest release :-)

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
---------------------------------------------------------------





On 7 Feb 2012, at 15:07, Lescai, Francesco wrote:

> Hi Nico,
> thanks for your quick answer.
> Here it is the information you ask, it might be quite lengthy.
> Unfortunately I get the same error in generating the RNAseq object as well.
> 
> > countDataSet <- easyRNASeq(filesDirectory=getwd(),
> +                            organism="HSapiens",
> +                            chr.sizes=as.list(seqlengths(Hsapiens)),
> +                            readLength=100L,
> +                            annotationMethod="biomaRt",
> +                            format="bam",
> +                            count="exons",
> +                            filenames=bamfiles,
> +                            normalize=FALSE,
> +                            outputFormat="RNAseq",
> +                           fitType="local"
> +                           )
> Checking arguments... 
> Fetching annotations... 
> Summarizing counts... 
> Processing SRR349689_1.fastq.novo.new.chr1.bam 
> Error in FUN(c("GL000229.1", "GL000200.1", "GL000228.1", "GL000241.1",  : 
>   (list) object cannot be coerced to type 'integer'
> In addition: There were 50 or more warnings (use warnings() to see the first 50)
> 
> I also tried using a local annotation file, generated this way
> 
> hg19.tx <- makeTranscriptDbFromUCSC(
>  genome="hg19",
>  tablename="refGene")
> exon.range<-exons(hg19.tx)
> gAnnot <- RangedData(exon.range)
> colnames(gAnnot)[2]<-"exon"
> save(gAnnot,file="gAnnot.rda")
> 
> but then the error is the same.
> 
> > countDataSet <- easyRNASeq(filesDirectory=getwd(),
> +                            organism="HSapiens",
> +                            chr.sizes=as.list(seqlengths(Hsapiens)),
> +                            readLength=100L,
> +                            annotationMethod="rda",
> +                            annotationFile="gAnnot.rda",
> +                            format="bam",
> +                            count="exons",
> +                            filenames=bamfiles,
> +                            normalize=TRUE,
> +                            outputFormat="DESeq",
> +                            conditions=conditions,
> +                           fitType="local"
> +                           )
> Checking arguments... 
> Fetching annotations... 
> Summarizing counts... 
> Processing SRR349689_1.fastq.novo.new.chr1.bam 
> Error in FUN(c("GL000229.1", "GL000200.1", "GL000228.1", "GL000241.1",  : 
>   (list) object cannot be coerced to type 'integer'
> 
> 
> The .bam file has been generated following Novocraft instructions, from a combined reference of the genome, the transcriptome and the junctions and then recomposed with USeq.
> (http://www.novocraft.com/wiki/tiki-index.php?page=RNASeq+analysis%3A+mRNA+and+the+Spliceosome&structure=Novocraft+Technologies&page_ref_id=35)
> 
> is there any other information/file that could help in tracking the issue?
> 
> thanks,
> Francesco
> 
> 
> 
> ---bam header---
> samtools view -H SRR349689_1.fastq.novo.new.chr1.bam
> @HD VN:1.0
> SO:unsorted
> @PG ID:SamTranscriptomeParser
> CL: args -f SRR349689_1.fastq.novo.sam -u -a 100000 -n 100000 -s SRR349689_1.fastq.novo_converted.sam
> @RG ID:unknownReadGroup
> SM:unknownSample
> @SQ SN:GL000229.1
> AS:trascriptome.nix  LN:119339
> @SQ SN:GL000200.1
> AS:trascriptome.nix  LN:277914
> @SQ SN:GL000228.1
> AS:trascriptome.nix  LN:228349
> @SQ SN:GL000241.1
> AS:trascriptome.nix  LN:136814
> @SQ SN:GL000219.1
> AS:trascriptome.nix  LN:220315
> @SQ SN:GL000191.1
> AS:trascriptome.nix  LN:202431
> @SQ SN:GL000242.1
> AS:trascriptome.nix  LN:142826
> @SQ SN:GL000243.1
> AS:trascriptome.nix  LN:106738
> @SQ SN:22
> AS:trascriptome.nix  LN:51342124
> @SQ SN:GL000245.1
> AS:trascriptome.nix  LN:132396
> @SQ SN:MT
> AS:trascriptome.nix  LN:116551
> @SQ SN:3
> AS:trascriptome.nix  LN:198058734
> @SQ SN:GL000217.1
> AS:trascriptome.nix  LN:224716
> @SQ SN:2
> AS:trascriptome.nix  LN:243284564
> @SQ SN:1
> AS:trascriptome.nix  LN:249336112
> @SQ SN:7
> AS:trascriptome.nix  LN:159196479
> @SQ SN:6
> AS:trascriptome.nix  LN:171154887
> @SQ SN:GL000215.1
> AS:trascriptome.nix  LN:232888
> @SQ SN:5
> AS:trascriptome.nix  LN:181002586
> @SQ SN:4
> AS:trascriptome.nix  LN:191137854
> @SQ SN:9
> AS:trascriptome.nix  LN:141252793
> @SQ SN:GL000244.1
> AS:trascriptome.nix  LN:129122
> @SQ SN:GL000216.1
> AS:trascriptome.nix  LN:230528
> @SQ SN:8
> AS:trascriptome.nix  LN:146387700
> @SQ SN:GL000218.1
> AS:trascriptome.nix  LN:197377
> @SQ SN:GL000248.1
> AS:trascriptome.nix  LN:124693
> @SQ SN:GL000224.1
> AS:trascriptome.nix  LN:226253
> @SQ SN:GL000210.1
> AS:trascriptome.nix  LN:123989
> @SQ SN:19
> AS:trascriptome.nix  LN:59216288
> @SQ SN:17
> AS:trascriptome.nix  LN:81293300
> @SQ SN:18
> AS:trascriptome.nix  LN:78015394
> @SQ SN:GL000194.1
> AS:trascriptome.nix  LN:209503
> @SQ SN:15
> AS:trascriptome.nix  LN:102616513
> @SQ SN:16
> AS:trascriptome.nix  LN:90388655
> @SQ SN:13
> AS:trascriptome.nix  LN:115207096
> @SQ SN:14
> AS:trascriptome.nix  LN:107386086
> @SQ SN:GL000195.1
> AS:trascriptome.nix  LN:278899
> @SQ SN:11
> AS:trascriptome.nix  LN:134990552
> @SQ SN:12
> AS:trascriptome.nix  LN:133927113
> @SQ SN:GL000225.1
> AS:trascriptome.nix  LN:273676
> @SQ SN:21
> AS:trascriptome.nix  LN:48217233
> @SQ SN:20
> AS:trascriptome.nix  LN:63026254
> @SQ SN:GL000193.1
> AS:trascriptome.nix  LN:214851
> @SQ SN:GL000246.1
> AS:trascriptome.nix  LN:126826
> @SQ SN:GL000237.1
> AS:trascriptome.nix  LN:117774
> @SQ SN:GL000204.1
> AS:trascriptome.nix  LN:180839
> @SQ SN:Y
> AS:trascriptome.nix  LN:59133001
> @SQ SN:GL000247.1
> AS:trascriptome.nix  LN:135428
> @SQ SN:X
> AS:trascriptome.nix  LN:155357811
> @SQ SN:GL000205.1
> AS:trascriptome.nix  LN:219192
> @SQ SN:GL000192.1
> AS:trascriptome.nix  LN:644506
> @SQ SN:GL000227.1
> AS:trascriptome.nix  LN:203665
> @SQ SN:GL000197.1
> AS:trascriptome.nix  LN:132711
> @SQ SN:GL000236.1
> AS:trascriptome.nix  LN:126779
> @SQ SN:GL000211.1
> AS:trascriptome.nix  LN:262752
> @SQ SN:GL000240.1
> AS:trascriptome.nix  LN:139915
> @SQ SN:GL000239.1
> AS:trascriptome.nix  LN:123498
> @SQ SN:GL000232.1
> AS:trascriptome.nix  LN:102363
> @SQ SN:GL000212.1
> AS:trascriptome.nix  LN:282254
> @SQ SN:GL000238.1
> AS:trascriptome.nix  LN:132164
> @SQ SN:GL000233.1
> AS:trascriptome.nix  LN:108194
> @SQ SN:GL000249.1
> AS:trascriptome.nix  LN:137299
> @SQ SN:GL000223.1
> AS:trascriptome.nix  LN:280265
> @SQ SN:GL000199.1
> AS:trascriptome.nix  LN:241559
> @SQ SN:10
> AS:trascriptome.nix  LN:135619864
> @SQ SN:GL000209.1
> AS:trascriptome.nix  LN:167344
> @SQ SN:GL000196.1
> AS:trascriptome.nix  LN:132047
> @SQ SN:GL000202.1
> AS:trascriptome.nix  LN:45488
> @SQ SN:GL000220.1
> AS:trascriptome.nix  LN:261756
> @SQ SN:GL000214.1
> AS:trascriptome.nix  LN:224473
> @SQ SN:GL000198.1
> AS:trascriptome.nix  LN:91607
> @SQ SN:GL000234.1
> AS:trascriptome.nix  LN:32013
> @SQ SN:GL000213.1
> AS:trascriptome.nix  LN:237085
> @SQ SN:GL000221.1
> AS:trascriptome.nix  LN:182753
> @SQ SN:GL000222.1
> AS:trascriptome.nix  LN:264765
> @SQ SN:GL000206.1
> AS:trascriptome.nix  LN:138000
> 
> 
> ----------lengths-----------------
> $chr1
> [1] 249250621
> 
> $chr2
> [1] 243199373
> 
> $chr3
> [1] 198022430
> 
> $chr4
> [1] 191154276
> 
> $chr5
> [1] 180915260
> 
> $chr6
> [1] 171115067
> 
> $chr7
> [1] 159138663
> 
> $chr8
> [1] 146364022
> 
> $chr9
> [1] 141213431
> 
> $chr10
> [1] 135534747
> 
> $chr11
> [1] 135006516
> 
> $chr12
> [1] 133851895
> 
> $chr13
> [1] 115169878
> 
> $chr14
> [1] 107349540
> 
> $chr15
> [1] 102531392
> 
> $chr16
> [1] 90354753
> 
> $chr17
> [1] 81195210
> 
> $chr18
> [1] 78077248
> 
> $chr19
> [1] 59128983
> 
> $chr20
> [1] 63025520
> 
> $chr21
> [1] 48129895
> 
> $chr22
> [1] 51304566
> 
> $chrX
> [1] 155270560
> 
> $chrY
> [1] 59373566
> 
> $chrM
> [1] 16571
> 
> $chr1_gl000191_random
> [1] 106433
> 
> $chr1_gl000192_random
> [1] 547496
> 
> $chr4_ctg9_hap1
> [1] 590426
> 
> $chr4_gl000193_random
> [1] 189789
> 
> $chr4_gl000194_random
> [1] 191469
> 
> $chr6_apd_hap1
> [1] 4622290
> 
> $chr6_cox_hap2
> [1] 4795371
> 
> $chr6_dbb_hap3
> [1] 4610396
> 
> $chr6_mann_hap4
> [1] 4683263
> 
> $chr6_mcf_hap5
> [1] 4833398
> 
> $chr6_qbl_hap6
> [1] 4611984
> 
> $chr6_ssto_hap7
> [1] 4928567
> 
> $chr7_gl000195_random
> [1] 182896
> 
> $chr8_gl000196_random
> [1] 38914
> 
> $chr8_gl000197_random
> [1] 37175
> 
> $chr9_gl000198_random
> [1] 90085
> 
> $chr9_gl000199_random
> [1] 169874
> 
> $chr9_gl000200_random
> [1] 187035
> 
> $chr9_gl000201_random
> [1] 36148
> 
> $chr11_gl000202_random
> [1] 40103
> 
> $chr17_ctg5_hap1
> [1] 1680828
> 
> $chr17_gl000203_random
> [1] 37498
> 
> $chr17_gl000204_random
> [1] 81310
> 
> $chr17_gl000205_random
> [1] 174588
> 
> $chr17_gl000206_random
> [1] 41001
> 
> $chr18_gl000207_random
> [1] 4262
> 
> $chr19_gl000208_random
> [1] 92689
> 
> $chr19_gl000209_random
> [1] 159169
> 
> $chr21_gl000210_random
> [1] 27682
> 
> $chrUn_gl000211
> [1] 166566
> 
> $chrUn_gl000212
> [1] 186858
> 
> $chrUn_gl000213
> [1] 164239
> 
> $chrUn_gl000214
> [1] 137718
> 
> $chrUn_gl000215
> [1] 172545
> 
> $chrUn_gl000216
> [1] 172294
> 
> $chrUn_gl000217
> [1] 172149
> 
> $chrUn_gl000218
> [1] 161147
> 
> $chrUn_gl000219
> [1] 179198
> 
> $chrUn_gl000220
> [1] 161802
> 
> $chrUn_gl000221
> [1] 155397
> 
> $chrUn_gl000222
> [1] 186861
> 
> $chrUn_gl000223
> [1] 180455
> 
> $chrUn_gl000224
> [1] 179693
> 
> $chrUn_gl000225
> [1] 211173
> 
> $chrUn_gl000226
> [1] 15008
> 
> $chrUn_gl000227
> [1] 128374
> 
> $chrUn_gl000228
> [1] 129120
> 
> $chrUn_gl000229
> [1] 19913
> 
> $chrUn_gl000230
> [1] 43691
> 
> $chrUn_gl000231
> [1] 27386
> 
> $chrUn_gl000232
> [1] 40652
> 
> $chrUn_gl000233
> [1] 45941
> 
> $chrUn_gl000234
> [1] 40531
> 
> $chrUn_gl000235
> [1] 34474
> 
> $chrUn_gl000236
> [1] 41934
> 
> $chrUn_gl000237
> [1] 45867
> 
> $chrUn_gl000238
> [1] 39939
> 
> $chrUn_gl000239
> [1] 33824
> 
> $chrUn_gl000240
> [1] 41933
> 
> $chrUn_gl000241
> [1] 42152
> 
> $chrUn_gl000242
> [1] 43523
> 
> $chrUn_gl000243
> [1] 43341
> 
> $chrUn_gl000244
> [1] 39929
> 
> $chrUn_gl000245
> [1] 36651
> 
> $chrUn_gl000246
> [1] 38154
> 
> $chrUn_gl000247
> [1] 36422
> 
> $chrUn_gl000248
> [1] 39786
> 
> $chrUn_gl000249
> [1] 38502
> 
> 
> 
> On 7 Feb 2012, at 13:00, Nicolas Delhomme wrote:
> 
>> Hi Francesco,
>> 
>> I'd need more info to help you.
>> 
>> Can you post the header of your bam files and the content of as.list(seqlengths(Hsapiens)).
>> 
>> In the meanwhile can you run easyRNASeq with normalize set to FALSE, no conditions and the outputFormat set to "RNAseq"?
>> 
>> That will give you back the RNAseq object and we can have a look at what is happening in there. 
>> 
>> I get the feeling that it has to do with the exon names being converted into integers (through the use of a factor maybe) but I thought I got that bug squashed already...
>> 
>> Cheers,
>> 
>> Nico
>> 
>> ---------------------------------------------------------------
>> Nicolas Delhomme
>> 
>> Genome Biology Computational Support
>> 
>> European Molecular Biology Laboratory
>> 
>> Tel: +49 6221 387 8310
>> Email: nicolas.delhomme at embl.de
>> Meyerhofstrasse 1 - Postfach 10.2209
>> 69102 Heidelberg, Germany
>> ---------------------------------------------------------------
>> 
>> 
>> 
>> 
>> 
>> On 7 Feb 2012, at 13:37, Lescai, Francesco wrote:
>> 
>>> Hi there,
>>> I'm preparing a tutorial, therefore I sliced two RNASeq bam files from chromosome one, in order to reduce the file size.
>>> 
>>> However, when I count the reads as it follows I have an error.
>>> thanks,
>>> Francesco
>>> 
>>> 
>>>> countDataSet <- easyRNASeq(filesDirectory=getwd(),
>>> +                            organism="HSapiens",
>>> +                            chr.sizes=as.list(seqlengths(Hsapiens)),
>>> +                            readLength=100L,
>>> +                            annotationMethod="biomaRt",
>>> +                            format="bam",
>>> +                            count="exons",
>>> +                            filenames=bamfiles,
>>> +                            normalize=TRUE,
>>> +                            outputFormat="DESeq",
>>> +                            conditions=conditions,
>>> +                           fitType="local"
>>> +                           )
>>> Checking arguments...
>>> Fetching annotations...
>>> Summarizing counts...
>>> Processing SRR349689_1.fastq.novo.new.chr1.bam
>>> Error in FUN(c("GL000229.1", "GL000200.1", "GL000228.1", "GL000241.1",  :
>>> (list) object cannot be coerced to type 'integer'
>>> 
>>> I also tried selecting the chromosome with chr.sel=c("1") or chr.sel=c("chr1") but in both cases I get the error
>>> Make sure that your file is valid, that your 'chr.sel' (if provided) contains valid values; i.e. values as found in the file, not as returned by 'RNAseq'.
>>> 
>>> The alignment has been done using the ENSEMBL reference, i.e. with chromosome listed as "1".
>>> 
>>> sessionInfo()
>>> R Under development (unstable) (2012-01-20 r58146)
>>> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>>> 
>>> locale:
>>> [1] C/en_US.UTF-8/C/C/C/C
>>> 
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>> 
>>> other attached packages:
>>> [1] GenomicFeatures_1.6.7              AnnotationDbi_1.17.15              easyRNASeq_1.1.5
>>> [4] ShortRead_1.13.13                  latticeExtra_0.6-19                RColorBrewer_1.0-5
>>> [7] Rsamtools_1.7.27                   genomeIntervals_1.11.0             intervals_0.13.3
>>> [10] DESeq_1.7.6                        locfit_1.5-6                       lattice_0.20-0
>>> [13] akima_0.5-7                        BiocInstaller_1.3.7                Biobase_2.15.3
>>> [16] edgeR_2.5.9                        limma_3.11.11                      biomaRt_2.11.1
>>> [19] BSgenome.Hsapiens.UCSC.hg19_1.3.17 BSgenome_1.23.2                    Biostrings_2.23.6
>>> [22] GenomicRanges_1.7.16               IRanges_1.13.22                    BiocGenerics_0.1.4
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] DBI_0.2-5          RCurl_1.9-5        RSQLite_0.11.1     XML_3.8-0          annotate_1.33.2
>>> [6] bitops_1.0-4.1     genefilter_1.37.0  geneplotter_1.33.1 grid_2.15.0        hwriter_1.3
>>> [11] rtracklayer_1.15.7 splines_2.15.0     survival_2.36-10   tools_2.15.0       xtable_1.6-0
>>> [16] zlibbioc_1.1.1
>>> 
>>> 
>>> ---------------------------------------------------------------------------------
>>> Francesco Lescai, PhD, EDBT
>>> Senior Research Associate in Genome Analysis
>>> University College London
>>> Faculty of Population Health Sciences
>>> Dept. Genes, Development & Disease
>>> ICH - Molecular Medicine Unit, GOSgene team
>>> 30 Guilford Street
>>> WC1N 1EH London UK
>>> 
>>> email: f.lescai at ucl.ac.uk<mailto:f.lescai at ucl.ac.uk>
>>> phone: +44.(0)207.905.2274
>>> [ext: 2274]
>>> --------------------------------------------------------------------------------
>>> 
>>> 
>>> [[alternative HTML version deleted]]
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>> 
>> 
> 
> ---------------------------------------------------------------------------------
> Francesco Lescai, PhD, EDBT
> Senior Research Associate in Genome Analysis 
> University College London
> Faculty of Population Health Sciences
> Dept. Genes, Development & Disease
> ICH - Molecular Medicine Unit, GOSgene team
> 30 Guilford Street
> WC1N 1EH London UK
> 
> email: f.lescai at ucl.ac.uk 
> phone: +44.(0)207.905.2274 
> [ext: 2274] 
> --------------------------------------------------------------------------------
> 



More information about the Bioconductor mailing list