[BioC] CIGAR and seq lengths differ after trimming

Martin Morgan mtmorgan at fhcrc.org
Fri May 31 01:53:55 CEST 2013


On 05/29/2013 04:02 PM, Sam McInturf wrote:
> Bioconductor,
>    I am working on an RNA sequencing experiment on Arabidopsis (Illumina,
> 100bp, single end), using tophat to map the reads, and R for the rest.
>   Many (9 of 12) libraries worked with out error.  Three of the libraries
> had no `accepted_hit.bam` file, while they did have a junctions.bed, and
> had an error of "Error: could not convert to BAM with samtools".  in the
> standard error output.
>
> Additionally in tophat_out/log/accepted_hits_sam_to_bam.log
> [samopen] SAM header is present: 7 sequences.
> Parse error at line 19971338: CIGAR and sequence length are inconsistent

Hi Sam -- maybe you can discover what this line of the sam file looks like, and 
compare the sequence there to the sequence in the fastq file, both before and 
after trimming. Martin

>
> Obviously, somewhere the CIGAR  or sequence lengths were altered and no
> longer 'mesh'.
> I run :
>
> library(ShortRead)
> reads <- readFastq("myFile")
> treads <- trimEnds(reads, a = "5")     #trim at a quality score of 20
> nreads <- treads[width(reads) > 21]    # keep only reads longer than 21 bp
> writeFastq(nreads, "myNewFile.fastq")
>
> then a 'normal' (few options changed) tophat call
> tophat -p 4 -o outdir -G my.gtf         BowtieIndex        myNewFile.fastq
>
> Any ideas on what went wrong, and why it didn't work some of the time?
>   Should I just rerun the trimming scripts and see if they map afterwards?
>
>
> Thanks in advance!
>
>> sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] ShortRead_1.18.0     latticeExtra_0.6-24  RColorBrewer_1.0-5
> [4] Rsamtools_1.12.3     lattice_0.20-15      Biostrings_2.28.0
> [7] GenomicRanges_1.12.3 IRanges_1.18.1       BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.20.0 bitops_1.0-5   grid_3.0.0     hwriter_1.3
>   stats4_3.0.0
> [6] zlibbioc_1.6.0
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list