[BioC] Trouble creating a ShortReadQ object and using writeFastq

Thu Feb 21 00:08:19 CET 2013

Hi Noah --

On 2/20/2013 2:47 PM, Noah Dowell wrote:
> Hello All,
>
> I am trying to do some preprocessing quality analysis on PacBio reads.  To make any plots of quality score or nucleotide frequency one should only look at reads of the same length.  One output from PacBio filtering of the raw data is a FASTQ file.  The nature of the PacBio reads is that they lack uniformity in their length dimension.   I am pretty sure the file is okay because the readSeqFile() function in the QRQC package is able to read the data in and the FASTQSummary object can be used to make plots.  Given the non-uniform length of reads though those plots show artifacts.  As far as can tell the FASTQSummary object can not be manipulated like a ShortRead or BioStrings object.
>
> So I turn to the ShortRead package.  I have also successfully used the readFastq() function in  ShortRead package to read in Illumina reads of different length and then preprocess or slice-n-dice those reads.  I am getting the following error though when I try this on PacBio fastq files:
> #####################################################################################
>> dirPath = "/Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data/"
>> pat1 = "filteredSTD_12cells.fastq"
>> seqs1 = readFastq(dirPath = dirPath, pattern = pat1, file = "filteredSTD_12cells.fastq")
> Error: Input/Output
>    file(s):
>      /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq
>    message: line too long /Users/ndowell/Documents/PacBio/130204_NLD1_XL_12cells_255/017249_12cells_filtered/data//filteredSTD_12cells.fastq:17521

is there something unusual about this line of this file, e.g., blank or 
otherwise? How long are reads, anyway? Maybe you can trigger this in a subset of 
the file with just one or two records, and you'd be willing to share that with 
me? The error could come from one of two places in the C code, and both are 
preceded by a comment 'this should never happen'

>
>
> #####################################################################################
>
> So I tried to manually build a ShortReadQ file using the following (there are generally 4 distinct lines in a FASTQ file):
> #workaround to error with too long lines
> seqTemp <- readLines("filteredSTD_12cells.fastq")
> seqs <- DNAStringSet(seqTemp[c(FALSE, TRUE, FALSE, FALSE)])
> ids <- BStringSet(seqTemp[c(TRUE, FALSE, FALSE, FALSE)])
> qual <- BStringSet(seqTemp[c(FALSE, FALSE, FALSE, TRUE)])
>
> SeqClean <- ShortReadQ(sread=seqs, quality = qual, id = ids)
>
> #This worked!!:
> length(SeqClean) #484232
> summary(width(SeqClean))
> Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
> 50    1368    2153    2785    3567   26940
>
> #But I get this error when I try to write the file (changing withIds to TRUE or full to TRUE gives same error):
>> writeFastq(SeqClean, filepath = "cleanReads.fastq", format = "fastq", mode = "w", full = FALSE, withIds = FALSE)
> Error in function (classes, fdef, mtable)  :
>    unable to find an inherited method for function ‘writeFastq’ for signature ‘"ShortReadQ", "missing"’

 > args(writeFastq)
function (object, file, mode = "w", full = FALSE, ...)
NULL

so I think your 'filepath' should be file="cleanReads.fastq".

>
> #I think something is not quite right with my construction of the ShortReadQ object and specifically the read IDs
> #I wonder if it is because it is slotting the ids into the wrong place:
>> head(ids)
>    A BStringSet instance of length 6
>      width seq
> [1]    72 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/21/0_5104
> [2]    75 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/2430_3193
> [3]    75 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/3237_6580
> [4]    75 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/53/6626_9948
> [5]    72 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/54/0_4929
> [6]    74 @m130205_030957_42152_c100453962550000001523068903201342_s1_p0/81/276_2420
> #####################################################################################
>
> My questions:
> 1. Is there any way to leverage the ability to slice-n-dice a FASTQ data set in the ShortRead package and then write to a new FASTQ file?
>
> 2 Should I just try to add in names to my sequences manually (to the ShortReadQ object) to alleviate the 'mssing' error above?
>
> 3. Is there any desire from the developers to merge the id() and names() functions?

would have been a good choice, in retrospect ;)

Martin

>
>
> Any help/hints would be appreciated.  I apologize for the lack of a toy-reproducible example; I was unsuccessful in my attempt to create a toy quality score.  Thank you in advance for your assistance and time.
>
> Best,
> Noah
>
>
>
>
>
>
>
>
> #####################################################################################
>
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>
> attached base packages:
> [1] grid      stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>   [1] qrqc_1.12.0          testthat_0.7         xtable_1.7-0         brew_1.0-6           biovizBase_1.6.2     ggplot2_0.9.3
>   [7] reshape_0.8.4        plyr_1.8             org.Hs.eg.db_2.8.0   RSQLite_0.11.2       DBI_0.2-5            AnnotationDbi_1.20.3
> [13] Biobase_2.18.0       BiocInstaller_1.8.3  ShortRead_1.16.3     latticeExtra_0.6-24  RColorBrewer_1.0-5   Rsamtools_1.10.2
> [19] lattice_0.20-13      Biostrings_2.26.3    GenomicRanges_1.10.6 IRanges_1.16.5       BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
>   [1] biomaRt_2.14.0         bitops_1.0-4.2         BSgenome_1.26.1        cluster_1.14.3         colorspace_1.2-1
>   [6] dichromat_2.0-0        digest_0.6.2           evaluate_0.4.3         GenomicFeatures_1.10.1 gtable_0.1.2
> [11] Hmisc_3.10-1           hwriter_1.3            labeling_0.1           MASS_7.3-23            munsell_0.4
> [16] parallel_2.15.2        proto_0.3-10           RCurl_1.95-3           reshape2_1.2.2         rtracklayer_1.18.2
> [21] scales_0.2.3           stats4_2.15.2          stringr_0.6.2          tools_2.15.2           XML_3.95-0.1
> [26] zlibbioc_1.4.0
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109