[BioC] ShortRead 3' trimming negative length vectors are not allowed

Martin Morgan mtmorgan at fhcrc.org
Fri May 17 19:20:56 CEST 2013


On 05/17/2013 09:19 AM, Sam McInturf wrote:
> Hello everyone,
>    I am trying to do some quality trimming on some RNA seq reads (Illumina,
> 100 bp, single end).  I have been following the pipeline from the UCR
> manuals<http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Trimming-Low-Quality-3-Tails-from-R>
> but
> I have run into an error I can't seem to resolve.
>
> library(ShortRead)
> reads <- readFastq("myFile.fastq")
>
> qualityCutoff <- 10 # remove read tails with quality lower than this
>
> seqs <- sread(reads) # get sequence list
>
> qual <- PhredQuality(quality(quality(reads))) # get quality score list as
> PhredQuality
>
>> length(qual)                 #my length is positive
> [1] 39145018
>
>> myqual_mat <- matrix(charToRaw(as.character(unlist(qual))),
> nrow=length(qual), byrow=TRUE)

I'm guessing that what's happening is that as.character(unlist(qual)) is making 
a very large vector sum(width(qual)), larger than can be represented in (your) 
version of R (the output of sessionInfo() would be helpful...).

But I don't think you need to go this route;  have you looked at 
ShortRead::trimEnds & friends? For instance, after running

   exmample(trimEnds)

I have an object 'rfq' with some qualities

 > quality(rfq)
class: SFastqQuality
quality:
   A BStringSet instance of length 256
       width seq
   [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]VCHVMPLAS
   [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZPICCK
   [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUXEFLA
   [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZTQLOA
   [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]MJUJVLSS
   ...   ... ...
[252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]MJQNSAOC
[253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]VTVVRVMSM
[254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZUOOHH
[255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVXEJSJ
[256]    36 ]]]]]]]]]]]Y]]T]T]]]]TRYVMEVVRSRHHNH

And I can simultaneously trim reads and sequences with

     trimEnds(rfq, "V")

which you can see in action as

 > quality(trimEnds(rfq, "V"))
class: SFastqQuality
quality:
   A BStringSet instance of length 256
       width seq
   [1]    27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]
   [2]    31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][PZ
   [3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]TMJEUX
   [4]    31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]RJRZ
   [5]    28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]
   ...   ... ...
[252]    28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]
[253]    27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]
[254]    31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]YYVVTSZ
[255]    32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[OVX
[256]    24 ]]]]]]]]]]]Y]]T]T]]]]TRY

The following might be helpful to transform a large file --

setMethod(trimEnds, "character",
     function(object, a, left = TRUE, right = TRUE,
              relation = c("<=", "=="), ...,
              destination, yieldSize=1000000, ranges = FALSE)
{
     if (missing('destination'))
         stop("'destination' required")
     strm <- FastqStreamer(object, yieldSize)
     tot <- nNuc <- nTrimNuc <- 0L
     while (length(fq <- yield(strm))) {
         tot <- tot + length(fq)
         nNuc <- nNuc + sum(width(fq))
         fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges)
         nTrimNuc <- nTrimNuc + sum(width(fq))
         writeFastq(fq, destination, "a")
     }
     list(destination=destination,
          c(TotalReads=tot, TotalNucleotides=nNuc,
            TrimmedNucleotides=nNuc - nTrimNuc))
})

provide the first argument to trimEnds as a simple character vector, and provide 
a 'destination' file name. For example

   ## just a convenient path to a fastq file
   fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
   trimEnds(fl, "V", destination=tempfile())

A variation of this will be added to the 'devel' version of ShortRead, so if 
there are suggestions for improvements please let me know.

Martin



> Error in .Call(.NAME, ..., PACKAGE = PACKAGE) :
>    negative length vectors are not allowed
>
> as(qual, matrix) gives the same error, and as.matrix(qual) does as well,
> not sure how those two commands are different, but tried anyway.
>
> If I run the exact same commands, instead of the full 39145018 reads, I use
> the first 100, it works like a charm.  I am working on a linux cluster, if
> that is important
>
> Does anyone have an idea?
>
> Cheers!
>


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list