[BioC] ShortRead 3' trimming negative length vectors are not allowed

Martin Morgan mtmorgan at fhcrc.org
Tue May 21 15:44:20 CEST 2013


On 05/17/2013 12:55 PM, Sam McInturf wrote:
> A few follow up questions.
> I wan to trim my reads to keep everything up until the first base with a quality
> below 20.  Is trimEnds() the right function for this, or is trimTail(k = 1, a =
> "see below") the way to go?
>
> In your reply, you used "V" as the cut off, what is that Q score?  My reads use
> an offset of 33 for the qual scores, so ! (ascii = 33) is the lowest score?  Is
> that consistent with what ShortRead is doing?  I would like a cutoff of 20, so
> 33+20 = 53, which is "5", is this what I should be telling trimEnds?

Sorry not to have answered sooner. Yes, it sounds like trimTail is what you're 
after. The letter / quality score encoding depends on class(quality(reads)), with

FastqQuality:

  !  "  #  $  %  &  '  (  )  *  +  ,  -  .  /  0  1  2  3  4  5  6  7  8  9  :
  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J
26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

SFastqQuality:

  ;  <  =  >  ?  @  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T
-5 -4 -3 -2 -1  0  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20

  U  V  W  X  Y  Z  [ \\  ]  ^  _  `  a  b  c  d  e  f  g  h  i
21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41

In response to your question, I made it so that in the 'devel' version of 
Bioconductor / ShortRead

   encoding(FastqQuality())

returns this information. I also added a function filterFastq that allows 
space-efficient filtering from one fastq file to another; there are also 
trimTails and friends that use this to transform a character vector of 'object's 
to new files given by the argument 'destinations'. Again this is available in 
the devel branch of ShortRead.

Thanks for the questions...

Martin


>
> Best
>
>
> On Fri, May 17, 2013 at 1:38 PM, Sam McInturf <smcinturf at gmail.com
> <mailto:smcinturf at gmail.com>> wrote:
>
>     Martin,
>        I was unaware of the trimEnds() function, it worked wonderfully without
>     having to use your modified trimEnds().  I imagine it is because the cluster
>     has a lot of memory available.  I ran it a second time using the
>     modification, but it was much slower.
>
>     Thanks again!
>
>
>     On Fri, May 17, 2013 at 12:20 PM, Martin Morgan <mtmorgan at fhcrc.org
>     <mailto:mtmorgan at fhcrc.org>> wrote:
>
>         On 05/17/2013 09:19 AM, Sam McInturf wrote:
>
>             Hello everyone,
>                 I am trying to do some quality trimming on some RNA seq reads
>             (Illumina,
>             100 bp, single end).  I have been following the pipeline from the UCR
>             manuals<http://manuals.__bioinformatics.ucr.edu/home/__ht-seq#TOC-Trimming-Low-__Quality-3-Tails-from-R
>             <http://manuals.bioinformatics.ucr.edu/home/ht-seq#TOC-Trimming-Low-Quality-3-Tails-from-R>>
>
>             but
>             I have run into an error I can't seem to resolve.
>
>             library(ShortRead)
>             reads <- readFastq("myFile.fastq")
>
>             qualityCutoff <- 10 # remove read tails with quality lower than this
>
>             seqs <- sread(reads) # get sequence list
>
>             qual <- PhredQuality(quality(quality(__reads))) # get quality score
>             list as
>             PhredQuality
>
>                 length(qual)                 #my length is positive
>
>             [1] 39145018
>
>                 myqual_mat <- matrix(charToRaw(as.character(__unlist(qual))),
>
>             nrow=length(qual), byrow=TRUE)
>
>
>         I'm guessing that what's happening is that as.character(unlist(qual)) is
>         making a very large vector sum(width(qual)), larger than can be
>         represented in (your) version of R (the output of sessionInfo() would be
>         helpful...).
>
>         But I don't think you need to go this route;  have you looked at
>         ShortRead::trimEnds & friends? For instance, after running
>
>            exmample(trimEnds)
>
>         I have an object 'rfq' with some qualities
>
>          > quality(rfq)
>         class: SFastqQuality
>         quality:
>            A BStringSet instance of length 256
>                width seq
>            [1]    36 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]__VCHVMPLAS
>            [2]    36 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZPICCK
>            [3]    36 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUXEFLA
>            [4]    36 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZTQLOA
>            [5]    36 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]__MJUJVLSS
>            ...   ... ...
>         [252]    36 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]__MJQNSAOC
>         [253]    36 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]__VTVVRVMSM
>         [254]    36 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZUOOHH
>         [255]    36 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVXEJSJ
>         [256]    36 ]]]]]]]]]]]Y]]T]T]]]]__TRYVMEVVRSRHHNH
>
>         And I can simultaneously trim reads and sequences with
>
>              trimEnds(rfq, "V")
>
>         which you can see in action as
>
>          > quality(trimEnds(rfq, "V"))
>         class: SFastqQuality
>         quality:
>            A BStringSet instance of length 256
>                width seq
>            [1]    27 ]]]]]]]]]]]]Y]Y]]]]]]]]]]]]
>            [2]    31 ]]]]]]]]]]]]Y]]]]]]]]]YPV]T][__PZ
>            [3]    32 ]]]]Y]]]]]V]T]]]]]T]]]]]V]__TMJEUX
>            [4]    31 ]]]]]]]]]]]]]]]]]]]]]]T]]]]__RJRZ
>            [5]    28 ]]]]]]]]]]]]]]]]]T]]]]]]]]]]
>            ...   ... ...
>         [252]    28 ]]]]]]]]]]]Y]]]]]NY]]]]Y]VR]
>         [253]    27 ]]]]]]]]]]]]]]]]]]]]]]]Y]]]
>         [254]    31 ]]]]Y]Y]]]]]]]OYY]]]Y]]]__YYVVTSZ
>         [255]    32 ]]]]]]]]]]]]]]]T]]]]]]]]]V]T[__OVX
>         [256]    24 ]]]]]]]]]]]Y]]T]T]]]]TRY
>
>         The following might be helpful to transform a large file --
>
>         setMethod(trimEnds, "character",
>              function(object, a, left = TRUE, right = TRUE,
>                       relation = c("<=", "=="), ...,
>                       destination, yieldSize=1000000, ranges = FALSE)
>         {
>              if (missing('destination'))
>                  stop("'destination' required")
>              strm <- FastqStreamer(object, yieldSize)
>              tot <- nNuc <- nTrimNuc <- 0L
>              while (length(fq <- yield(strm))) {
>                  tot <- tot + length(fq)
>                  nNuc <- nNuc + sum(width(fq))
>                  fq <- trimEnds(fq, a, left, right, relation, ..., ranges=ranges)
>                  nTrimNuc <- nTrimNuc + sum(width(fq))
>                  writeFastq(fq, destination, "a")
>              }
>              list(destination=destination,
>                   c(TotalReads=tot, TotalNucleotides=nNuc,
>                     TrimmedNucleotides=nNuc - nTrimNuc))
>         })
>
>         provide the first argument to trimEnds as a simple character vector, and
>         provide a 'destination' file name. For example
>
>            ## just a convenient path to a fastq file
>            fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
>            trimEnds(fl, "V", destination=tempfile())
>
>         A variation of this will be added to the 'devel' version of ShortRead,
>         so if there are suggestions for improvements please let me know.
>
>         Martin
>
>
>
>
>             Error in .Call(.NAME, ..., PACKAGE = PACKAGE) :
>                 negative length vectors are not allowed
>
>             as(qual, matrix) gives the same error, and as.matrix(qual) does as well,
>             not sure how those two commands are different, but tried anyway.
>
>             If I run the exact same commands, instead of the full 39145018
>             reads, I use
>             the first 100, it works like a charm.  I am working on a linux
>             cluster, if
>             that is important
>
>             Does anyone have an idea?
>
>             Cheers!
>
>
>
>         --
>         Computational Biology / Fred Hutchinson Cancer Research Center
>         1100 Fairview Ave. N.
>         PO Box 19024 Seattle, WA 98109
>
>         Location: Arnold Building M1 B861
>         Phone: (206) 667-2793 <tel:%28206%29%20667-2793>
>
>
>
>
>     --
>     Sam McInturf
>
>
>
>
> --
> Sam McInturf


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list