[BioC] Phred encoding

Martin Morgan mtmorgan at fhcrc.org
Thu Aug 15 19:05:17 CEST 2013


On 08/15/2013 09:49 AM, Taylor, Sean D wrote:
> Thanks for the response Vincent. I'm afraid I still don't understand what as.raw() is doing differently. Did part of your reply get cut off? After consulting ?as.raw() I still would have expected the answer that as.numeric() is generating (based on my limited understanding).
>

I'd be interested in knowing what your objective is -- drop reads with some low 
quality calls? drop reads with overall low quality? trim reads of low quality 
heads / tails?

Anyway, 'unlist' strips the 'Quality' class

 > unlist(qual)
   10-letter "BString" instance
seq: BBBBBFFB4!

so any operations are based on 'BString' without reference to encoding. 
as.integer / as.numeric then return the ASCII code 
(http://www.asciitable.com/) of each letter, with no offset subtracted

 > as.integer(unlist(qual))
  [1] 66 66 66 66 66 70 70 66 52 33
 > as.numeric(unlist(qual))
  [1] 66 66 66 66 66 70 70 66 52 33
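
To make the relationship to Phred scores concrete, here is a minimal base-R sketch. It assumes the offset of 33 mentioned in the original question, and it uses utf8ToInt() on a plain character string as a stand-in for as.integer() on the BString; the variable names are only illustrative.

```r
# Quality string from the original question, as a plain character string
qual_string <- "BBBBBFFB4!"

# ASCII codes of each letter; same values as as.integer(unlist(qual)) above
codes <- utf8ToInt(qual_string)

# Assumed offset of 33 (Sanger / Illumina 1.8+ style encoding)
offset <- 33L

# Phred scores are the ASCII codes minus the offset
scores <- codes - offset
scores
# [1] 33 33 33 33 33 37 37 33 19  0
```

So 'B' is a score of 33, 'F' is 37, '4' is 19 and '!' is 0 under that offset.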

as.raw on a BString returns the same ASCII codes as a raw vector, which R 
prints in hexadecimal

 > selectMethod(as.raw, "BString")
Method Definition:

function (x)
as.raw(as.integer(x))
<environment: namespace:XVector>

Signatures:
         x
target  "BString"
defined "XRaw"

 > as.raw(1:100)
   [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19
  [26] 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
  [51] 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b
  [76] 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64
 > as.raw(66)
[1] 42
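
A small base-R sketch of the same point: the raw value and the integer are the same number, and only the printed form differs. Nothing here depends on Biostrings; the variable names are illustrative.

```r
# as.raw(66) prints as 42 because raw vectors are shown in hexadecimal
x <- as.raw(66)

# Converting back recovers the decimal ASCII code
back_to_int <- as.integer(x)

# The byte is the letter 'B' in either representation
letter <- rawToChar(x)

# Hexadecimal form of 66, as a character string
hex <- format(as.hexmode(66L))

# Raw bytes of the whole quality string, converted back to decimal codes
all_codes <- as.integer(charToRaw("BBBBBFFB4!"))
```

Here back_to_int is 66, letter is "B", hex is "42", and all_codes matches the as.integer() / as.numeric() output shown earlier.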

From ?PhredQuality, and for rectangular data (all reads the same width), perhaps you'd like

   rowSums(as(qual, "matrix") < 23) == 0

or for irregular data

   rowSums(as(FastqQuality(qual), "matrix") < 23, na.rm=TRUE) == 0

also ?trimTails in ShortRead might be relevant.
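
For anyone who wants to see what the rowSums() filter is doing, here is a rough base-R sketch of the same logic on plain character strings. phred_matrix() is a hypothetical helper written for this illustration, not a Biostrings or ShortRead function; it assumes an offset of 33 and pads shorter reads with NA, which is why na.rm=TRUE matters for irregular data.

```r
# Hypothetical helper: one row per read, one column per cycle,
# shorter reads padded with NA (assumes an offset of 33)
phred_matrix <- function(quals, offset = 33L) {
  scores <- lapply(quals, function(q) utf8ToInt(q) - offset)
  width <- max(lengths(scores))
  padded <- lapply(scores, function(s) c(s, rep(NA_integer_, width - length(s))))
  do.call(rbind, padded)
}

# Three made-up reads of unequal width; the first is from the question
quals <- c("BBBBBFFB4!", "BBBBBFFBBB", "FFFF")
m <- phred_matrix(quals)

# Keep reads with no base call below a quality of 23
keep <- rowSums(m < 23, na.rm = TRUE) == 0
keep
# [1] FALSE  TRUE  TRUE
```

The first read is dropped because '4' (19) and '!' (0) fall below 23; the other two are kept.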

Martin



> From: Vincent Carey [mailto:stvjc at channing.harvard.edu]
> Sent: Wednesday, August 14, 2013 6:32 PM
> To: Taylor, Sean D
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] Phred encoding
>
> from ?as.raw
>
>   A raw vector is printed with each byte separately represented as a
>       pair of hex digits.  If you want to see a character representation
>
>
> On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor at fhcrc.org<mailto:sdtaylor at fhcrc.org>> wrote:
> Hi,
>
> I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results?
>
> qual<-PhredQuality(c("BBBBBFFB4!"))
> as.raw(unlist(qual))
> as.numeric(unlist(qual))
>
> Thanks,
> Sean
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


-- 
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


