[BioC] Phred encoding

Taylor, Sean D sdtaylor at fhcrc.org
Fri Aug 16 18:50:37 CEST 2013


Martin and others,

I'm following a methodology in a paper for some NGS sequence analysis (He et al. Nature, 2010, vol 464, p610). In this paper, they apply three quality filters to the reads (36 bp reads from Genome Analyser II instrument):
1. all 36 bases in a read are required to have a Phred score of at least 23
2. No 'N' base was allowed anywhere in the 36 bases
3. not more than 3 mismatches permitted in the 36 bases.
After quality filtering, a mismatch was categorized as a true mutation if
1. the identical mutation was identified in 10 distinct reads (i.e. read whose first base was a t10 different positions)
2. the identical mutation was identified in at least 3 forward reads and 3 reverse reads.

For my data, I'm using 151bp reads from Illumina MySeq, some of these numbers are adjusted, but this is the overall analysis that I am attempting.

My question pertained to the first quality filter. The solution that I came up with implements as.numeric(), but I wasn't sure why as.raw() didn't work.

###############
#filter reads by quality
##############
gal<-readGAlignments(bamfile, use.names=TRUE, param=param)
qual <- mcols(gal)$qual
qualNum<-lapply(qual, function(x) as.numeric(x)-33) #list of quality strings
idx<-sapply(seq(length(qual)), function(i) length(which(qualNum[[i]]<qualityThreshold))==0) 
#returns true for any quality string that does not have a quality under the threshold
galq<-gal[idx]


For what its worth here are my solutions for the other quality filtering criteria. If anyone has a more efficient implementation, that would be really cool. I have solutions for the mutation calling that I'm still hammering out the details on. 
############
#filter reads by Ns
############
qseq<-setNames(mcols(galq)$seq, names(galq))
numberOfN<-as.vector(letterFrequency(qseq,"N"))
galqn<-galq[which(numberOfN==0)]

###########
#filter reads by mismatches, 
#no more than 2 mismatches allowed in first 50 bp
#no more than 3 mismatches allowed in entire read
#############
qseq<-mcols(galqn)$seq
qseq_on_ref <- sequenceLayer(qseq, cigar(galqnl),
                             from="query", to="reference")
pos<-start(galqn)
ends<-end(galqn)

mismatch.index<-sapply(seq(length(qseq_on_ref)), function(x){
  alignment<-append(subseq(refseq, start=pos[[x]], end=ends[[x]]), 
                    qseq_on_ref[x])
  cm<-consensusMatrix(alignment, as.prob=T)
  cs<-DNAString(consensusString(cm, ambiguityMap='N', threshold=0.5))
  if(!letterFrequency(cs[1:50], "N")<=2) return (F)
  return (letterFrequency(cs, "N")<=3)  
})
galqnm<-galqn[mismatch.index]


As always, thanks to all for your patient explaining of things to me. This mailing list is extraordinarily helpful and I hope that I have not abused it with my questioning.

Thanks,

Sean David Taylor
Postdoctoral Fellow
Fred Hutchinson Cancer Research Center
206-667-5544

________________________________________
From: Martin Morgan [mtmorgan at fhcrc.org]
Sent: Thursday, August 15, 2013 10:05 AM
To: Taylor, Sean D
Cc: Vincent Carey; bioconductor at r-project.org
Subject: Re: [BioC] Phred encoding

On 08/15/2013 09:49 AM, Taylor, Sean D wrote:
> Thanks for the response Vincent. I'm afraid I still don't understand what as.raw() is doing differently. Did part of your reply get cut off? After consulting ?as.raw() I still would have expected the answer that as.numeric() is generating (based on my limited understanding).
>

I'd be interested in knowing what your objective is -- drop reads with some low
quality calls? drop reads with overall low quality? trim reads of low quality
heads / tails?

Anway, 'unlist' strips the 'Quality' class

 > unlist(qual)
   10-letter "BString" instance
seq: BBBBBFFB4!

so any operations are based on 'BString' without reference to encoding.
as.integer / as.numeric then return the ascii symbol
(http://www.asciitable.com/) of the corresponding letter

 > as.integer(unlist(qual))
  [1] 66 66 66 66 66 70 70 66 52 33
 > as.numeric(unlist(qual))
  [1] 66 66 66 66 66 70 70 66 52 33

as.raw on a BString returns the raw (hexadecimal) representation of the ascii
encoding

 > selectMethod(as.raw, "BString")
Method Definition:

function (x)
as.raw(as.integer(x))
<environment: namespace:XVector>

Signatures:
         x
target  "BString"
defined "XRaw"

 > as.raw(1:100)
   [1] 01 02 03 04 05 06 07 08 09 0a 0b 0c 0d 0e 0f 10 11 12 13 14 15 16 17 18 19
  [26] 1a 1b 1c 1d 1e 1f 20 21 22 23 24 25 26 27 28 29 2a 2b 2c 2d 2e 2f 30 31 32
  [51] 33 34 35 36 37 38 39 3a 3b 3c 3d 3e 3f 40 41 42 43 44 45 46 47 48 49 4a 4b
  [76] 4c 4d 4e 4f 50 51 52 53 54 55 56 57 58 59 5a 5b 5c 5d 5e 5f 60 61 62 63 64
 > as.raw(66)
[1] 42

 From ?PhredQuality and rectangular data perhaps you'd like

   rowSums(as(qual, "matrix") < 23) == 0

or for irregular data

   rowSums(as(FastqQuality(qual), "matrix") < 23, na.rm=TRUE) == 0

also ?trimTails in ShortRead might be relevant.

Martin



> From: Vincent Carey [mailto:stvjc at channing.harvard.edu]
> Sent: Wednesday, August 14, 2013 6:32 PM
> To: Taylor, Sean D
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] Phred encoding
>
> from ?as.raw
>
>   A raw vector is printed with each byte separately represented as a
>       pair of hex digits.  If you want to see a character representation
>
>
> On Wed, Aug 14, 2013 at 6:25 PM, Taylor, Sean D <sdtaylor at fhcrc.org<mailto:sdtaylor at fhcrc.org>> wrote:
> Hi,
>
> I'm trying to quality filter my NGS reads and want to filter out reads that have bases below a quality threshold (say 23 for instance, using Illumina MiSeq with an offset of 33). Can anyone tell me why the results of the two functions as.raw() and as.numeric() give different results?
>
> qual<-PhredQuality(c("BBBBBFFB4!"))
> as.raw(unlist(qual))
> as.numeric(unlist(qual))
>
> Thanks,
> Sean
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org<mailto:Bioconductor at r-project.org>
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
>
>       [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793



More information about the Bioconductor mailing list