[BioC] trim adapter with mismatch tolerance

Mon Dec 19 22:14:47 CET 2011

See below.

On Dec 19, 2011, at 2:43 PM, Kunbin Qu wrote:
> Harris, the followings are my code and intention and thanks a lot  
> for your help!
>
> For read 2, it will be trimmed
> For read 3, there is a deletion at the 4th bp, CTG-CC, it was not  
> trimmed by my code, but I wanted to
> For read 4, there are two mismatches starting from the 4th bp:  
> CTGCAC, comparing with CTGGCC. I wanted this to be trimmed too, but  
> it did not
>
> -Kunbin
>
>
> Reads input:
>
> @SEA055486C_0096:7:1101:2898:2204#ACAGTG/1
> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCCCGCAGAGGGTTGTATTGGTTCGGCACCATGCCGCTC 
> TGCAGCCGGGACAGCCACTCGCA
> +
> gggggiiihihiiiiiiiiiiiihiiihiiiiiiigeecW`cc 
> \`aabcdcccccccccccccccccaaccccccccccaaccccccccbcc]
> @SEA055486C_0096:7:1101:3620:2209#ACAGTG/1
> CTGGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG 
> GCTC
> +
> ccccc_ad`[``ed_bRP^aZbccbbc\_cccccchhhdbb`bM\`\``bbc]V^RZ^____\^\^W 
> [YY^^^^
> @SEA055486C_0096:7:1101:3316:2429#ACAGTG/1
> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGG 
> CTCCAGCAGCGGAGGGGCCCTG
> +
> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi 
> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]a
> @SEA055486C_0096:7:1101:3316:2439#ACAGTG/1
> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGG 
> GCTCCAGCAGCGGAGGGGCCCTG
> +
> gggggiiiiiiiiiiiiegiiiiiihiiiih`eghi 
> [ghhiiiiiihigggggfeeeeccddccccccccc`bcccccccaccT]acaaa]aa
>
>
> code:
>
> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
> reads<-readFastq("t2.fq")
> seqs<-sread(reads)
> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,  
> max.Lmismatch = 1, ranges=T)
> seqsAR <- DNAStringSet(seqs, start=start(trimCoordsAR), end=end 
> (trimCoordsAR))
> seqsAR
>   A DNAStringSet instance of length 5
>     width seq
> [1]    93  
> CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
> [2]    49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
> [3]    92  
> CTGCCTTTGAGACTGGCCCACCTCGTCTCTCCCA...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
> [4]    93  
> CTGCACTTTGAGACTGGCCCACCTCGTCTCTCCC...ACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG

Basically, you just need (1) to explicitly allow indels, since  
unfortunately the
default is not to allow them, and (2) your mismatch tolerance isn't  
large enough.

To get your feet wet, start with some exploratory data analysis such as:

 > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs)
      [,1] [,2] [,3] [,4]
[1,]   16    0   15    2

 > neditStartingAt(starting.at=1, pattern=adapter, subject=seqs,  
with.indels=TRUE)
      [,1] [,2] [,3] [,4]
[1,]   12    0    1    2

So then, aside from possible non-calls ("N"), of which there aren't  
any in these
4 reads, we can now see pretty much what we need at least for reads  
2, 3, and 4:

 > trimLRPatterns(Lpattern=adapter, subject=seqs, max.Lmismatch=2,  
with.Lindels=TRUE)
   A DNAStringSet instance of length 4
     width seq
[1]    93  
CTGTGTGCTCAGGGGGCCTGGTGCCACACTCCCC...CATGCCGCTCTGCAGCCGGGACAGCCACTCGCA
[2]    49 GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTC
[3]    67  
TCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG
[4]    68  
GTCTCTCCCACCTGCCTGTGGCCTCTGGCACCAGCACCGTCCTGGGCTCCAGCAGCGGAGGGGCCCTG

Read 1 could be more subtle, or (more likely, I guess) just does not  
contain any
adapter to trim in the first place.

>> sessionInfo()
> R version 2.14.0 (2011-10-31)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
> other attached packages:
> [1] ShortRead_1.12.3    latticeExtra_0.6-19 RColorBrewer_1.0-5
> [4] Rsamtools_1.6.3     lattice_0.20-0      Biostrings_2.22.0
> [7] GenomicRanges_1.6.4 IRanges_1.12.5
>
> loaded via a namespace (and not attached):
>  [1] Biobase_2.14.0     bitops_1.0-4.1     BSgenome_1.22.0     
> grid_2.14.0
>  [5] hwriter_1.3        RCurl_1.8-0        rtracklayer_1.14.4  
> tools_2.14.0
>  [9] XML_3.6-2          zlibbioc_1.0.0
>>
>
> -----Original Message-----
> From: Harris A. Jaffee [mailto:hj at jhu.edu]
> Sent: Monday, December 19, 2011 4:58 AM
> To: Kunbin Qu
> Cc: bioconductor at r-project.org
> Subject: Re: [BioC] trim adapter with mismatch tolerance
>
> Please give a read example and exactly what you want to happen and  
> not happen, and, optionally, exactly you tried and what did happen,  
> along with the output of sessionInfo() in case that may be  
> relevant.  The spec and examples in the Biostrings::trimLRPatterns  
> help page are not ideal.
>
> You are correct that a single integer in max.Lmismatch should  
> prevent matches/trimming by anything but the entire Lpattern.
>
> I can't tell whether you are saying that you need with.Lindels=TRUE  
> or Lfixed="pattern".
>
> On Dec 18, 2011, at 11:38 PM, Kunbin Qu wrote:
>
>> Hi,
>>
>> I have some reads with 25 bp adapter at the five prime end. Some of
>> the reads do not have perfect match with the adapter, even with some
>> indels.  I tried to use trimLRPatterns to trim them off, but I got
>> confused with the max.Lmismatch parameter, as it is doing sub- string
>> trimming too. The adapter in my reads are  the full length of 25 bp,
>> without shortening to any substring. How should I specify the
>> parameter then? Or maybe I could use some other functions from
>> ShortRead? Please help. Thanks.
>>
>> -Kunbin
>>
>> code I used, but this code cannot accommodate the mismatches.
>>
>> reads<-readFastq("test.fq")
>> adapter<-DNAString("CTGGCCTTTGAGACTGGCCCACCTC")
>> seqs<-sread(reads)
>> trimCoordsAR<-trimLRPatterns(Lpattern=adapter, subject=seqs,
>> max.Lmismatch =1, ranges=T)
>>
>> _____________________________________________________________________ 
>> _
>> The contents of this electronic message, including any attachments,
>> are intended only for the use of the individual or entity to which
>> they are addressed and may contain confidential information. If you
>> are not the intended recipient, you are hereby notified that any use,
>> dissemination, distribution, or copying of this message or any
>> attachment is strictly prohibited. If you have received this
>> transmission in error, please send an e-mail to
>> postmaster at genomichealth.com and delete this message, along with any
>> attachments, from your computer.
>> 	[[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: http://news.gmane.org/
>> gmane.science.biology.informatics.conductor
>
>
> ______________________________________________________________________
> The contents of this electronic message, including any attachments,  
> are intended only for the use of the individual or entity to which  
> they are addressed and may contain confidential information. If you  
> are not the intended recipient, you are hereby notified that any  
> use, dissemination, distribution, or copying of this message or any  
> attachment is strictly prohibited. If you have received this  
> transmission in error, please send an e-mail to  
> postmaster at genomichealth.com and delete this message, along with  
> any attachments, from your computer.