[BioC] A problem about trimLRPatterns

Harris A. Jaffee hj at jhu.edu
Fri Mar 2 01:00:27 CET 2012


I'm forwarding back to the list so that others can check my answer.

What is going on is something to this effect.

subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG"

sapply(15:nchar(subject), function(j) {
	s = substr(subject, j, nchar(subject))
	p = substr(pattern, 1, nchar(subject)-j+1)
	neditEndingAt(ending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE)
})

 [1]  8  7  6  5  4  3  2  1  0  2  4  6  8 10 11 11 10  9  8  8  9  8  7  7  6
[26]  5  5  4  3  2  4  3  2  1

(I started my sapply at j=15 to reduce the noise for j less than 15.  You
can start at j=1 if you like.  Those edit values will be larger than 8 and
thus too large given your limits.)

When you say "should be", you are focusing on the smallest number (0) in this
vector, which I admit reflects the smallest edit distance between the strings
s and p, as s and p vary with j, with j going from 15 to nchar(subject).

But trimLRPatterns is designed to perform the longest trim possible given your
mismatch limits, even if that is not the one with minimal edit distance.

I think you will find that edit distance 8 (the first element in the vector,
corresponding to j=15) is within your mismatch limits.  This is why the function
trims from position 15 on, leaving the 14-character result, "TATAGTAGATATTG".

Does that explain it?

On Mar 1, 2012, at 5:48 PM, wang peter wrote:

> can you try this sample
> the adapter is PCR2rc CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG
> 
> the function is
> max.mismatchs <- 0.25*1:nchar(DNAString(PCR2rc))
> trimmedCoords <- trimLRPatterns(Rpattern = PCR2rc, subject =
> sread(highQuaReads), max.Rmismatch= max.mismatchs,
> with.Rindels=T,ranges=T)
> 
> highQuaReads is one read:
> 
> @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA
> TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA
> +
> :BDDFDHHGHHJJJIIJIIIIEIGIJJJBGHIFEEH9EGIBHEHG6:8
> 
> 
> and the results is
> 
> @HWI-ST132:506:D0CNUABXX:3:1101:1576:1834 1:N:0:TTCACA
> TATAGTAGATATTG
> +
> :BDDFDHHGHHJJJ
> 
> 
> u see
> it should be
> TATAGTAGATATTGGAATAGTA
> 
> i have many samples like this
> 
> -- 
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253



More information about the Bioconductor mailing list