[BioC] A problem about trimLRPatterns

Fri Mar 2 18:39:19 CET 2012

On Mar 2, 2012, at 9:48 AM, wang peter wrote:
> dear Harris
> thank you for your perfect example: but i still have 3 small questions:
> 1. when j=15, s="GAATAGTACTGTAGGCACCATCAATAGATCGGAA"
> and p = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTT"
> and the edit distance between s and p is 16, not 8

Actually, not so perfect.  My code was not quite right, and that may be
the source of your question 1, although I really can't be sure what you
are asking.  Let me try again now.

	subject =   "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
	pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG"

Notice that the pattern has to start 2 positions before the subject, in
order to end at the same place.  So my loop should really start at j=-1
to reflect everything that trimLRPatterns tests.  Moreover, the subject
CAN NOT be shortened in the loop; only the pattern can be (and must be)
shortened:

	N = sapply(-1:nchar(subject), function(j) {
		p = substr(pattern, 1, nchar(subject)-j+1)
		neditEndingAt(ending.at=nchar(subject),
			pattern=p, subject=subject, with.indels=TRUE)
	})

The subject can not be shortened, to "s" as I was doing before, because
an insert may be desirable in matching p, and so the subject string to
which it may match, fuzzily, may be longer than s.

Now the result is:

	> N
	[1] 24 23 22 21 20 19 18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 0
	[26] 1 2 3 4 5 6 7 8 9 8 8 9 8 7 7 6 5 5 4 3 2 3  3 2 1 

To get us located explicitly on the subject, set

	names(N) = -1:nchar(subject)

Then
	> as.integer(N["15"])
	[1] 8
	> as.integer(N["23"])
	[1] 0 

So, where is the your edit distance of 16?  See more details below.

>> subject = "TATAGTAGATATTGGAATAGTACTGTAGGCACCATCAATAGATCGGAA"
>> pattern = "CTGTAGGCACCATCAATAGATCGGAAGAGCGGTTCAGAAGGAATGCCGAG"
>> 
>> sapply(15:nchar(subject), function(j) {
>>        s = substr(subject, j, nchar(subject))
>>        p = substr(pattern, 1, nchar(subject)-j+1)
>>        neditEndingAt(ending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE)
>> })
>> 
>>  [1]  8  7  6  5  4  3  2  1  0  2  4  6  8 10 11 11 10  9  8  8  9  8  7  7  6
>> [26]  5  5  4  3  2  4  3  2  1
> 
> 2. if the trimLRPatterns try to trim the longest substring in the
> scope of mismatch number,
> it will remove some bp which are not noise? right?

Your mismatch limits:

	M = as.integer(rev(0.25*1:nchar(pattern))) 
	> M
	[1] 12 12 12 11 11 11 11 10 10 10 10 9 9 9 9 8 8 8 8 7 7 7 7 6 6
	[26] 6 6 5 5 5 5 4 4 4 4 3 3 3 3 2 2 2 2 1 1 1 1 0 0  0

We are looking for the first time, in terms of increasing j, that the edit
distance N is <= your mismatch limit M.  This happens at j=15 (at the 17th
element of N), and the edit distance at this point is 8:

	> match(TRUE, N<=M)
	[1] 17
	> N[17]
	15 
	 8

This is confirmed by the following call, which is how trimLRPatterns does it:

	> which.isMatchingEndingAt(ending.at=nchar(subject), pattern=pattern,
		subject=subject, max.mismatch=M, with.indels=TRUE,
		auto.reduce.pattern=TRUE)
	[1] 17

> 3. so the trimLRPatterns algorithm is based on the edit distance, right? i think
> it uses dynamic programming to calculate the Levenshtein distance.
> but it seems much faster than my program which also uses dynamic programming

For a large subject, i.e. consisting of very many (short) reads, the
"auto.reduce.pattern=TRUE" feature in which.isMatchingEndingAt() allows
trimLRPatterns to make a single call to C to process all the reads.  For each
read, the looping in C (analogous to all the R code I've written above) can,
and does, stop at the first acceptable match, while of course this stopping
point will vary from read to read, so can't be known in advance.  This might
account for some of the timing you are seeing.

I cannot speak for why Herve's distance calculation, which is used by the
looping I've just sketched out, is fast, or faster than your method.  For
details see Biostrings/src/lowlevel_matching.c, in this case the function
_nedit_for_Proffset().

> thank you very much
> shan