[BioC] a problem of trimLRPatterns still confused me

Sat Dec 1 15:34:51 CET 2012

See below.

On Nov 30, 2012, at 3:36 PM, Wang Peter wrote:

> thank you very much, Harris,you helped me again
> 
> now i understand, see the below
> 
> max.mismatchs <- 0.2*1:nchar(Rpattern)
> subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG"
> 
> Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
> 
> sapply((nchar(subject)-nchar(Rpattern)+1):nchar(subject), function(j) {
>        s = substr(subject, j, nchar(subject))
>        p = substr(Rpattern, 1, nchar(subject)-j+1)
>        neditEndingAt(ending.at=nchar(s), pattern = p, subject = s, with.indels=TRUE)
> })
> 
> all distance
> [1] 32 33 33 32 31 32 31 30 29 28 27 26 27 26 25 25 24 23 22 22 21 20 20 20
> [25] 20 19 18 17 18 17 17 18 17 16 15 16 15 14 13 12 12 11 10  9  8  7  6  6
> [49]  6  6  6  5  4  3  (2)  3  3  3  3  3  2  1  0  1
> 
> max.mismatchs
> [1]  0.2  0.4  0.6  0.8  1.0  1.2  1.4  1.6  1.8  (2.0)  2.2  2.4  2.6  2.8  3.0  3.2  3.4  3.6  3.8
> [20]  4.0  4.2  4.4  4.6  4.8  5.0  5.2  5.4  5.6  5.8  6.0  6.2  6.4  6.6  6.8  7.0  7.2  7.4  7.6
> [39]  7.8  8.0  8.2  8.4  8.6  8.8  9.0  9.2  9.4  9.6  9.8 10.0 10.2 10.4 10.6 10.8 11.0 11.2 11.4
> [58] 11.6 11.8 12.0 12.2 12.4 12.6 12.8
> 
> when the function find a distance < = the corresponding mismatch. see
> (2) and (2.0), the function stops.

Yes.
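
To make that stopping rule concrete, here is a small sketch that uses only
the numbers you posted. It assumes the candidates are scanned from the
longest suffix to the shortest, as you describe; the real work is done in
the compiled Biostrings code, which may be organized differently. The
names d, suffix_len and hit are just for illustration.

```r
# Edit distances as posted above; entry i belongs to the suffix of
# length 65 - i (longest suffix first).
d <- c(32, 33, 33, 32, 31, 32, 31, 30, 29, 28, 27, 26, 27, 26, 25, 25,
       24, 23, 22, 22, 21, 20, 20, 20, 20, 19, 18, 17, 18, 17, 17, 18,
       17, 16, 15, 16, 15, 14, 13, 12, 12, 11, 10, 9, 8, 7, 6, 6,
       6, 6, 6, 5, 4, 3, 2, 3, 3, 3, 3, 3, 2, 1, 0, 1)

suffix_len   <- length(d):1        # 64, 63, ..., 1
max.mismatch <- 0.2 * suffix_len   # allowance for each suffix length

# First (that is, longest) suffix whose distance is within its allowance.
hit <- which(d <= max.mismatch)[1]
print(hit)               # 55
print(suffix_len[hit])   # 10
```

So the scan stops at the 10-letter suffix, where the distance 2 meets the
allowance 0.2 * 10 = 2.0.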

> but i think the distance between those 10bp kmer should be 4, not 2
> 
> CAAGATC     AAG
>    AGATCGGAAG

You are correct about the edit-distance between the 2 strings of length
10, but that is not relevant here.  trimLRPatterns is based on

?`lowlevel-matching`:

     If 'with.indels' is 'TRUE', then the "edit distance" is used: for
     each position specified in 'at', P is compared to all the
     substrings S' of S starting at this position and the smallest
     distance is returned.

For trimLRPatterns this has to be read in mirror image, for the 'ending at'
case: substitute "ending at this position" for "starting at this position".

In your case, _this position_ is the *end* of the 10-letter subject S.
The 10-letter pattern P is compared to all the substrings S' of S *ending*
at the specified position.  The smallest distance, 2, is the one between P
and the 8-letter suffix S' of S.  Now it is possible that longer suffixes
S' of S would have the same distance 2 from P, and the *longest* of those
would be desirable.  I believe the low-level matching code actually finds
this longest match (without any mechanism, at present, to report it).
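
The numbers can be checked without Biostrings. The sketch below is a
base-R stand-in, not the package's own code: it uses utils::adist(), which
computes the plain Levenshtein distance (unit cost for insertion, deletion
and substitution), and compares P with every suffix S' of S. The names
suffix_len, suffixes and d are just for illustration.

```r
# Base-R stand-in for the Biostrings call: adist() gives the plain
# Levenshtein (edit) distance.
S <- "CAAGATCAAG"   # last 10 letters of the subject
P <- "AGATCGGAAG"   # first 10 letters of Rpattern

# All suffixes S' of S, i.e. all substrings ending at the last position.
suffix_len <- seq_len(nchar(S))
suffixes   <- substring(S, nchar(S) - suffix_len + 1, nchar(S))

# Distance from P to each suffix; the smallest one is what is reported.
d <- as.vector(adist(P, suffixes))
names(d) <- suffixes
print(d)
print(min(d))                   # 2
print(suffixes[which.min(d)])   # "AGATCAAG"
```

The full 10-letter string gives 4, as you computed, but the 8-letter
suffix AGATCAAG gives 2, and that minimum is the value compared against
max.mismatch.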