[BioC] a problem of trimLRPatterns still confused me

Fri Nov 30 08:22:50 CET 2012

You are perfectly correct, but it is not terrible that we trimmed a little extra.  On the other
hand, there is a problem lurking here that, in opposite cases, we would not quite trim enough.

To simplify the example, we are only concerned with the last 10 letters of your subject and the
first 10 letters of Rpattern:

> substr(subject, 91, 100)
[1] "CAAGATCAAG"
> substr(Rpattern, 1, 10)
[1] "AGATCGGAAG"

Then note that
> neditEndingAt("AGATCGGAAG", "CAAGATCAAG", ending.at=10, with.indels=TRUE)
[1] 2

And your result:
> trimLRPatterns(Rpattern = "AGATCGGAAG", subject = "CAAGATCAAG", max.Rmismatch=2, with.Rindels=TRUE)
[1] ""

But you ask why, after deleting GG in the Rpattern, we don't trim only the last 8 letters of the subject
rather than all 10.  This is a valid question.

In general, with max.Rmismatch=2 and with.Rindels=TRUE, the matching suffix of the subject (ending.at=10)
could be as short as 8 letters or as long as 12.  The lower (C) level matching code does not actually
report the matching portion of the subject -- at least, it didn't long ago.  Thus, trimLRPatterns takes a
lazy out and trims the subject suffix of the same length as the matching Rpattern prefix.  Up to the
length of the pattern prefix, this has the advantage of always removing any possible adapter even if more
is removed than necessary, as you describe here.  If the match used 2 insertions in the pattern, rather
than deletions, so was 12 letters long, then we aren't removing all possible adapter, and this will hurt
or prevent later alignment.  One might say that we should trim 12 in all cases, even yours, and I have
proposed that.

Thanks, and let us know if I didn't get your point.

On Nov 29, 2012, at 11:17 PM, Wang Peter wrote:

> sorry to disturb you again
> 
> but i am still feeling confused
> see this problem
> 
> subject = "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTCACGGTCTGTATTCGTACTGAAAATCAAGATCAAG"
> 
> Rpattern = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG"
> 
> max.mismatchs <- 0.2*1:nchar(DNAString(Rpattern))
> 
> trimLRPatterns(Rpattern = Rpattern, subject = subject,
> max.Rmismatch=max.mismatchs, with.Rindels=TRUE)
> [1] "GGTAACTTTTCTGACACCTCCTGCTTAAAACCCCAAAGGTCAGAAGGATCGTGAGGCCCCGCTTTCACGGTCTGTATTCGTACTGAAAAT"
> 
> CAAGATC  AAG
>   AGATCGGAAGAGCACACGTCTGAACTCCAGTCACCAGATCATCTCGTATGCCGTCTTCTGCTTG
> 
> overlap length is 10 bp, so the allowed distance is 10*0.2=2
> 
> so it should trim the "AGATCAAG", not include the "CA"
> 
> I am confused why?
> 
> 
> 
> -- 
> shan gao
> Room 231(Dr.Fei lab)
> Boyce Thompson Institute
> Cornell University
> Tower Road, Ithaca, NY 14853-1801
> Office phone: 1-607-254-1267(day)
> Official email:sg839 at cornell.edu
> Facebook:http://www.facebook.com/profile.php?id=100001986532253
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor