[BioC] Trimming of partial adaptor sequences

Wed Jul 24 07:34:06 CEST 2013

Hi Sean,

On 07/23/2013 03:26 PM, Taylor, Sean D wrote:
> Thanks all for your input. I will give some of these solutions a try and will probably go with whichever is faster and integrates well with the remainder of our analyses in R.
>
> Herve, thanks for the clarification. Looking back I see that I hadn't explored the function fully. I think it will work, but I have a few follow-up questions. Here is the sample data set that I'm playing around with:
>
> library(Rsamtools)
> bamfile <- system.file("extdata", "ex1.bam", package="Rsamtools")
> param <- ScanBamParam(what=c("seq", "qual"))
> gal <- readGappedAlignments(bamfile, use.names=TRUE, param=param)
>
> These sequences are all relatively short, about 35nts each. I set the first read as my "adaptor" and pulled out all the reads that have a start position overlapping the 'adaptor' sequence:
>
> gal<-gal[which(start(gal)<=width(gal[1]))]
> reads <- setNames(mcols(gal)$seq, names(gal))
> seqnames<-seqnames(gal)
> adaptor<-reads[[1]]
>
> The data are from two 'chromosomes', seq1 and seq2. My adaptor is from seq1, so I expect it to overlap on all seq1, but not on seq2. This lets me approximate best and worst case scenarios:
>
> seq1<-reads[seqnames=='seq1']
> seq2<-reads[seqnames=='seq2']
>
>> adaptor
>    36-letter "DNAString" instance
> seq: CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG
>
>> seq1
>    A DNAStringSet instance of length 16
>       width seq                                                names
>   [1]    36 CACTAGTGGCTCATTGTAAATGTGTGGTTTAACTCG               B7_591:4:96:693:509
>   [2]    35 CTAGTGGCTCATTGTAAATGTGTGGTTTAACTCGT                EAS54_65:7:152:36...
>   [3]    35 AGTGGCTCATTGTAAATGTGTGGTTTAACTCGTCC                EAS51_64:8:5:734:57
>   [4]    36 GTGGCTCATTGTAATTTTTTGTTTTAACTCTTCTCT               B7_591:1:289:587:906
>   [5]    35 GCTCATTGTAAATGTGTGGTTTAACTCGTCCATGG                EAS56_59:8:38:671...
>   ...   ... ...
> [12]    36 GTGGTTTAACTCGTCCATGGCCCAGCATTAGGGAGC               B7_591:3:188:662:155
> [13]    35 TTTAACTCGTCCATGGCCCAGCATTAGGGATCTGT                EAS56_59:2:225:60...
> [14]    35 TTAACTCGTCCATGGCCCAGCATTAGGGAGCTGTG                EAS51_66:7:328:39...
> [15]    35 AACTCGTCCATGGCCCAGCATTAGGGAGCTGTGGA                EAS51_64:5:257:96...
> [16]    35 GTACATGGCCCAGCATTAGGGAGCTGTGGACCCCG                EAS54_61:4:143:69...
>
>> seq2
>    A DNAStringSet instance of length 25
>       width seq                                                names
>   [1]    36 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTAA               B7_591:8:4:841:340
>   [2]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA                EAS54_67:4:142:94...
>   [3]    35 TTCAAATGAACTTCTGTAATTGAAAAATTCATTTA                EAS54_67:6:43:859...
>   [4]    35 TCAAATGAACTTCTGTAATTGAAAAATTCATTTAA                EAS1_93:2:286:923...
>   [5]    35 AATGAACTTCTGTAATTGAAAAATTCATTTAAGAA                EAS1_99:8:117:578...
>   ...   ... ...
> [21]    35 AAATTCATTTAAGAAATTACAAAATATAGTTGAAA                EAS54_65:8:305:81...
> [22]    35 AATTCATTTAAGAAATTACAAAATATAGTTGAAAG                EAS114_26:7:13:17...
> [23]    35 CATTTAAGAAATTACAAAATATAGTTGAAAGCTCT                EAS56_63:7:34:334...
> [24]    35 TTAAGAAATTACAAAATATAGTTGAAAGCTCTAAC                EAS114_45:3:32:13...
> [25]    40 TAAGAAATTACAAAATATAGTTGAAAGCTCTAACAATAGA           EAS139_19:5:70:31...
>
> If I use the basic parameters for trimLRPatterns on seq1:
>
>> trimmed_reads1<-trimLRPatterns(Lpattern=adaptor, subject=seq1)
>
> My expected sequence widths after trimming are:
>> width(seq1)-(length(adaptor)-start(gal)[which(seqnames=='seq1')]+1)
>   [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 34
>
> Actually trimmed sequence widths:
>> width(trimmed_reads1)
>   [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
>
> Here, the 4th sequence isn't trimmed because of some mismatches in the sequence. I expect that my data will have some mismatches, so I need to be able to work with those. So I tried setting the max.Lmismatch=5 to allow for up to 5 mismatches, but got no trimming:
>
>> trimmed_reads2<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=5)
>> width(trimmed_reads2)
>   [1]  0 35 35 36 35 35 36 35 35 35 35 36 35 35 35 35

Very strange indeed! I initially thought it was a bug and I was about
to commit a fix but then I double-checked the man page and discovered
the following "feature":

max.Lmismatch: Either an integer vector of length ‘nLp =
           nchar(Lpattern)’ representing an absolute number of
           mismatches (or edit distance if ‘with.Lindels’ is ‘TRUE’) or
           a single numeric value in the interval ‘[0, 1)’ representing
           a mismatch rate when aligning terminal substrings (suffixes)
           of ‘Lpattern’ with the beginning (prefix) of ‘subject’
           following the conventions set by ‘neditStartingAt’,
           ‘isMatchingStartingAt’, etc.

           When ‘max.Lmismatch’ is ‘0L’ or a numeric value in the
           interval ‘[0, 1)’, it is taken as a "rate" and is converted
           to ‘as.integer(1:nLp * max.Lmismatch)’, analogous to agrep
           (which, however, employs ‘ceiling’).

           Otherwise, ‘max.Lmismatch’ is treated as an integer vector
           where negative numbers are used to prevent trimming at the
           ‘i’-th location. When an input integer vector is shorter than
           ‘nLp’, it is augmented with enough ‘-1’s at the beginning to
           bring its length up to ‘nLp’. Elements of ‘max.Lmismatch’
           beyond the first ‘nLp’ are ignored.

I know, all this is a little bit tedious to read and not intuitive
at all (at least to me, but I'm probably not the only one), but
basically it means that if you want to allow up to 5 mismatches,
you need to specify a vector of length 'Lpattern' and filled with
the value 5:

   > trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=rep(5, 
length(adaptor)))
     A DNAStringSet instance of length 16
        width seq                                              names 

    [1]     0 
B7_591:4:96:693:509
    [2]     1 T 
EAS54_65:7:152:36...
    [3]     3 TCC 
EAS51_64:8:5:734:57
    [4]     5 TCTCT 
B7_591:1:289:587:906
    [5]     7 TCCATGG 
EAS56_59:8:38:671...
    ...   ... ...
   [12]    23 TCCATGGCCCAGCATTAGGGAGC 
B7_591:3:188:662:155
   [13]    26 TCCATGGCCCAGCATTAGGGATCTGT 
EAS56_59:2:225:60...
   [14]    27 TCCATGGCCCAGCATTAGGGAGCTGTG 
EAS51_66:7:328:39...
   [15]    29 TCCATGGCCCAGCATTAGGGAGCTGTGGA 
EAS51_64:5:257:96...
   [16]    27 CCCAGCATTAGGGAGCTGTGGACCCCG 
EAS54_61:4:143:69...

Unfortunately this feature has been in trimLRPatterns() since the
function was added to Biostrings 4 years ago so I guess I shouldn't
touch it. Let's just say that "if it's documented then it's not a
bug" and try to find some reconfort with that :-/

H.

>
> In my experimenting, I had gone straight to this and saw that it didn't trim partial sequence with these settings, so I had concluded that it wouldn't do the job. After playing with it, I tried adjusting trimLRPatterns=0.5 to allow 50% mismatches (if I understand that correctly) and this time get too much trimming on the last sequence:
>
>> trimmed_reads3<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.5)
>> width(trimmed_reads3)
>   [1]  0  1  3  5  7 11 12 13 16 20 20 23 26 27 29 27
>
> I tried setting max.Lmismatch =0.1 but got too little trimming again:
>
>> trimmed_reads4<-trimLRPatterns(Lpattern=adaptor, subject=seq1, max.Lmismatch=0.1)
>> width(trimmed_reads4)
>   [1]  0  1  3 35  7 11 12 13 16 20 20 23 26 27 29 34
>
>   I could probably optimize until I got it just right, but with a large data set that might not be too practical.
>   If I try these same settings against seq2, where there is not supposed to be overlap with the adaptor, I get erroneous trimming with mismatch =0.5. With mismatch = 0.1, it is better, but it decides to trim off the first nucleotide on a few of them:
>
>> trimmed_reads5<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.5)
>> trimmed_reads6<-trimLRPatterns(Lpattern=adaptor, subject=seq2, max.Lmismatch=0.1)
>> width(seq2)
>   [1] 36 35 35 35 35 35 35 35 36 35 35 35 35 35 35 35 35 35 35 35 35 35 35 35 40
>> width(trimmed_reads5)
>   [1] 27 26 26 27 11 26 26 26 14 13 13 13 28 28 31 23 27 28 29 27 28 29 33 27 40
>> width(trimmed_reads6)
>   [1] 36 35 35 35 35 35 35 35 36 35 35 35 34 34 35 35 34 35 35 35 35 35 35 35 40
>
> So what is the best way to optimize this? Obviously you are trying to balance recognizing the whole adaptor sequence without trimming off too many false positives. Also, can I set a limit that requires to have at least n bases overlap with the adaptor suffix?
>
> Thanks for your help (again)!
> Sean
>
>
>> -----Original Message-----
>> From: Hervé Pagès [mailto:hpages at fhcrc.org]
>> Sent: Monday, July 22, 2013 5:37 PM
>> To: Taylor, Sean D
>> Cc: bioconductor at r-project.org
>> Subject: Re: Trimming of partial adaptor sequences
>>
>> Hi Sean,
>>
>> On 07/22/2013 01:02 PM, Taylor, Sean D wrote:
>>> We have been experimenting with a NGS protocol in which we insert
>>> sheared genomic fragments into a custom plasmid for sequencing on an
>>> Illumina MiSeq instrument. The insertion site of this plasmid is
>>> flanked by our own custom barcodes (N7) and ~80 nt Illumina-based
>>> adaptor sequence. We then PCR out the insert with barcodes and
>>> adaptors for sequencing. Our adaptor sequence is similar to the
>>> Illumina adaptor, but we use custom primer binding sites. We are not
>>> sure if the Illumina software will be able to recognize and trim our
>>> custom adaptors. We are trying to figure out the best way to trim read
>> through into the 3'
>>> adaptor ourselves.  We have roughly three scenarios:
>>>
>>> (1) The insert is long enough that we have no read through
>>>
>>> (2) The vector is empty, in which case the entire adaptor sequence is
>>> present
>>>
>>> (3) The insert is long enough to have useful data, but we get
>>> read-through into the 3' adaptor sequence that must be trimmed.
>>>
>>> The solution we are currently working on is to identify the minimal
>>> sequence that is recognizable as the adaptor sequence and trim that
>>> using trimLRPatterns() in the Biostrings package.  Ideally we would
>>> like it if we could give trimLRPatterns() the entire adaptor sequence
>>> and have it recognize it on our reads even if it is only partially present.
>>
>> May be I misunderstand what you are trying to do exactly but yes, you can
>> give the entire adaptor sequence to trimLRPatterns() and it will recognize it
>> on our reads even if it's only partially present:
>>
>>     library(Biostrings)
>>
>>     adaptor <- DNAString("ACCAGGACAA")  # entire adaptor
>>     reads <- DNAStringSet(c(
>>       "GACAATTTATTT", # adaptor partially present on the left
>>       "CAATTTATTTGC", # adaptor partially present on the left
>>       "TTTATTTACCAG", # adaptor partially present on the right
>>       "CAATTTTTTACC"  # adaptor partially present on both ends
>>     ))
>>
>> Then:
>>
>>     > trimLRPatterns(Lpattern=adaptor, Rpattern=adaptor, subject=reads)
>>       A DNAStringSet instance of length 4
>>         width seq
>>     [1]     7 TTTATTT
>>     [2]     9 TTTATTTGC
>>     [3]     7 TTTATTT
>>     [4]     6 TTTTTT
>>
>> Note that trimLRPatterns() expects that, when the adaptor is partially
>> present on the left (resp. right), what's present is a suffix (resp.
>> prefix) of the adaptor, and not an arbitrary substring of it. Is it what you
>> expect too?
>>
>> Thanks,
>> H.
>>
>>> However, in my experimenting it did not seem to be able to this. I
>>> thought I would ask the Bioconductor community if there are any better
>>> solutions to recognizing and trimming partial adaptor sequences.
>>>
>>> Thanks in advance for any input.
>>>
>>> Sean Taylor
>>>
>>> Post-doctoral Fellow
>>>
>>> Fred Hutchinson Cancer Research Center
>>>
>>> 206-667-5544
>>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319