[BioC] Trimming of partial adaptor sequences

Tue Jul 23 01:06:48 CEST 2013

This is an answer I sent off-list to Sean, but I am also sending it to
the list, as some other responses suggest the topic is of wider interest:

We recently published 'Kraken', a suite of tools for NGS processing (the
umpteenth, indeed).
It has extensive facilities for adapter trimming though, as we do a lot
of small RNA sequencing, where adapters are often present but sometimes
only partially.
The tool that does this is reaper. It has a mechanism where
it recognises successively weaker types of matches, all under control
of the user.

1)    good match; a stretch of N nucleotides matches with at most
      E edit distance and optionally G gaps between adapter and read
      (N,E,G user-controlled)

2)    prefix match; similar to the above, but between the end of the
      read and the start of the adapter. Separate parameters
      are provided for this.

3)    exact head-to-tail match, for example where the last two or three
      bases of the read match the adapter start. This is user-controlled
      too.
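
As a rough illustration of the tiered idea, here is a minimal Python
sketch. It is not reaper's code and does not use reaper's option names:
the function names and the parameters good_len, good_err, prefix_len,
prefix_err and tail_len are made up for this example, and it counts
plain mismatches (Hamming distance) rather than edit distance, so it
ignores indels and gaps.

```python
def mismatches(a, b):
    """Number of differing positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_adapter_start(read, adapter, good_len=10, good_err=1,
                       prefix_len=6, prefix_err=1, tail_len=3):
    """Return the index in `read` where the adapter is judged to start,
    or len(read) if no tier matches. All parameters are hypothetical."""
    n = len(read)

    # Tier 1 (good match): the first good_len adapter bases occur
    # anywhere in the read with at most good_err mismatches.
    if len(adapter) >= good_len:
        head = adapter[:good_len]
        for i in range(n - good_len + 1):
            if mismatches(read[i:i + good_len], head) <= good_err:
                return i

    # Tier 2 (prefix match): the read ends in a shorter adapter prefix,
    # with at most prefix_err mismatches. Longest overlap is tried first.
    for k in range(min(n, len(adapter), good_len - 1), prefix_len - 1, -1):
        if mismatches(read[n - k:], adapter[:k]) <= prefix_err:
            return n - k

    # Tier 3 (exact head-to-tail match): the last few read bases equal
    # the adapter start exactly.
    for k in range(min(n, len(adapter), prefix_len - 1), tail_len - 1, -1):
        if read[n - k:] == adapter[:k]:
            return n - k

    return n


def trim(read, adapter, **params):
    """Cut the read at the detected adapter start."""
    return read[:find_adapter_start(read, adapter, **params)]


if __name__ == "__main__":
    adapter = "TGGAATTCTCGGGTGCCAAGG"
    insert = "CCCCAAAACCCCAAAA"
    for tail in (adapter[:12], adapter[:7], adapter[:3], ""):
        print(trim(insert + tail, adapter))
```

Each tier only runs if the stronger ones found nothing, so loosening
the weakest tier (tail_len here) trades more trimmed adapter remnants
against more reads losing a few genuine bases by chance.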

reaper will read (optionally gzipped) fastq and output fastq, and has
many more filtering options (e.g. quality, low-complexity). It can do
somewhere between 60M and 200M reads per hour, depending on read length.

More information:

   http://www.sciencedirect.com/science/article/pii/S1046202313002399

   http://www.ebi.ac.uk/research/enright/software/kraken

   ftp://ftp.ebi.ac.uk/pub/contrib/enrightlab/kraken/reaper/src/reaper-latest/doc/reaper.html

best,
Stijn

On Mon, Jul 22, 2013 at 08:02:24PM +0000, Taylor, Sean D wrote:
> We have been experimenting with a NGS protocol in which we insert sheared genomic fragments into a custom plasmid for sequencing on an Illumina MiSeq instrument. The insertion site of this plasmid is flanked by our own custom barcodes (N7) and ~80 nt Illumina-based adaptor sequence. We then PCR out the insert with barcodes and adaptors for sequencing. Our adaptor sequence is similar to the Illumina adaptor, but we use custom primer binding sites. We are not sure if the Illumina software will be able to recognize and trim our custom adaptors. We are trying to figure out the best way to trim read through into the 3' adaptor ourselves.  We have roughly three scenarios:
> 
> (1) The insert is long enough that we have no read through
> (2) The vector is empty, in which case the entire adaptor sequence is present
> (3) The insert is long enough to have useful data, but we get read-through into the 3' adaptor sequence that must be trimmed.
> 
> The solution we are currently working on is to identify the minimal sequence that is recognizable as the adaptor sequence and trim that using trimLRPatterns() in the Biostrings package.  Ideally we would like it if we could give trimLRPatterns() the entire adaptor sequence and have it recognize it on our reads even if it is only partially present. However, in my experimenting it did not seem to be able to do this. I thought I would ask the Bioconductor community if there are any better solutions to recognizing and trimming partial adaptor sequences.
> 
> Thanks in advance for any input.
> 
> Sean Taylor
> Post-doctoral Fellow
> Fred Hutchinson Cancer Research Center
> 206-667-5544

-- 
Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn