[BioC] ShortRead and filtering of duplicate reads

Tue Jan 19 02:08:45 CET 2010

Hi Johannes --

Johannes Waage wrote:
> Hi all,
> 
> Does anyone know if the ShortRead package has functionality to filter out
> duplicate reads, but only reads with more than n duplicates, to avoid read
> stacks caused by PCR amplification? I can only find srduplicated(), but it
> doesn't seem to have functionality for specifying n duplicate reads.

I don't think there's a built-in function. This

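  library(ShortRead)  # provides srrank()

  f <- function(x, n)
  {
      r <- srrank(x)             # identical sequences share a rank
      counts <- tabulate(r)      # counts[i]: number of reads with rank i
      r %in% which(counts >= n)  # TRUE for reads seen >= n times
  }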

returns a logical vector that is TRUE for each read whose sequence occurs
>= n times, so

  aln[!f(sread(aln), 5)]

would drop every copy of any read occurring 5 or more times (one might
want to think about whether the reads need to map to the same location,
too).
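
To make the counting logic and the same-location caveat concrete, here is a minimal sketch in base R only, so it runs without ShortRead. The function name occursAtLeast() and the toy vectors seqs and pos are made up for illustration; match(x, unique(x)) is only a stand-in for the role srrank() plays above, not the ShortRead implementation. With real data one would presumably build the key from the aligned reads' sequence plus chromosome, strand and position, but that part is an assumption and not shown here.

```r
# Toy stand-in for f(): base R only, no ShortRead required.
# match(x, unique(x)) gives identical strings the same integer id,
# which is the property of srrank() that the approach relies on.
occursAtLeast <- function(x, n)
{
    id <- match(x, unique(x))   # same key -> same integer id
    counts <- tabulate(id)      # counts[i]: occurrences of id i
    counts[id] >= n             # TRUE where the key occurs >= n times
}

# Hypothetical data: read sequences and their mapped start positions
seqs <- c("ACGT", "ACGT", "ACGT", "TTGA", "ACGT", "TTGA", "GGCC")
pos  <- c(100L, 100L, 100L, 250L, 900L, 250L, 40L)

# Sequence only: "ACGT" occurs 4 times, so all four copies are flagged
bySeq <- occursAtLeast(seqs, 3)

# Sequence and position: only the three "ACGT" reads at position 100
# share a key, so the "ACGT" read at position 900 is kept
key <- paste(seqs, pos, sep = "@")
dup <- occursAtLeast(key, 3)
kept <- seqs[!dup]
```

Keying on sequence plus position is more conservative: a sequence that is simply common in the genome is not treated as a PCR stack unless the copies also pile up at one location.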

Martin

> 
> Thanks in advance!
> 
> Regards,
> JW,
> Uni. of Copenhagen
> 

-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793