[BioC] FastqSampler sort by ID

Martin Morgan mtmorgan at fhcrc.org
Wed Nov 14 17:55:06 CET 2012


Hi Marcus --

On 11/13/2012 3:11 PM, Marcus Davy wrote:
> Hi Martin,
> when the FastqSampler uniformly samples from Fastq files, it appears that the
> header information in the id() accessor of a ShortReadQ object is not sorted in
> the sample with respect to tile information, as you would expect in the original
> source files.
>
> I was wondering whether an argument (e.g. sortByID=TRUE/FALSE) would be worth
> implementing in FastqSampler to specify whether to sort the sampled reads by
> their IDs rather than sorting manually after taking a sample?
>
> cheers,
>
> Marcus
>
>
> ## Using ShortRead example
> sp <- SolexaPath(system.file('extdata', package='ShortRead'))
> fl <- file.path(analysisPath(sp), "s_1_sequence.txt")
>
> ## Load Fastq data to change id information
> rfq <- readFastq(fl)
>
> ## No tile information in example
> id(rfq)

actually, from ?readFastq, argument withIds=TRUE will return ids.

In ShortRead 1.17.4 (in the devel branch) I've added an 'ordered' argument to 
FastqSampler, with default FALSE but when set to TRUE returns the reads in the 
order encountered in the file.

Thanks for the suggestion.

Martin

>
> ## Make up some new headers in the form;
> ## @HWI-EAS88:187:C0101VACXX:2:2206:9399:51464 1:N:0:
> tile <- rep(c(1001:(1001+7)), each=length(rfq)/8)
> x    <- gsub(".+_(\\d+)_\\d+$", "\\1", id(rfq))
> y    <- gsub(".+_\\d+_(\\d+)$", "\\1", id(rfq))
> ids  <- paste("@HWI-EAS88:187:C0101VACXX:2:", tile, ":", x, ":", y, " 1:N:0:",
> sep="")
>
> ## Hack overlaying ids with additional tile information
> rfq at id  <- BStringSet(ids)
>
> ## Write to a tmp file
> writeFastq(rfq, "/tmp/example.fastq")
> f <- open(FastqSampler("/tmp/example.fastq", 50))
>
> ## Sample
> set.seed(42)
> fqSample <- yield(f)
> close(f)
>
> ## Tile information is randomly ordered
> id(fqSample)
> sapply(strsplit(as.character(id(fqSample)), ":"), function(x)x[5])
>
> ## Manually sorting the ids
> o <- srorder(id(fqSample))
> fqSorted <- fqSample[o]
> id(fqSorted)
>
>  > sessionInfo()
> R version 2.15.0 (2012-03-30)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] ShortRead_1.16.1     latticeExtra_0.6-24  RColorBrewer_1.0-5
> [4] Rsamtools_1.10.1     lattice_0.20-10      Biostrings_2.26.2
> [7] GenomicRanges_1.10.2 IRanges_1.16.2       BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.18.0  bitops_1.0-4.1  grid_2.15.0     hwriter_1.3
> [5] parallel_2.15.0 stats4_2.15.0   tools_2.15.0    zlibbioc_1.4.0


-- 
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109



More information about the Bioconductor mailing list