[BioC] FastqSampler not sampling correctly in ShortRead

Martin Morgan mtmorgan at fhcrc.org
Fri Apr 27 21:23:28 CEST 2012


On 04/27/2012 12:07 PM, Marcus Davy wrote:
> Hi Martin,
> thanks for the prompt bug fix.
>
> I was wondering if there is a mechanism to allow a seed for reproducible
> random sample generation, and if so how that might be implemented,
> set.seed() or otherwise.

Hi Marcus --

FastqSampler uses R's random number stream so after example(FastqSampler)

 > f = open(FastqSampler(fl, 50))
 > set.seed(123); s1 = yield(f); set.seed(123); s2 = yield(f)
 > identical(s1, s2)
[1] TRUE
 > packageVersion("ShortRead")
[1] '1.14.3'

Martin

>
> cheers,
>
> Marcus
>
>
> On Wed, Apr 25, 2012 at 11:43 PM, Martin Morgan <mtmorgan at fhcrc.org
> <mailto:mtmorgan at fhcrc.org>> wrote:
>
>     Thanks Marcus, this serious bug was introduced in ShortRead 1.12.3 /
>     1.13.9 on Dec 4/5 2011. It is fixed in ShortRead 1.4.3 / 1.5.4.
>     Records 2:n were actually records 2:n in the file.
>
>     FWIW in case there is confusion the intended way to draw independent
>     samples is just
>
>       f = open(FastqSample(fl, 50)
>       s1 = yield(f); s2 = yield(f)
>       close(f)
>
>     Martin
>
>
>     On 04/25/2012 12:50 AM, Marcus Davy wrote:
>
>         Hi,
>         there appears to be a problem in FastqSampler() which seems to
>         sample the
>         first read at random, but then the next n-1 reads
>         are always the same, so the next sample obtained with yield() is not
>         independent. Is this a bug, or my code implementation wrong?
>
>         ## From example(FastqSampler)
>         sp<- SolexaPath(system.file('__extdata', package='ShortRead'))
>         fl<- file.path(analysisPath(sp), "s_1_sequence.txt")
>
>         f<- FastqFile(fl)
>         rfq<- readFastq(f)
>
>         set.seed(1)
>         f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
>         s1<- yield(f)    # sample of size n=50
>         set.seed(2)
>         f<- FastqSampler(fl, 50, readerBlockSize=1e8, verbose=TRUE)
>         s2<- yield(f)    # independent sample of size 50
>
>
>         ## Intersection length between samples always 50-1
>         length(intersect(id(s1), id(s2)))
>
>         ## First read is different, the next 2:n reads are the same
>         head(sread(s1))
>         head(sread(s2))
>         close(f)
>
>             head(sread(s1))
>
>            A DNAStringSet instance of length 6
>              width seq
>         [1]    36 GTCTGGAAACGTACGGATTGTTCAGTAACT__TTACTC # Different
>         [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same
>         [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ...
>         [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ...
>         [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ...
>         [6]    36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG Same
>
>             head(sread(s2))
>
>            A DNAStringSet instance of length 6
>              width seq
>         [1]    36 GTTTACGAATTAAATCGAAGTGGACTGCTG__GGGGGA # Different
>         [2]    36 GATTTCTTACCTATTAGTGGTTGAACAGCA__TCGGAC #Same
>         [3]    36 GCGGTGGTCTATAGTGTTATTAATATCAAT__TTGGGT # ...
>         [4]    36 GTTACCATGATGTTATTTCTTCATTTGGAG__GTAAAA # ...
>         [5]    36 GTATGTTTCTCCTGCTTATCACCTTCTTGA__AGGCTT # ...
>         [6]    36 GTTCTCTAAAAACCATTTTTCGTCCCCTTC__GGGGCG #Same
>
>         Marcus
>
>             sessionInfo()
>
>         R version 2.15.0 (2012-03-30)
>         Platform: x86_64-apple-darwin9.8.0/x86___64 (64-bit)
>
>         locale:
>         [1] C
>
>         attached base packages:
>         [1] stats     graphics  grDevices utils     datasets  methods   base
>
>         other attached packages:
>         [1] ShortRead_1.14.2    latticeExtra_0.6-19 RColorBrewer_1.0-5
>         [4] Rsamtools_1.8.4     lattice_0.20-6      Biostrings_2.24.1
>         [7] GenomicRanges_1.8.3 IRanges_1.14.2      BiocGenerics_0.2.0
>
>         loaded via a namespace (and not attached):
>         [1] Biobase_2.16.0 bitops_1.0-4.1 grid_2.15.0    hwriter_1.3
>           stats4_2.15.0
>         [6] tools_2.15.0   zlibbioc_1.2.0
>
>                 [[alternative HTML version deleted]]
>
>         _________________________________________________
>         Bioconductor mailing list
>         Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>         https://stat.ethz.ch/mailman/__listinfo/bioconductor
>         <https://stat.ethz.ch/mailman/listinfo/bioconductor>
>         Search the archives:
>         http://news.gmane.org/gmane.__science.biology.informatics.__conductor
>         <http://news.gmane.org/gmane.science.biology.informatics.conductor>
>
>
>
>     --
>     Computational Biology
>     Fred Hutchinson Cancer Research Center
>     1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109
>
>     Location: M1-B861
>     Telephone: 206 667-2793 <tel:206%20667-2793>
>
>


-- 
Computational Biology
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N. PO Box 19024 Seattle, WA 98109

Location: M1-B861
Telephone: 206 667-2793



More information about the Bioconductor mailing list