[BioC] Fisher-Yates algorithm for DNA shuffling ?

Wed Jun 24 01:05:34 CEST 2009

Hi Robert,

Robert Castelo wrote:
> dear list,
> 
> i'm interested in using some proper DNA shuffling procedure like the
> Fisher-Yates algorithm:
> 
> R. A. Fisher and F. Yates, Example 12, Statistical Tables, London, 1938.
> 
> Richard Durstenfeld, Algorithm 235: Random permutation, CACM 7(7):420,
> July 1964.

The Fisher & Yates or Durstenfeld algos are just particular implementations
of a ramdom permutation generator.
R has the sample() function to generate a random permutation. The man page
doesn't specify what implementation is used but does that really matter?

> 
> i've been googling bioconductor and the r-project for this, and
> particularly looking at the Biostrings package, but I've failed to find
> anything. just in case i'm missing some implementation of this DNA
> shuffling in R, i'd like to ask the list whether anybody knows of that
> procedure being implemented in R or, even better, in a bioconductor
> package such that would work with DNAString objects.
> 
> the reason for that is that i would like to simulate random DNA strings
> preserving nucleotide composition in order to calculate empirical
> p-values for motif scanning.

Generate a random DNA string (of length 1000) with specific
nucleotide probabilities:

   x <- paste(sample(c("A", "C", "G", "T"), 1000, replace=TRUE,
              prob=c(0.2, 0.55, 0.1, 0.15)), collapse="")
   x <- DNAString(x)

Then:

   > alphabetFrequency(x, baseOnly=TRUE)
       A     C     G     T other
     214   531   110   145     0

If you really want to shuffle a given DNAString object x0 in order to
preserve its nucleotide composition:

   x1 <- x0[sample(length(x0))]

For example, shuffling the 'x' obtained above gives:

   x1 <- x[sample(length(x))]

   > alphabetFrequency(x1, baseOnly=TRUE)
       A     C     G     T other
     214   531   110   145     0

Performance:

   > system.time(for (i in 1:10000) {y <- x[sample(length(x))]})
      user  system elapsed
     5.760   0.008   5.764

Everything is already here and probably fast enough. No need to
re-implement anything.

Cheers,
H.

> 
> if it is not yet implemented i would like to suggest the package
> maintainers of Biostrings or any other biological sequence
> infrastructure package in Bioconductor to have it implemented in C for
> having maximum speed with this kind of simulations. just in case it
> eases my request here goes some simple R code implementing the
> Fisher-Yates shuffling algorithm:
> 
> fy <- function(seq) {
>   seq <- unlist(strsplit(seq, ""))
>   n <- length(seq)
>   i <- n
>   while (i > 0) {
>     j <- floor(runif(1)*i+1)
>     if (i != j) {
>       tmp <- seq[i]
>       seq[i] <- seq[j]
>       seq[j] <- tmp
>     }
>     i <- i - 1
>   }
>   paste(seq, collapse="")
> }
> 
> and this is an example of how to use it:
> 
> # generate some random DNA sequence of length 3
> seq <- paste(sample(c("A","C","G","T"), size=3, replace=TRUE), collapse="")
> seq
> [1] "ATG"
> 
> # shuffle the sequence 10,000 times and store all the permutations
> shuffledseqs <- c()
> for (i in 1:10000) shuffledseqs <- c(shuffledseqs, fy(seq))
> 
> # verify that indeed the permutations occur uniformly at random
> base::table(shuffledseqs)/10000
> shuffledseqs
>    AGT    ATG    GAT    GTA    TAG    TGA 
> 0.1693 0.1682 0.1678 0.1629 0.1610 0.1708 
>  1/6
> [1] 0.1666667
> 
> 
> thanks!
> robert.
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319