[R] Stratified random sampling in R?

Gabor Grothendieck ggrothendieck at myway.com
Sat Feb 21 05:30:12 CET 2004

Try this.  ptrain and ptest are the proportions of cases in the
training and test samples.  The line after that generates a random
vector of factor level codes, f, for testing purposes.

ptrain <- 0.3; ptest <- 0.2
set.seed(1); f <- cut(runif(100), 3, labels = FALSE)

# seq_len() rather than seq(), so p = 0 or an empty x gives an empty result
first <- function(x, p) x[seq_len( ceiling(p * length(x)) )]

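# index with sample.int(): sample(i) would permute 1:i if a level had one case
perms <- lapply(split( seq_along(f), f ), function(i) i[sample.int(length(i))])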

train <- lapply( perms, function(x) first(x, ptrain) )
test <- lapply( perms, function(x) first(rev(x), ptest) )

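first takes a vector and a proportion and returns that proportion
of elements from the beginning of the vector, rounded up.  Assuming
p > 0 and a non-empty vector, it always returns at least one
element.  perms is a list holding a random permutation of the case
numbers at each level.  Finally, in the last two statements, we take
elements off the beginning of each permutation for our training set
and off the end for our test set.  The two sets cannot overlap as
long as the rounded-up counts together do not exceed the number of
cases at that level; in a level with a single case, both sets would
get that case.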
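
As a small illustration of taking elements off both ends of one
permutation, here is a sketch for a single stratum of 10 cases.  The
names perm, train_idx and test_idx are made up for this example, and
first is written with seq_len() so that p = 0 would give an empty
result.

```r
# Same idea as first() in the post; seq_len() allows an empty result
first <- function(x, p) x[seq_len(ceiling(p * length(x)))]

set.seed(42)
perm <- sample(1:10)                 # random permutation of one stratum

train_idx <- first(perm, 0.3)        # front of the permutation
test_idx  <- first(rev(perm), 0.2)   # back of the permutation

length(train_idx)                    # 3 cases for training
length(test_idx)                     # 2 cases for testing
intersect(train_idx, test_idx)       # integer(0): no shared cases
```

Because the two samples come from opposite ends of the same
permutation, they stay disjoint here: 3 + 2 is no more than 10.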

At the end, train and test are each lists of vectors of case
numbers representing the training and testing samples.
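
To get from those lists to actual training and test datasets, one
possibility is to flatten each list with unlist() and use the result
as row numbers.  The sketch below assumes a data frame; dat, its
columns class and value, and the names train_rows and test_rows are
all hypothetical and only there to make the example self-contained.

```r
set.seed(1)
# Hypothetical data: 100 rows with a three-level class column
dat <- data.frame(class = factor(sample(c("a", "b", "c"), 100, replace = TRUE)),
                  value = rnorm(100))

ptrain <- 0.3; ptest <- 0.2
first <- function(x, p) x[seq_len(ceiling(p * length(x)))]

# Permute the row numbers within each class; indexing with sample.int()
# avoids sample()'s special handling of a single number
perms <- lapply(split(seq_len(nrow(dat)), dat$class),
                function(i) i[sample.int(length(i))])

train <- lapply(perms, first, p = ptrain)
test  <- lapply(perms, function(x) first(rev(x), ptest))

# Flatten the per-class lists into plain vectors of row numbers
train_rows <- unlist(train, use.names = FALSE)
test_rows  <- unlist(test,  use.names = FALSE)

train_dat <- dat[train_rows, ]
test_dat  <- dat[test_rows, ]

table(train_dat$class)                     # per-class training counts
length(intersect(train_rows, test_rows))   # 0 when the samples are disjoint
```

Keeping train and test as lists is handy when you want to look at
each class separately; the flattened vectors are what you would
normally pass to the row index of the data frame.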

Date:   Fri, 20 Feb 2004 18:55:51 -0800 
From:   Jonathan Greenberg <greenberg at ucdavis.edu>
To:   R-help <r-help at stat.math.ethz.ch> 
Subject:   [R] Stratified random sampling in R? 

Is there an easy way to do stratified random sampling based on a factor
column in R? E.g. I want to extract a random 10% of the data from a dataset
for each class (so each class may contribute a different number of entries,
depending on its size). On a related note, if this is easily doable, is
there an easy way to extract TWO non-overlapping stratified random sample
datasets (e.g. if I want to have a training and a test dataset)? Thanks!

Jonathan Greenberg
Graduate Group in Ecology, U.C. Davis
AIM: jgrn307 or jgrn3007
MSN: jgrn307 at msn.com or jgrn3007 at msn.com
