[R] Problem to generate training data set and test data set

Jim Lemon jim at bitwrit.com.au
Tue Dec 26 01:16:20 CET 2006


Aimin Yan wrote:
> I have a full data set like this:
> 
>     aa bas    aas bms   ams bcu        acu     omega       y
> 1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
> 2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
> 3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
> 4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
> 5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
> 6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
> ......................................................................................
> 
> 
> aa has 19 levels, and each level has a different number of observations.
> I want to randomly pick 75% of the observations in each level to generate a
> training set,
> and 25% of the observations in each level to generate a testing set.
> 
Hi Aimin,
I haven't tested this exhaustively, but I think it does what you want.

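get.prob.sample<-function(x,prob=0.5) {
  # returns a logical vector: TRUE marks the observations picked from each level
  xlevels<-levels(as.factor(x))
  xsamp<-rep(FALSE,length(x))
  for(i in xlevels) {
   # positions of the observations in level i (which() drops NAs)
   indi<-which(x == i)
   # whole number of observations to draw from level i
   ni<-floor(length(indi)*prob)
   # index into indi so that a level with one observation is handled correctly
   xsamp[indi[sample.int(length(indi),ni)]]<-TRUE
  }
  return(xsamp)
}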
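
One detail worth knowing when sampling positions within a level: sample() treats a single positive number n as the sequence 1:n, so a level with only one observation needs care. The short sketch below illustrates this with a made-up position (42, purely hypothetical) and shows the usual workaround of sampling an index into the vector instead.

```r
# a level whose only observation sits at row 42 (made-up value for illustration)
idx<-42
set.seed(1)
# sample() sees the single number 42 and draws from 1:42
drawn<-sample(idx,1)
# sampling a position within idx always returns an element of idx
safe<-idx[sample.int(length(idx),1)]
print(c(drawn=drawn,safe=safe))
```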

get.prob.sample(mydata$aa,0.75)
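
The call above returns a logical vector with one element per row, so it can be used to index the data frame: TRUE rows form the training set and FALSE rows the testing set. Below is a minimal sketch of that step, assuming a small made-up data frame in place of mydata. So that the block runs by itself, it builds the same kind of logical vector inline with split() and lapply() rather than calling the function; the names in.train, training and testing are only illustrative.

```r
# toy data standing in for mydata (values are made up for illustration)
set.seed(1)
mydata<-data.frame(aa=rep(c("ALA","PRO","SER"),c(8,4,12)),
 omega=runif(24,0,180))
# row positions grouped by level of aa
rows.by.level<-split(seq_len(nrow(mydata)),mydata$aa)
# draw 75% of the positions within each level
train.rows<-unlist(lapply(rows.by.level,function(idx)
 idx[sample.int(length(idx),floor(length(idx)*0.75))]))
# logical vector marking the training rows
in.train<-seq_len(nrow(mydata)) %in% train.rows
training<-mydata[in.train,]
testing<-mydata[!in.train,]
# per-level counts in each set
print(table(training$aa))
print(table(testing$aa))
```

With the function itself, the same split would be in.train<-get.prob.sample(mydata$aa,0.75) followed by the two indexing lines.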

Jim


