[R] Problem to generate training data set and test data set

Charles C. Berry cberry at tajo.ucsd.edu
Tue Dec 26 18:43:38 CET 2006


What you describe is called stratified sampling. It was discussed last 
month (and other times) on this list:

 	http://finzi.psych.upenn.edu/R/Rhelp02a/archive/90220.html

Using

 	RSiteSearch("stratified sampling")

will produce many hits to relevant articles and packages.
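
For orientation, here is a minimal base-R sketch of a stratified 75/25
split. It is only one possible approach, not necessarily the one given in
the archived thread above. The data frame name `dat`, the toy data, and the
object names `rows_by_level`, `train_idx`, `train` and `test` are all
illustrative assumptions; only the column name `aa` comes from the question
quoted below.

```r
# Hypothetical toy data standing in for the poster's full data set
set.seed(1)  # make the random split reproducible
dat <- data.frame(
  aa    = factor(rep(c("ALA", "PRO", "PHE"), times = c(8, 12, 20))),
  omega = runif(40, 0, 180)
)

# Row indices grouped by level of aa
rows_by_level <- split(seq_len(nrow(dat)), dat$aa)

# Within each level, draw about 75% of its rows at random.
# Indexing via sample.int() avoids the sample(x, n) surprise
# when a level has only one row.
train_idx <- unlist(lapply(rows_by_level, function(i) {
  i[sample.int(length(i), size = round(0.75 * length(i)))]
}), use.names = FALSE)

train <- dat[train_idx, ]
test  <- dat[setdiff(seq_len(nrow(dat)), train_idx), ]

table(train$aa)  # per-level counts in the training set
table(test$aa)   # per-level counts in the test set
```

Because the per-level sample size is rounded, levels with few observations
will not split exactly 75/25, and a very small level may contribute no rows
to one of the two sets.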



On Mon, 25 Dec 2006, Aimin Yan wrote:

> I have a full data set like this:
>
>    aa bas    aas bms   ams bcu        acu     omega       y
> 1 ALA   0 127.71   0 69.99   0 -0.2498560  79.91470 outward
> 2 PRO   0  68.55   0 55.44   0 -0.0949008  76.60380 outward
> 3 ALA   0  52.72   0 47.82   0 -0.0396550  52.19970 outward
> 4 PHE   0  22.62   0 31.21   0  0.1270330 169.52500  inward
> 5 SER   0  71.32   0 52.84   0 -0.1312380   7.47528 outward
> 6 VAL   0  12.92   0 22.40   0  0.1728390 149.09400  inward
> ......................................................................................
>
>
> aa has 19 levels, and there is a different number of observations for each
> level.
> I want to randomly pick 75% of the observations in each level to generate a
> training set,
> and 25% of the observations in each level to generate a testing set.
>
> Does anyone know how to do this?
>
> Thanks
>
> Aimin Yan

Charles C. Berry                        (858) 534-2098
                                          Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	         UC San Diego
http://biostat.ucsd.edu/~cberry/         La Jolla, San Diego 92093-0717


