[R] Question About Repeat Random Sampling from a Data Frame

Mon Dec 21 16:20:58 CET 2009

On Mon, Dec 21, 2009 at 4:12 PM, Adam Carr <adamlcarr at yahoo.com> wrote:
> Good Morning:
>
> I've read many, many posts on the r-help system and I feel compelled to quickly admit that I am relatively new to R. I do have several reference books around me, but I cannot count myself among the fortunate who seem to have strong programming intuition.
>
> I have a data set consisting of 1637 observations of five variables: tensile strength, yield strength, elongation, hardness and a character indicator with three levels: (Y)es, (N)o, and (F)ail.
>
> My objective is to randomly sample various subsets from this data set and then evaluate these subsets using simple parameters, among them tests for normality, shape and skewness. The data set is ordered by the character variable prior to sampling, and the samples are weighted to mirror representation in an overall, physical process.
>
> I am sampling the data set using this code:
>
> sample <- dataset[sample(1:1637, 500, prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE),]
>
> What I would like to do is iterate this process to create many (say 500 or more) sampled sets of n=500 and then evaluate each set for the parameters of interest. I would actually be evaluating each variable within each subset for my characteristic of interest. I am familiar with sampling and saving single columns of data to do this sort of thing, but I am not sure how to accomplish this with a multiple-variable data set.
>
> For example, I am currently iterating this using a clunky process:
>
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[ sample(1:1637,100,prob=c(rep(163.7/1637,513),rep(245.5/1637,197),rep(1227.8/1637,927)),replace = TRUE), ]
> }
>
> But this leaves me with the additional task of extracting each mysamples[[i]] element and converting it to a form on which I can apply a standard statistical function like mean() or skewness() to the variable columns within each subset. I have attempted to iteratively convert these list elements using this code:
>
> mat<-matrix(nrow=100,ncol=5)
> for (i in 1:length(mysamples))
> {mat[i]<-do.call('rbind',mysamples[i])}
>
> but running the code generates the error message: number of items to replace is not a multiple of replacement length. I have tried unsuccessfully, by reading many, many helpful r-help emails on this error, to understand my probably obvious mistake.
>
> Based on the small amount that I think I know about R it seems to me that sampling the data frame and containing the samples in a list is likely a pretty inefficient way to do this task. Any help that any of you could provide to assist me in iteratively sampling the data frame, and storing the samples in a form on which I can apply other statistical tests would be greatly appreciated.
>
> Thank you very much for taking the time to consider my questions.
>
> Adam

That's pretty much how I tend to do those things. What you seem to be
missing is the ?apply family:

mysamples.means <- lapply(mysamples, function(x) mean(x[, 1]))
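
To summarise every measured variable rather than only the first column,
the same idea can be nested: take column means inside each sample and
bind the results into a matrix. A rough sketch follows, run on a made-up
stand-in for your data. The column names (tensile, yield, elong, hard,
flag), the simulated values, and the order of the flag levels are all my
invention; only the group sizes and weights come from your post.

```r
set.seed(1)  # only so the sketch is reproducible

# Hypothetical stand-in for the real data: 1637 rows, four numeric
# variables plus a three-level flag, ordered by the flag.
# The mapping of flag levels to group sizes is an assumption.
n.per.level <- c(513, 197, 927)
dataset <- data.frame(
  tensile = rnorm(1637, 60, 5),
  yield   = rnorm(1637, 40, 4),
  elong   = rnorm(1637, 25, 3),
  hard    = rnorm(1637, 80, 6),
  flag    = rep(c("F", "N", "Y"), times = n.per.level)
)

# Row weights as in the original post; sample() rescales prob internally.
w <- rep(c(163.7, 245.5, 1227.8) / 1637, times = n.per.level)

# 10 weighted resamples of 100 rows each, stored in a list.
mysamples <- lapply(1:10, function(i)
  dataset[sample(nrow(dataset), 100, replace = TRUE, prob = w), ])

# Mean of each numeric column in each sample: one row per sample,
# one column per variable.
num.cols <- sapply(dataset, is.numeric)
mysamples.means <- t(sapply(mysamples, function(x) colMeans(x[, num.cols])))
dim(mysamples.means)
```

For a statistic with no column-wise shortcut, something like
sapply(x[, num.cols], f) in place of colMeans(...) should do the same
job, with f being whichever skewness() or normality test you are using.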

Hope that gets you on your way. If you want more help, I'd suggest
including an example data set in your follow-up messages.
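
On the error quoted above: as far as I can tell, mat[i] refers to a
single element of mat, whereas the right-hand side is an entire 100-row
data frame, so the sizes cannot match. If you would rather work with one
object than with a list, one option is to stack the samples into a
single data frame with an id column and summarise by that id. A sketch
on toy data (toy, draw, and the column names are made up for
illustration, and the sampling here is unweighted to keep it short):

```r
set.seed(2)  # only so the sketch is reproducible

# Toy data standing in for the real data frame (names are hypothetical).
toy <- data.frame(x = rnorm(50), y = runif(50),
                  flag = rep(c("N", "Y"), each = 25))

# Five resamples of 20 rows, each tagged with its sample number.
draws <- lapply(1:5, function(i) {
  s <- toy[sample(nrow(toy), 20, replace = TRUE), ]
  s$draw <- i
  s
})

# Stack the list into one 100-row data frame.
stacked <- do.call(rbind, draws)

# Per-sample means of the numeric columns: one row per value of draw.
by.draw <- aggregate(cbind(x, y) ~ draw, data = stacked, FUN = mean)
by.draw
```

Whether the list or the stacked form is more convenient probably depends
on what you do next; the list works well with lapply/sapply, the stacked
form with aggregate() and friends.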

/Gustaf

-- 
Gustaf Rydevik, M.Sci.
tel: +46(0)703 051 451
address:Essingetorget 40,112 66 Stockholm, SE
skype:gustaf_rydevik