[R] Question About Repeat Random Sampling from a Data Frame

Mon Dec 21 17:23:43 CET 2009

On Dec 21, 2009, at 10:12 AM, Adam Carr wrote:

> Good Morning:
>
> I've read many, many posts on the r-help system and I feel compelled
> to quickly admit that I am relatively new to R. I do have several
> reference books around me, but I cannot count myself among the
> fortunate who seem to have strong programming intuition.
>
> I have a data set consisting of 1637 observations of five variables:  
> tensile strength, yield strength, elongation, hardness and a  
> character indicator with three levels: (Y)es, (N)o, and (F)ail.
>
> My objective is to randomly sample various subsets from this data
> set and then evaluate these subsets using simple parameters, among
> them tests for normality, shape and skewness. The data set is
> ordered by the character variable prior to sampling, and the samples
> are weighted to mirror representation in an overall, physical process.
>
> I am sampling the data set using this code:
>
> sample <- dataset[sample(1:1637, 500,
>     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197),
>              rep(1227.8/1637, 927)),
>     replace = TRUE), ]
>
> What I would like to do is iterate this process to create many (say  
> 500 or more) sampled sets of n=500 and then evaluate each set for  
> the parameters of interest. I would actually be evaluating each  
> variable within each subset for my characteristic of interest. I am  
> familiar with sampling and saving single columns of data to do this  
> sort of thing, but I am not sure how to accomplish this with a  
> multiple-variable data set.
>
> For example, I am currently iterating this using a clunky process:
>
> mysamples<-list()
> for (i in 1:10){
> mysamples[[i]] <- dataset[sample(1:1637, 100,
>     prob = c(rep(163.7/1637, 513), rep(245.5/1637, 197),
>              rep(1227.8/1637, 927)),
>     replace = TRUE), ]
> }
>

Using lists to store intermediate results is not considered clunky in  
R. (You might want to provide statistical justification for the  
otherwise puzzling sampling strategy.)
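
A minimal sketch of the list-based approach using replicate(), in case
it helps. The data frame here is a made-up stand-in: the column names,
the random values, and the assignment of Y/N/F to the three blocks of
513, 197 and 927 rows are my assumptions, not your real data. Only the
weights and block sizes are copied from your post.

```r
set.seed(1)  # only so the sketch is reproducible
n_rows <- 1637

# Hypothetical stand-in for the real data set (names and values invented)
dataset <- data.frame(
  tensile    = rnorm(n_rows, 60, 5),
  yield      = rnorm(n_rows, 40, 4),
  elongation = rnorm(n_rows, 25, 3),
  hardness   = rnorm(n_rows, 70, 6),
  indicator  = rep(c("Y", "N", "F"), times = c(513, 197, 927)),
  stringsAsFactors = FALSE
)

# Per-row sampling weights, copied from the original post
w <- rep(c(163.7, 245.5, 1227.8) / 1637, times = c(513, 197, 927))

# simplify = FALSE makes replicate() return a plain list of data frames
mysamples <- replicate(
  500,
  dataset[sample(n_rows, 500, replace = TRUE, prob = w), ],
  simplify = FALSE
)

length(mysamples)    # number of sampled sets
dim(mysamples[[1]])  # rows and columns of the first sampled set
```

Each element of mysamples is an ordinary data frame, so mysamples[[3]]
can be used anywhere dataset could be.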

> But this leaves me with the additional task of defining each  
> mysample[i] iteration and converting it to a form on which I can  
> apply a standard statistical test like mean() or skewness() to the  
> variable columns within each subset. I have attempted to iteratively  
> convert these lists using this code:
>
> mat<-matrix(nrow=100,ncol=5)
> for (i in 1:length(mysamples))
> {mat[i]<-do.call('rbind',mysamples[i])}

It would help if you explained what you are attempting here in
ordinary English. There are 10 elements in mysamples, each of which is
a 100 x 5 dataframe, and mat is just one 100 x 5 matrix, which you
seem to be referencing incorrectly, given that it has two dimensions
rather than one. Furthermore, the columns of those dataframes are not
all of one class, since you said you had a character variable. Do you
really want these all in a character type matrix, which is what is
likely to happen given R's requirement that matrix elements be of
only one class? What you say above suggests not.
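
A small illustration of that coercion, using an invented two-row data
frame (the column names and values are made up for the example):

```r
# Hypothetical two-row data frame with one numeric and one character column
df <- data.frame(
  strength  = c(61.2, 58.9),
  indicator = c("Y", "F"),
  stringsAsFactors = FALSE
)

m <- as.matrix(df)

typeof(m)                    # "character": every cell is now a string
is.numeric(m[, "strength"])  # FALSE, so mean() on this column would fail
is.numeric(df$strength)      # TRUE: the data frame column is still numeric
```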

>
> but running the code generates the error message: number of items to  
> replace is not a multiple of replacement length.

Because of the way you are referencing the matrix, probably. If you  
wanted a 10 x 100 x 5 array, then create an array. In R, as far as I  
can tell anyway, matrices are necessarily of 2 dimensions. Tables and  
arrays can be of higher dimension.
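
If an array really is what you want, here is a rough sketch. It
assumes only the numeric columns are kept (so 4 rather than 5 columns,
to avoid the character coercion) and uses invented random matrices in
place of your samples:

```r
set.seed(2)  # only so the sketch is reproducible

# Ten hypothetical 100 x 4 numeric samples (stand-ins for the real ones)
samples <- replicate(10, matrix(rnorm(400), nrow = 100, ncol = 4),
                     simplify = FALSE)

# One slot per sample: sample index x row x variable
arr <- array(NA_real_, dim = c(10, 100, 4))
for (i in seq_along(samples)) {
  arr[i, , ] <- samples[[i]]
}

dim(arr)

# Mean of each variable within each sample, giving a 10 x 4 matrix
var_means <- apply(arr, c(1, 3), mean)
dim(var_means)
```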

> I have tried unsuccessfully, by reading many, many helpful r-help  
> emails on this error, to understand my probably obvious mistake.

Sorting out such problems is best done with smaller test objects. I  
was surprised to see that you thought it was necessary to convert  
dataframes to matrices in order to calculate descriptive statistics.  
Nothing could be farther from the truth. Furthermore, if for some
other more valid reason you wanted a list of matrices, there are
perfectly good functions that will convert a dataframe to a matrix:
as.matrix(), remembering of course that if there is a single
character variable in the dataframe the entire matrix will be of
type character, and data.matrix(), which instead converts character
and factor columns to integer codes so that the result is numeric.
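
For what it is worth, here is a sketch of computing statistics
directly on a list of dataframes, with no conversion to matrices. The
list is built from invented data (column names and values are
assumptions), and I use mean() and shapiro.test() because both ship
with R; skewness() is not in base R and would need an add-on package
such as e1071 or moments.

```r
set.seed(3)  # only so the sketch is reproducible

# Hypothetical list of ten 100-row data frames (stand-ins for the samples)
mysamples <- lapply(1:10, function(i) {
  data.frame(
    tensile   = rnorm(100, 60, 5),
    hardness  = rnorm(100, 70, 6),
    indicator = sample(c("Y", "N", "F"), 100, replace = TRUE),
    stringsAsFactors = FALSE
  )
})

# Mean of every numeric column in every sample: one row per sample
col_means <- t(sapply(mysamples, function(d) {
  num <- d[sapply(d, is.numeric)]  # drop the character column
  sapply(num, mean)
}))
dim(col_means)

# Shapiro-Wilk normality test p-value for one variable in each sample
sw_p <- sapply(mysamples, function(d) shapiro.test(d$tensile)$p.value)
length(sw_p)
```

Any function that takes a numeric vector can be dropped in where
mean() is used above.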
>
> Based on the small amount that I think I know about R it seems to me  
> that sampling the data frame and containing the samples in a list is  
> likely a pretty inefficient way to do this task. Any help that any  
> of you could provide to assist me in iteratively sampling the data  
> frame, and storing the samples in a form on which I can apply other  
> statistical tests would be greatly appreciated.
>
> Thank you very much for taking the time to consider my questions.
-- 

David Winsemius, MD
Heritage Laboratories
West Hartford, CT