[R] Selecting subsamples

Richard A. O'Keefe ok at cs.otago.ac.nz
Fri Dec 5 04:05:48 CET 2003


christian_mora at vtr.net wrote
    [that he has a data set with 9 variables (columns) measured on 2000
     individuals (rows) and wants a sample] in which the sum of the
    volume of the individuals in that sample >= 100 cubic m.

Let's suppose that this information is held in d, a data frame, and that
the volume column is d$vol.

If sum(d$vol) < 100, there is no sample which satisfies your condition.
If sum(d$vol) >= 100, then d is such a sample as it stands.

If you want the smallest number of rows, then

    indices <- order(d$vol, decreasing=TRUE)

gives you the row indices sorted by decreasing volume;

    d$vol[indices]                       => the volumes in decreasing order
    cumsum(d$vol[indices])               => their cumulative sum
    sum(cumsum(d$vol[indices]) < 100.0)  => 1 less than the number of rows you want

so

    indices <- order(d$vol, decreasing=TRUE)
    d[indices[1:(sum(cumsum(d$vol[indices]) < 100.0) + 1)], ]

should be the answer you want.
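
To see the pieces working together, here is a small sketch on made-up data.
The data frame and its values are invented for illustration; only the column
name vol follows the assumption above. Note the comma after the row index,
which makes the subscript select rows rather than columns.

```r
# Toy data frame (hypothetical values); 'vol' is the volume column.
d <- data.frame(id = 1:6, vol = c(10, 40, 25, 60, 5, 30))

indices <- order(d$vol, decreasing = TRUE)  # row numbers, largest volume first
running <- cumsum(d$vol[indices])           # running total of the sorted volumes
k <- sum(running < 100) + 1                 # first position where the total reaches 100
smallest <- d[indices[1:k], ]               # the comma selects rows, not columns

smallest
sum(smallest$vol)
```

Taking the largest volumes first reaches the target in as few rows as
possible, because no other set of the same size can have a larger total.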

This is O(n log n), the cost of the sort, where n is the number of rows; in
your case n is 2000.

If you don't need the smallest sample, but just any old haphazard answer,

    indices <- sample(nrow(d))
    d[indices[1:(sum(cumsum(d$vol[indices]) < 100.0) + 1)], ]

should be useful.
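
If you expect to do this more than once, the random version could be wrapped
up roughly as follows. This is only a sketch: sample_until is a hypothetical
helper name, not an existing R function, the volumes are simulated with
runif() purely for the demonstration, and the check for an unreachable target
is the one described at the top of this message.

```r
# Hypothetical helper (not part of R): take rows in random order until the
# running total of column 'col' reaches 'target'.
sample_until <- function(d, target = 100, col = "vol") {
  if (sum(d[[col]]) < target) return(NULL)  # no sample can reach the target
  indices <- sample(nrow(d))                # random permutation of the row numbers
  k <- sum(cumsum(d[[col]][indices]) < target) + 1
  d[indices[1:k], , drop = FALSE]           # keep a data frame even with one column
}

set.seed(1)                                 # only so the draw can be repeated
d <- data.frame(id = 1:2000, vol = runif(2000, 0, 1))  # simulated volumes
s <- sample_until(d)
nrow(s)
sum(s$vol)
```

Each call without set.seed() gives a different sample, and dropping the last
row of any of them takes the total back below the target.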



