[R] sampling question

Fri Jun 29 00:19:46 CEST 2007

Let's assume your zcta data looks like this:

    set.seed(12345) ## temporary for reproducibility
    zcta <- data.frame( zipcode=LETTERS[1:5], prop=runif(5) )
    zcta
    zipcode      prop
1       A 0.7209039
2       B 0.8757732
3       C 0.7609823
4       D 0.8861246
5       E 0.4564810

This says that 72.1% of the population in zipcode A is female, ..., and 
45.6% in zipcode E is female.

Now suppose you sampled 20 people, recorded the zipcode (and other 
variables) for each, and stored the result in 'samp':

    samp <- data.frame( id=1:20,
                        zipcode=LETTERS[ sample(1:5, 20, replace=TRUE) ])

Now, I am not sure what you want to do. But I could see two possible 
meanings from your message.

1) If you want to sample 10 observations, with each observation weighted 
INDEPENDENTLY by the proportion of women in its zipcode, try something 
like the following. The problem with this option is that the chance of a 
zipcode turning up depends on how many observations it has in 'samp', not 
only on its proportion.

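    ## attach each observation's zipcode-level proportion
    comb <- merge( samp, zcta, all.x=TRUE )
    comb <- comb[ order(comb$id), ]
    ## draw 10 rows without replacement, weighted by prop
    comb[ sample( nrow(comb), 10, prob=comb$prop ), ]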
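
To see that dependence, here is a rough sketch (my own illustration, not 
something from the original question) of the chance that the first draw 
lands in each zipcode. It rebuilds the toy 'zcta' and 'samp' so that it 
runs on its own; 'first_draw' is just a name I made up.

```r
set.seed(12345)
zcta <- data.frame( zipcode=LETTERS[1:5], prop=runif(5) )
samp <- data.frame( id=1:20,
                    zipcode=LETTERS[ sample(1:5, 20, replace=TRUE) ] )
comb <- merge( samp, zcta, all.x=TRUE )

## chance that the first draw falls in each zipcode:
## (rows in that zipcode) * prop, rescaled so the shares sum to 1
first_draw <- tapply( comb$prop, comb$zipcode, sum ) / sum( comb$prop )
first_draw
```

A zipcode with many rows in 'samp' can get a large share even when its 
prop is modest, which may or may not be what you want.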

2) If you want to sample x% of the observations in each zipcode, where x 
is the proportion of women in that zipcode, then this is what I would 
call stratified sampling. Try this:

    tmp <- split( samp, samp$zipcode )
    out <- list()

    for( z in names(tmp) ){
       df <- tmp[[z]]
       p  <- zcta[ zcta$zipcode == z, "prop" ]
       ## take floor(p * n) rows from this zipcode, without replacement
       n  <- floor( p * nrow(df) )
       out[[z]] <- df[ sample( seq_len(nrow(df)), n ), ]
    }
    do.call("rbind", out)
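
Since p*nrow(df) is rarely a whole number, it may be worth checking how 
many rows each zipcode would actually contribute before you sample. The 
sketch below rounds down with floor(), which is my assumption about the 
rounding you want; 'sizes', 'n_avail' and 'n_take' are made-up names, and 
the toy data is rebuilt so the block runs on its own.

```r
set.seed(12345)
zcta <- data.frame( zipcode=LETTERS[1:5], prop=runif(5) )
samp <- data.frame( id=1:20,
                    zipcode=LETTERS[ sample(1:5, 20, replace=TRUE) ] )

## rows available per zipcode in the sample
n_avail <- table( samp$zipcode )

## look up the proportion for each zipcode that occurs in 'samp'
p <- zcta$prop[ match( names(n_avail), zcta$zipcode ) ]

sizes <- data.frame( zipcode=names(n_avail),
                     n_avail=as.vector(n_avail),
                     prop=p,
                     n_take=floor( p * as.vector(n_avail) ) )
sizes
```

With only a handful of rows in a zipcode, n_take can easily come out as 0, 
so small strata may drop out of the result entirely.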

You probably need a variant of these, but if you need further help you 
will need to provide more information and, better yet, examples.
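
For instance, if your real data had one row per census block, a variant 
might look like the sketch below. Everything in it is guesswork on my 
part: the data frame 'blocks', its columns 'block' and 'prop_women', and 
'n_per_zip' are all made up for illustration, and I am assuming you want a 
fixed number of blocks per zipcode, drawn without replacement, with 
weights proportional to the block-level proportion of women.

```r
set.seed(12345)
## made-up data: 10 blocks in each of 3 zipcodes, each block with a
## made-up proportion of women
blocks <- data.frame( zipcode=rep( LETTERS[1:3], each=10 ),
                      block=1:30,
                      prop_women=runif(30) )

n_per_zip <- 3   ## would be about 30 with the real data

picked <- lapply( split( blocks, blocks$zipcode ), function(df){
   ## never ask for more blocks than the zipcode has
   n <- min( n_per_zip, nrow(df) )
   ## weighted draw without replacement within this zipcode
   df[ sample( seq_len(nrow(df)), n, prob=df$prop_women ), ]
})
picked <- do.call( "rbind", picked )
picked
```

Blocks with a higher prop_women are more likely to be picked, but every 
zipcode still contributes the same number of blocks.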

Regards, Adai

Kirsten Beyer wrote:
> I am interested in locating a script to implement a sampling scheme
> that would basically make it more likely that a particular observation
> is chosen based on a weight associated with the observation.  I am
> trying to select a sample of ~30 census blocks from each ZIP code area
> based on the proportion of women in a ZCTA living in a particular
> block.  I want to make it more likely that a block will be chosen if
> the proportion of women in a patient's age group in a particular block
> is high. Any ideas are appreciated!