[R] converting stata's by syntax to R

Thomas Lumley tlumley at u.washington.edu
Mon Aug 1 18:43:03 CEST 2005


On Mon, 1 Aug 2005, Chris Wallace wrote:

> I am struggling with migrating some stata code to R.  I have a data
> frame containing, sometimes, repeat observations (rows) of the same
> family.  I want to keep only one observation per family, selecting
> that observation according to some other variable.  An example data
> frame is:
>
> # construct example data
> fam <- c(1,2,3,3,4,4,4)
> wt <- c(1,1,0.6,0.4,0.4,0.4,0.2)
> keep <- c(1,1,1,0,1,0,0)
> dat <- as.data.frame(cbind(fam,wt,keep))
> dat
>
> I want to keep the observation for which wt is a maximum, and where
> this doesn't identify a unique observation, to keep just one anyway,
> not caring which.  Those observations are indicated above by keep==1.
> (Note, keep <- c(1,1,1,0,0,1,0) would be fine too, but not
> c(1,1,1,0,0,0,1)).
>
> The stata code I would use is
> bys fam (wt): keep if _n==_N

A reasonably direct translation of the Stata code is

   index <- order(fam, -wt)
   keep <- !duplicated(fam[index])
   dat <- data.frame(fam=fam[index], wt=wt[index], keep=keep)

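which sorts wt into decreasing order within each family, then flags the
first observation in each family (the one with the largest wt) with
keep=TRUE. Exactly one row per family is flagged, even when the maximum
is tied. Note that the rows of dat come out in sorted order, and that
all rows are still present; subset on keep to drop the others.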
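
As a minimal sketch of carrying this through to the subsetting step (which is what the Stata "keep if" does), assuming the columns are called fam and wt as in the example; the names sorted and dat1 are only illustrative and are not from the original question:

```r
# Example data from the question
dat <- data.frame(fam = c(1, 2, 3, 3, 4, 4, 4),
                  wt  = c(1, 1, 0.6, 0.4, 0.4, 0.4, 0.2))

# Sort by family, then by decreasing wt within family
index  <- order(dat$fam, -dat$wt)
sorted <- dat[index, ]

# After sorting, the first row of each family has the largest wt,
# so dropping the duplicated family ids leaves one row per family
dat1 <- sorted[!duplicated(sorted$fam), ]
dat1
```

This leaves one row per family, in family order, and carries along any other columns that dat may have.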

This is less general than solutions other people have given, but I'd
expect it to be faster for large data sets. 'keep' ends up TRUE/FALSE
rather than 1/0; if this is a problem, use as.numeric() on it.
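
If the flag is wanted as 1/0 and in the original row order rather than the sorted order, something along these lines should work. This is only a sketch; keep_sorted, keep_flag and keep01 are made-up names for illustration:

```r
fam <- c(1, 2, 3, 3, 4, 4, 4)
wt  <- c(1, 1, 0.6, 0.4, 0.4, 0.4, 0.2)

index       <- order(fam, -wt)
keep_sorted <- !duplicated(fam[index])   # flags in sorted order

# Put each flag back at the position of the row it belongs to
keep_flag        <- logical(length(fam))
keep_flag[index] <- keep_sorted

# Convert TRUE/FALSE to 1/0
keep01 <- as.numeric(keep_flag)
keep01
```

Assigning through index undoes the sort, so keep01 lines up with the rows of the original data frame and can be attached to it as a column.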

 	-thomas



