[R] Using summaryBy with weighted data

Joshua Wiley jwiley.psych at gmail.com
Mon Jan 17 08:16:15 CET 2011


Dear Solomon,

On Sun, Jan 16, 2011 at 10:27 PM, Solomon Messing
<solomon.messing at gmail.com> wrote:
> Dear Soren and R users:
>
> I am trying to use the summaryBy function with weights.  Is this possible?  An example that illustrates what I am trying to do follows:
>
> library(doBy)
> ## make up some data
> response = rnorm(100)
> group = c(rep(1,20), rep(2,20), rep(3,20), rep(4,20), rep(5,20))
> weights = runif(100, 0, 1)
> mydata = data.frame(response,group,weights)
>
> ## run summaryBy without weights:
> summaryBy(response~group, data = mydata, FUN = mean)
>
> ## attempt to run summaryBy with weights, throws error
> summaryBy(response~group, data = mydata, FUN = weighted.mean, w=weights )
>
> ## throws the error:
> # Error in tapply(lh.data[, lh.var[vv]], rh.string.factor, function(x) { :
> #                                       arguments must have same length
>
> My guess is that summaryBy is not giving weighted.mean() each group of weights, but instead is passing all of the weights in the data set each time it calls weighted.mean().

Yes, of course.  summaryBy has no way of knowing that the weights should
also be broken down by group; they are not in the formula.
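
To illustrate with base R only (a minimal sketch; the seed and the try()
wrapper are additions for illustration and not part of the original
example): tapply() hands FUN one group's responses at a time, while anything
supplied through ... is passed along whole.

```r
set.seed(1)  # assumed seed, only so the sketch is reproducible
mydata <- data.frame(response = rnorm(100),
                     group = rep(1:5, each = 20),
                     weights = runif(100, 0, 1))

## each call to FUN sees only one group's responses (20 values) ...
n_x <- tapply(mydata$response, mydata$group, length)
n_x

## ... but extra arguments are not split, so w still holds all 100 weights
## and weighted.mean() stops because x and w differ in length
res <- try(tapply(mydata$response, mydata$group, weighted.mean,
                  w = mydata$weights), silent = TRUE)
inherits(res, "try-error")
```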

>  Do you know if there is some way to get summaryBy to pass weights to weighted.mean() only for each group?

Ideally summaryBy would provide a way to pass more than one variable to
a function (e.g., response and weights), or just the entire object
(mydata) broken down by group.  Then you would only need a wrapper
function that passes the right values to the x and w arguments of
weighted.mean.  Instead, here is a somewhat hacked version:

library(doBy)
## make up some data (easier)
mydata <- data.frame(response = rnorm(100),
 group = rep(1:5, each = 20), weights = runif(100, 0, 1))

## manually compute weighted mean
tmp <- summaryBy(response*weights ~ group, data = mydata, FUN = sum)
tmp[,2] <- tmp[,2]/with(mydata, tapply(weights, group, sum))
tmp ## weighted means
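
The manual version relies on the weighted mean being the sum of
response*weights divided by the sum of the weights.  A small check of that
identity on a single made-up vector (the seed and variable names here are
assumptions for illustration only):

```r
set.seed(42)  # assumed seed, only so the check is reproducible
x <- rnorm(20)
w <- runif(20, 0, 1)

manual  <- sum(x * w) / sum(w)   # what the summaryBy/tapply ratio computes per group
builtin <- weighted.mean(x, w)   # what we actually want

all.equal(manual, builtin)
```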

## here's the 'problem', if you will: even with +, the variables are
## passed to FUN one at a time
summaryBy(response + weights ~ group, data = mydata, FUN = str)
summaryBy(mydata ~ group, data = mydata, FUN = str)

## here is an option using by():
xy <- by(mydata, mydata$group, function(z) weighted.mean(z$response, z$weights))
xy
## if you don't like the formatting....
data.frame(group = names(c(xy)), weighted.mean = c(xy))
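
The wrapper idea mentioned earlier can also be sketched with split(), which
does hand over each group's rows with all columns intact.  This is only a
sketch: weighted_mean_by() is a hypothetical helper written for this
example, not a doBy function, and the seed is an assumption.

```r
## hypothetical helper (not part of doBy): split the data frame by group,
## then give weighted.mean() both columns from each piece
weighted_mean_by <- function(data, value, weight, group) {
  pieces <- split(data, data[[group]])
  wm <- vapply(pieces,
               function(d) weighted.mean(d[[value]], d[[weight]]),
               numeric(1))
  data.frame(group = names(wm), weighted.mean = unname(wm))
}

set.seed(123)  # assumed seed, only so the sketch is reproducible
mydata <- data.frame(response = rnorm(100),
                     group = rep(1:5, each = 20),
                     weights = runif(100, 0, 1))

result <- weighted_mean_by(mydata, "response", "weights", "group")
result
```

Column names are passed as strings so the same helper can be reused for
other response or weight columns.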

HTH,

Josh

>
> I suspect this functionality would be a tremendous benefit to R users who regularly work with weighted data, such as myself.
>
> Thanks,
>
> Solomon Messing
> www.stanford.edu/~messing
>
> PS I know this basic example can be done using the lapply(split(...)) approach referenced here:
>
> http://www.mail-archive.com/r-help@stat.math.ethz.ch/msg12349.html
>
> but for more complex tasks the lapply approach will mean writing a lot of extra code to run everything and then to get things formatted as nicely as summaryBy() was designed to do.



-- 
Joshua Wiley
Ph.D. Student, Health Psychology
University of California, Los Angeles
http://www.joshuawiley.com/


