[R] Calculating SD according to groups of rows

hadley wickham h.wickham at gmail.com
Thu Nov 20 17:20:11 CET 2008


On Thu, Nov 20, 2008 at 10:04 AM, Dieter Menne
<dieter.menne at menne-biomed.de> wrote:
> hadley wickham <h.wickham <at> gmail.com> writes:
>
>> > library(plyr)
>> > dat = data.frame(SUBJECT_ID=sample(letters[1:5],100,TRUE),HR=rnorm(100))
>> > daply(dat,.(SUBJECT_ID),sd)
>> > ddply(dat,.(SUBJECT_ID),sd)
>>
>> Well that calculates sd on the whole data frame.  (Like sd(dat)).
>
> Not really, it looks like the breakdown is somehow done:
>
>> library(plyr)
>> dat = data.frame(SUBJECT_ID=sample(letters[1:5],100,TRUE),HR=rnorm(100))
>> daply(dat,.(SUBJECT_ID),sd)
>
> SUBJECT_ID SUBJECT_ID        HR
>         a         NA 1.0488930
>         b         NA 0.9110685
>         c         NA 1.0776996
>         d         NA 1.1724009
>         e         NA 0.9455105
> Warning messages:
> 1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
> ..more warnings
>
>> ddply(dat,.(SUBJECT_ID),sd)
>  SUBJECT_ID        HR
> 1         NA 1.0488930
> 2         NA 0.9110685
> 3         NA 1.0776996
> 4         NA 1.1724009
> 5         NA 0.9455105
> Warning messages:
> 1: In var(as.vector(x), na.rm = na.rm) : NAs introduced by coercion
>
> That's what I meant by "almost correct". Your suggestion works, but wouldn't it
> be a good default to make numcolwise(sd) the default with this close miss?

I have considered it, but I think it would make it harder to use plyr for
the more complicated problems where it really shines.  Being able to
work with the whole data frame, instead of just some subset of the
columns, makes it possible to do much more.  For example, because
aggregate operates on one column at a time, you can't use it to
calculate the correlation between variables.  Given a data frame you
can always operate on one column at a time, but given one column at a
time you cannot operate on the data frame as a whole.  Plyr chooses to
supply your aggregation function with the whole data frame, and then
provides functions (colwise, numcolwise, catcolwise) that make it easy
to operate column-wise.
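
A minimal sketch of that difference, assuming a current version of plyr.
The second numeric column, BP, and the result names sd_by_subject and
cor_by_subject are invented here purely for illustration; they do not
appear in the original example.

```r
library(plyr)

set.seed(1)  # make the simulated data reproducible

# Same shape as the data in the thread, plus a hypothetical BP column
dat <- data.frame(
  SUBJECT_ID = sample(letters[1:5], 100, TRUE),
  HR = rnorm(100),
  BP = rnorm(100)
)

# Column-wise summary: numcolwise(sd) applies sd to each numeric
# column of the per-subject data frame and skips SUBJECT_ID
sd_by_subject <- ddply(dat, .(SUBJECT_ID), numcolwise(sd))

# Whole-data-frame summary: the function receives every column for
# one subject, so it can relate two variables to each other
cor_by_subject <- ddply(dat, .(SUBJECT_ID), function(df) {
  data.frame(HR_BP_cor = cor(df$HR, df$BP))
})
```

The first call gives one row per subject with an sd for HR and for BP.
The second gives one correlation per subject, which a function that
only ever sees a single column could not compute.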

Hadley

-- 
http://had.co.nz/


