[R] descriptive stats by cells in factorial design

David Winsemius dwinsemius at comcast.net
Wed Aug 7 02:20:17 CEST 2013


On Aug 6, 2013, at 4:02 PM, Mike Miller wrote:

> I received two additional suggestions, one off-list, both appended below. Both helped me to learn a bit more about how to get what I want.
> 
> First, the aggregate() function is in package:stats, it provides the numbers I needed, but I don't like the output format as much as I liked the format from doBy:summaryBy().  Here it is:
> 
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, function(x) c(mean=mean(x), sd=sd(x), quantile(x), N=length(x)))
>   Generation Zygosity    Sex Cohort ESstatus    Age.mean      Age.sd      Age.0%     Age.25%     Age.50%     Age.75%    Age.100%       Age.N
> 1   Offspring       DZ Female     11       ES  17.7852830   0.3535863  16.9300000  17.6000000  17.7750000  17.9650000  18.9200000 106.0000000
> 2      Parent       DZ Female     11       ES  44.6151240   5.1246314  32.1700000  41.3400000  44.6800000  48.2800000  57.9500000 121.0000000
> 
snipped
> 23  Offspring       MZ   Male     17    notES  17.4911446   0.3961757  16.6500000  17.1775000  17.5000000  17.8100000  18.3500000 332.0000000
> 24     Parent       MZ   Male     17    notES  46.6929771   5.2421896  34.4500000  43.1500000  45.8900000  49.0050000  63.8000000 131.0000000
> 
> That's great but there are two things I didn't like:  (1) There are too many digits, especially on the integers in the last column.  I thought five digits to the right of the decimal point was more than enough, but here we have seven, even for integers.  (2) The ordering of levels within factors implied by the right side of the formula is not honored -- it looks like it used the order Cohort, ESstatus, Sex, Zygosity, Generation.  Unlike doBy::summaryBy(), it does not accept an order=T argument (that is the default in doBy::summaryBy()).
> 
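
A possible workaround for both points, sketched on the built-in ToothGrowth data rather than on your data. As far as I can tell, aggregate() stores a vector-valued summary as a single matrix column, and that matrix is formatted with one common number of decimals, which would explain the seven digits on N. Flattening with do.call(data.frame, ...) gives ordinary columns, and order() can then sort the rows however you like. The names agg, flat and sorted are only for illustration.

```r
# Sketch on the built-in ToothGrowth data; agg, flat and sorted are illustrative names.
agg <- aggregate(len ~ supp + dose, data = ToothGrowth,
                 FUN = function(v) c(mean = mean(v), sd = sd(v), N = length(v)))

# The whole summary sits in one matrix column, so agg has 3 columns, not 5.
str(agg)

# Flatten the matrix column into ordinary columns len.mean, len.sd and len.N.
flat <- do.call(data.frame, agg)

# Sort the rows by supp first, then by dose within supp.
sorted <- flat[order(flat$supp, flat$dose), ]
print(sorted, row.names = FALSE)
```

After flattening, each column is formatted on its own, so the N column prints without decimals.
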
> One thing both suggestions taught me was to use names in function definitions so that I always get correct column headings on output.  This was in the documentation for doBy::summaryBy(), but I didn't understand it when I first read it.  Using that naming concept, I created this function:
> 
> descriptivefun <- function(x, ...){c(mean=mean(x, ...), sd=sd(x, ...), quantile(x, ...), N=sum(!is.na(x)), NAs=sum(is.na(x)))}
> 
> That will allow me to feed the na.rm=T argument to the mean, sd and quantile functions.  By not naming the quantile function (e.g., not using q=quantile(x, ...)) I allow the built-in column names to be used unaltered (i.e., 0%, 25%, 50%, 75%, 100%).  I also did not use length() because it will count NA values and I want to see the sample sizes used for mean, sd and quantile.  To deal with that problem I added an element named "N" that counts the non-missing values and one named "NAs" that counts the number of NAs.  Then I do this:
> 
>> summaryBy(Age ~ Generation + Zygosity + Sex + Cohort + ESstatus, data=x, FUN=descriptivefun, na.rm=T)
>   Generation Zygosity    Sex Cohort ESstatus Age.mean    Age.sd Age.0% Age.25% Age.50% Age.75% Age.100% Age.N Age.NAs
> 1   Offspring       DZ Female     11       ES 17.78528 0.3535863  16.93 17.6000  17.775 17.9650    18.92   106       0
> 2   Offspring       DZ Female     11    notES 18.13679 0.5555968  16.76 17.8525  18.190 18.4575    19.50   162       0
> 
snipped
> 22     Parent       MZ   Male     11       ES 43.40787 5.3507439  31.28 39.9700  43.440 46.4800    64.65   197       0
> 23     Parent       MZ   Male     11    notES 41.56363 4.6564818  32.10 38.0250  41.390 44.6450    65.29   331       0
> 24     Parent       MZ   Male     17    notES 46.69298 5.2421896  34.45 43.1500  45.890 49.0050    63.80   131       0
> 
> I think that output looks very nice.  One thing that I don't understand is why my function produces %.5f output for every value but the doBy::summaryBy() function uses different formats in different columns.

Look at the code. You are attributing behavior to `summaryBy` that should be ascribed to `print.data.frame` and `format.data.frame`, which format each column separately. Your function returns a numeric vector, which is displayed by `print.default` with one common format for all elements.
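
A small illustration of the difference, using made-up numbers close to the ones in your output (v and d are hypothetical names, and this is only a sketch):

```r
# Made-up values resembling the output of descriptivefun(x$Age).
v <- c(mean = 28.49302, sd = 13.29077, N = 4434, NAs = 0)

# A numeric vector is formatted as a whole: every element gets the
# number of decimals needed by the element that needs the most.
print(v)
vec_fmt <- format(v)

# In a data frame each column is formatted on its own, so whole
# numbers such as N print without decimals.
d <- as.data.frame(as.list(v))
print(d)
df_fmt <- sapply(d, format)
```

Comparing vec_fmt with df_fmt shows the same numbers rendered two ways, with no involvement from summaryBy.
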

-- 
David.

> Compare the above output with this output:
> 
>> descriptivefun(x$Age)
>      mean         sd         0%        25%        50%        75%       100%          N        NAs
>  28.49302   13.29077   16.55000   17.65000   18.23000   42.25500   65.29000 4434.00000    0.00000
> 
> It's not a big deal, but it would be cool if I could tell doBy::summaryBy() how to format the numbers using something like format=c(rep("%.2f",7), rep("%d",2)).
> 
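
One way to get that effect is to format the result afterwards rather than inside summaryBy(). A minimal sketch, assuming a small made-up data frame res standing in for the summaryBy() result; fmt_cols() is a hypothetical helper, not part of doBy:

```r
# Made-up stand-in for a summaryBy() result.
res <- data.frame(Sex      = c("Female", "Male"),
                  Age.mean = c(17.78528, 46.69298),
                  Age.sd   = c(0.3535863, 5.2421896),
                  Age.N    = c(106, 131))

# Hypothetical helper: apply one sprintf() format per named column.
fmt_cols <- function(df, fmts) {
  for (nm in names(fmts)) {
    col <- df[[nm]]
    # Integer formats need integer input, so coerce counts first.
    if (grepl("d$", fmts[[nm]])) col <- as.integer(round(col))
    df[[nm]] <- sprintf(fmts[[nm]], col)
  }
  df
}

out <- fmt_cols(res, c(Age.mean = "%.2f", Age.sd = "%.2f", Age.N = "%d"))
print(out)
```

Note that the formatted columns are character strings, so this is best kept as a last step before printing.
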
> Mike
> 
> --
> Michael B. Miller, Ph.D.
> Minnesota Center for Twin and Family Research
> Department of Psychology
> University of Minnesota
> 
> 
> 
> On Mon, 5 Aug 2013, David Carlson wrote:
> 
>> This is a bit simpler. The function quantile() labels the output whereas fivenum() does not:
>> 
>> aggregate(Age ~ Generation + Zygosity + Sex + Cohort +
>> ESstatus, data=x,
>>   function(x) c(mean=mean(x), sd=sd(x), quantile(x)))
> 
> 
> On Mon, 5 Aug 2013, Dr. Thomas W. MacFarland wrote:
> 
>> Dear Dr. Miller:
>> 
>> Pasted below is syntax that should mostly answer your recent question to the R mailing list:
>> 
>> descriptivefun <- function(x, ...){
>> c(m=mean(x, ...), sd=sd(x, ...), l=length(x))
>> }
>> 
>> doBy::summaryBy(Final ~ Method.recode +
>> ComCol.recode,
>> data=Final.table,
>> FUN=descriptivefun,
>> na.rm=TRUE,
>> keep.names=TRUE,
>> order=TRUE)
>> 
>> I go into far more detail on this package::function and similar functions in my recent text on Twoway ANOVA,
>> http://www.springer.com/statistics/social+sciences+%26+law/book/978-1-4614-2133-7.
>> 
>> Best wishes.
>> 
>> Tom

David Winsemius
Alameda, CA, USA


