[R] Applying by() when groups have different lengths

Mon Sep 17 21:38:01 CEST 2018

Inline.

Bert

On Mon, Sep 17, 2018 at 11:54 AM Rich Shepard <rshepard using appl-ecosys.com>
wrote:

>    My dataframe has 113K rows split by a factor into 58 separate
> data.frames,
> with a different numbers of rows (see error output below).
>
>    I cannot think of a way of proving a sample of data; if a sample for a
> MWE
> is desired advice on produing one using dput() is needed.
>

This is gibberish. What does "proving a sample of data" mean? etc. Please
proofread and edit.

>
>    To summarize each group within this dataframe I'm using by() and getting
> an error because of the different number of rows:
>

> > by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })
>

You are misspecifying your function. It has argument x, but you do not use
x in your function. Also the assignment at the end is unnecessary and
probably wrong for your use case. Please go through a tutorial on how to
write functions in R.
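
To illustrate the point about x, here is a minimal sketch. The data frame
`rain` and its values are made up for illustration and only borrow the
column names 'name' and 'prcp' from the quoted call; they are not the
poster's data.

```r
# Hypothetical stand-in for the poster's data: a site name and a rainfall value.
rain <- data.frame(name = c("a", "a", "b", "b", "b"),
                   prcp = c(1, 3, 2, 4, 6))

# FUN ignores x and reaches for the whole data frame,
# so every group gets the overall mean of prcp.
ignores_x <- by(rain, rain$name, function(x) mean(rain$prcp))

# FUN uses x, the rows belonging to one group,
# so each group gets its own mean.
uses_x <- by(rain, rain$name, function(x) mean(x$prcp))

print(ignores_x)
print(uses_x)
```

Only the second call gives a per-group summary; the first repeats the same
overall mean for every level of the grouping variable.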

You are probably also misusing by(), but as you did not provide sufficient
information -- head(your_data_frame) or similar would have told us its
structure, rather than having us guess -- nor a reproducible example, it's
hard (for me) to figure out your intent. **PLEASE** follow the posting
guide and provide such information. You have been requested to do this
several times already.

Here is the sort of thing I think you wanted to do:

> set.seed(54321) ## for reproducibility
> df <- data.frame(f = sample(LETTERS[1:3], 12, rep = TRUE), y = runif(12))
> df
   f          y
1  B 0.04529991
2  B 0.65272100
3  A 0.99406601
4  A 0.67763735
5  A 0.91854517
6  C 0.46244494
7  A 0.57141480
8  A 0.45193882
9  B 0.16770701
10 B 0.06826135
11 A 0.89691069
12 C 0.27383703

> by(df, df$f, function(x)mean(x$y))
df$f: A
[1] 0.7517521
------------------------------------------------------
df$f: B
[1] 0.2334973
------------------------------------------------------
df$f: C
[1] 0.368141

Note that you do not first break up the df into separate df's, which sounds
like what you tried to do.
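
As a rough sketch of that difference, assuming a small made-up data frame
`toy` (not the data above): by() takes the unsplit data frame plus the
grouping variable, whereas a list of per-group data frames, such as
split() produces, would be summarized with sapply() or lapply() instead.

```r
# Made-up data for illustration only.
toy <- data.frame(f = c("A", "B", "A", "C", "B", "A"),
                  y = c(1, 2, 3, 4, 6, 5))

# by() works on the whole data frame and does the grouping itself.
means_from_by <- by(toy, toy$f, function(x) mean(x$y))

# If the data have already been split into a list of data frames,
# loop over the list rather than passing the list to by().
pieces <- split(toy, toy$f)
means_from_list <- sapply(pieces, function(piece) mean(piece$y))

print(means_from_by)
print(means_from_list)
```

Both approaches give the same per-group means here; the pieces may have
any number of rows.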

However, note that if all you want to do is summarize a *single* numeric
column by a factor, you do not need to use by() at all, which is designed
to work on (several columns of) the whole data frame simultaneously. For a
single column, tapply() is all you need (or, as Duncan noted, functionality
in the dplyr package).

> with(df,tapply(y,f,mean))
        A         B         C
0.7517521 0.2334973 0.3681410
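
If a data frame result is more convenient than a named vector, base R's
aggregate() is another option. This is only a sketch on a made-up data
frame `toy`; aggregate() is not mentioned elsewhere in this thread and is
offered here as an assumption about what might be useful.

```r
# Made-up data for illustration only.
toy <- data.frame(f = c("A", "B", "A", "C", "B", "A"),
                  y = c(1, 2, 3, 4, 6, 5))

# tapply() returns a named vector (1-d array) of group means.
by_tapply <- with(toy, tapply(y, f, mean))

# aggregate() with a formula returns a data frame,
# one row per group, with columns f and y.
by_aggregate <- aggregate(y ~ f, data = toy, FUN = mean)

print(by_tapply)
print(by_aggregate)
```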

Finally, if I have misunderstood your intent, my apologies. I tried.

-- Bert

> mean.rain <- by(rainfall_by_site, rainfall_by_site[, 'name'], function(x) {
> + mean.rain <- mean(rainfall_by_site[, 'prcp'])
> + })

> Error in (function (..., row.names = NULL, check.rows = FALSE, check.names
> = TRUE,  :
>    arguments imply differing number of rows: 4900, 1085, 1894, 2844, 3520,
>   647, 239, 3652, 3701, 3063, 176, 4713, 4887, 119, 165, 1221, 3358, 1457,
>   4896, 166, 690, 1110, 212, 1727, 227, 236, 1175, 1485, 186, 769, 139,
> 203,
>   2727, 4357, 1035, 1329, 1454, 973, 4536, 208, 350, 125, 3437, 731, 4894,
>   2598, 2419, 752, 427, 136, 685, 4849, 914, 171
>
>    My web searches have not found anything relevant; perhaps my search
> terms
> (such as 'R: apply by() with different factor row numbers') can be
> improved.
>
>    The help pages found using apropos('by') appear the same: ?by,
> ?by.data.frame, ?by.default and provide no hint on how to work with unequal
> rows per factor.
>
>    How can I apply by() on these data.frames?
>
> Rich
>
> ______________________________________________
> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
