[R] (Newbie) Aggregate for NA values

Adaikalavan Ramasamy ramasamy at cancer.org.uk
Fri Feb 24 17:05:07 CET 2006


I think it makes perfect sense for R to drop these rows, since 'NA' carries
no information about which group an observation belongs to. I do not know
if there is an elegant solution, but I would suggest that you recode these
'NA' values as an informative category.

Here is one possibility:

 df <- data.frame( AA=1:10, BB=rep(1:5,2), CC=rep(1:2,5), DD=rnorm(10) )
 df[ 9:10, "CC" ] <- NA

 df[is.na(df)] <- "lala"   ## recode every NA in df as the category "lala"; CC becomes character ##


 aggregate( df$DD, by=list( df$CC ), mean  )
     Group.1          x
   1       1  1.1533763
   2       2  0.6427338
   3    lala -0.2745249

 aggregate( df$DD, by=list( df$BB, df$CC ), mean  )
      Group.1 Group.2           x
   1        1       1  0.47264081
   2        2       1  0.63795211
   3        3       1  1.66756015
   4        5       1  1.83535232
   5        1       2  0.89914287
   6        2       2  1.11102134
   7        3       2  0.22268699
   8        4       2  0.33808394
   9        4    lala -0.60154608
   10       5    lala  0.05249622
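
Note that df[is.na(df)] replaces every NA in the data frame, so an NA in a
numeric column such as DD would be overwritten as well. A minimal sketch of
a narrower variant, which recodes only the grouping variable and leaves df
itself untouched, might look like this. The label "missing" and the helper
vector cc_group are my own illustrative names, and DD uses the fixed values
from the original question rather than rnorm() so the result is
reproducible:

```r
# Same layout as the example above, but with fixed DD values
# (tmp_d from the original question) instead of rnorm().
df <- data.frame(AA = 1:10, BB = rep(1:5, 2), CC = rep(1:2, 5),
                 DD = c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4))
df[9:10, "CC"] <- NA

# Recode NA in the grouping variable only; "missing" is an arbitrary label.
cc_group <- ifelse(is.na(df$CC), "missing", as.character(df$CC))

# One row per CC group, including the recoded NA group.
res_one <- aggregate(df$DD, by = list(CC = cc_group), FUN = mean)

# One row per observed (BB, CC) combination, including the NA groups.
res_two <- aggregate(df$DD, by = list(BB = df$BB, CC = cc_group), FUN = mean)

print(res_one)
print(res_two)
```

Because only the by= vector is changed, the NA values stay in df$CC and
remain available for is.na() checks later on.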

Regards, Adai



On Fri, 2006-02-24 at 10:16 -0500, Vivek Satsangi wrote:
> Folks,
> 
> Sorry if this question has been answered before or is obvious (or
> worse, statistically "bad"). I don't understand what was said in one
> of the search results that seems somewhat related.
> 
> I use aggregate to get a quick summary of the data. Part of what I am
> looking for in the summary is how much influence the NA's might have
> had if they were included, and whether excluding them from the means
> causes some sort of bias. So I want the summary statistic for the NA's
> as well.
> 
> Here is a simple example session (edited to remove the typos I made,
> comments added later):
> 
> > tmp_a <- 1:10
> > tmp_b <- rep(1:5,2)
> > tmp_c <- rep(1:2,5)
> > tmp_d <- c(1,1,1,2,2,2,3,3,3,4)
> > tmp_df <- data.frame(tmp_a,tmp_b,tmp_c,tmp_d);
> > tmp_df$tmp_c[9:10] <- NA ;
> > tmp_df
>    tmp_a tmp_b tmp_c tmp_d
> 1      1     1     1     1
> 2      2     2     2     1
> 3      3     3     1     1
> 4      4     4     2     2
> 5      5     5     1     2
> 6      6     1     2     2
> 7      7     2     1     3
> 8      8     3     2     3
> 9      9     4    NA     3
> 10    10     5    NA     4
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_b,tmp_df$tmp_c),mean);
>   Group.1 Group.2 x
> 1       1       1 1
> 2       2       1 3
> 3       3       1 1
> 4       5       1 2
> 5       1       2 2
> 6       2       2 1
> 7       3       2 3
> 8       4       2 2
> # Only one row for each (tmp_b, tmp_c) combination, NA's getting dropped.
> 
> > aggregate(tmp_df$tmp_d,by=list(tmp_df$tmp_c),mean);
>   Group.1    x
> 1       1 1.75
> 2       2 2.00
> 
> What I want in this last aggregate is a mean for the values in tmp_d
> that correspond to the tmp_c values of NA. Similarly, perhaps there is
> a way to make the second-to-last call to aggregate also return the
> values of tmp_d for the NA values of tmp_c.
> 
> How can I achieve this?
> 
> --
> -- Vivek Satsangi
> Student, Rochester, NY USA
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>



