[R] aggregate function - na.action

jim holtman jholtman at gmail.com
Sun Feb 6 23:42:58 CET 2011


Try the 'data.table' package.  It took about 3 seconds to aggregate the
2 million rows into roughly 570K groups.  Is this what you were after?

> # note: the character columns are converted to factors, which 'data.table' likes
> dat=data.frame(
+        x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
+        x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
+        x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
+        x4=sample(c(NA,T,F), 2e6, replace=TRUE),
+        x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
+ replace=TRUE),
+        x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
+        x7=sample(c(NA,'married','divorced','separated','single','etc'),
+ 2e6, replace=TRUE),
+        x8=sample(c(NA,T,F), 2e6, replace=TRUE),
+        y=trunc(rnorm(2e6)*10000))
> str(dat)
'data.frame':   2000000 obs. of  9 variables:
 $ x1: Factor w/ 2 levels "f","m": NA NA 2 NA NA NA NA 1 1 1 ...
 $ x2: int  4 5 3 10 10 7 1 1 3 5 ...
 $ x3: Factor w/ 5 levels "a","b","c","d",..: 3 2 1 2 1 5 1 1 2 1 ...
 $ x4: logi  NA TRUE TRUE NA FALSE NA ...
 $ x5: Factor w/ 4 levels "active","deleted",..: 4 3 3 2 2 1 1 NA 3 3 ...
 $ x6: int  NA 2 7 2 1 9 NA 1 1 9 ...
 $ x7: Factor w/ 5 levels "divorced","etc",..: 1 3 5 NA 2 3 1 2 2 2 ...
 $ x8: logi  NA NA NA FALSE FALSE FALSE ...
 $ y : num  3066 -13237 -7840 9728 1596 ...
> require(data.table)
> dat <- data.table(dat)
> system.time(result <- dat[, sum(y), by = list(x1,x2,x3,x4,x5,x6,x7,x8)])
   user  system elapsed
   3.11    0.16    3.26
> str(result)
Classes ‘data.table’ and 'data.frame':  568594 obs. of  9 variables:
 $ x1: Factor w/ 2 levels "f","m": NA NA NA NA NA NA NA NA NA NA ...
 $ x2: int  NA NA NA NA NA NA NA NA NA NA ...
 $ x3: Factor w/ 5 levels "a","b","c","d",..: NA NA NA NA NA NA NA NA NA NA ...
 $ x4: logi  NA NA NA NA NA NA ...
 $ x5: Factor w/ 4 levels "active","deleted",..: NA NA NA NA NA NA NA NA NA NA ...
 $ x6: int  NA NA NA NA NA NA NA NA NA NA ...
 $ x7: Factor w/ 5 levels "divorced","etc",..: NA NA NA 1 1 1 2 2 2 3 ...
 $ x8: logi  NA FALSE TRUE NA FALSE TRUE ...
 $ V1: num  6641 -18158 3 -11202 -14437 ...
>
>
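
For anyone who wants to check the NA handling on something smaller first,
here is a minimal sketch.  The six-row table and the column name y.sum are
made up for illustration; only the names x1, x2 and y come from the example
above.  It assumes that data.table keeps NA as a group of its own, which is
what the output above suggests, and it compares the grouped total with the
control total.

```r
library(data.table)

# Toy data: invented values, same column names as the example above.
dt <- data.table(
  x1 = c("m", "f", NA, "m", NA, "f"),
  x2 = c(1L, 2L, NA, 1L, NA, 2L),
  y  = c(10, 20, 30, 40, 50, 60)
)

# Group by x1 and x2.  Rows whose keys are NA form their own group
# instead of being dropped.  Naming the summary (y.sum, a hypothetical
# name) avoids the default column name V1.
result <- dt[, list(y.sum = sum(y, na.rm = TRUE)), by = list(x1, x2)]
print(result)

# Control total: the grouped sums should add up to the overall sum.
control <- sum(dt$y, na.rm = TRUE)
stopifnot(isTRUE(all.equal(sum(result$y.sum), control)))
```

If the two totals disagree on the real data, some rows were lost in the
grouping step.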


On Sun, Feb 6, 2011 at 3:54 PM, Gene Leynes <gleynes+r at gmail.com> wrote:
> On Fri, Feb 4, 2011 at 6:54 PM, Ista Zahn <izahn at psych.rochester.edu> wrote:
>
>> >
>> > However, I don't think you've told us what you're actually trying to
>> > accomplish...
>> >
>>
>
> I'm trying to aggregate the y value of a big data set which has several x's
> and a y.
> I'm using an abstracted example for many reasons.  Partially, I'm using an
> abstracted example to comply with the posting guidelines of having a
> reproducible example.  I'm really aggregating some incredibly boring and
> complex customer data for an undisclosed client.
>
> As it turns out, aggregate() will not work when some of the x's are NA,
> unless you convert them to factors with NA included as a level.
>
> In my case, the data is so big that doing the conversions causes other
> memory problems, and renders some of my numeric values useless for other
> calculations.
>
> My real data looks more like this (except with a few more categories and
> records):
>
> set.seed(100)
> library(plyr)
> dat=data.frame(
>        x1=sample(c(NA,'m','f'), 2e6, replace=TRUE),
>        x2=sample(c(NA, 1:10), 2e6, replace=TRUE),
>        x3=sample(c(NA,letters[1:5]), 2e6, replace=TRUE),
>        x4=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
>        x5=sample(c(NA,'active','inactive','deleted','resumed'), 2e6,
>                  replace=TRUE),
>        x6=sample(c(NA, 1:10), 2e6, replace=TRUE),
>        x7=sample(c(NA,'married','divorced','separated','single','etc'),
>                  2e6, replace=TRUE),
>        x8=sample(c(NA,TRUE,FALSE), 2e6, replace=TRUE),
>        y=trunc(rnorm(2e6)*10000), stringsAsFactors=FALSE)
> str(dat)
> ## The control total
> sum(dat$y, na.rm=TRUE)
> ## The aggregate total
> sum(aggregate(dat$y, dat[,1:8], sum, na.rm=TRUE)$x)
> ## The ddply total
> sum(ddply(dat, .(x1,x2,x3,x4,x5,x6,x7,x8), function(x)
>        {data.frame(y.sum=sum(x$y,na.rm=TRUE))})$y.sum)
>
> ddply worked a little better than I expected at first, but it slows to a
> crawl or runs out of memory too often for me to invest the time in learning
> how to use it.  Just now it worked for 1m records, and it was just a bit
> slower than aggregate.  But for the 2m example it hasn't finished
> calculating.
>
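
The aggregate() behaviour described in the quoted message can be reproduced
on a toy data frame.  This is only a sketch: the six rows are invented, and
addNA() is one way (an assumption on my part, not necessarily what was used
on the real data) to turn NA into an explicit factor level so that
aggregate() keeps those rows.

```r
# Toy data: invented values, with NA in both grouping columns.
df <- data.frame(
  x1 = c("m", "f", NA, "m", NA, "f"),
  x2 = c(1L, 2L, NA, 1L, NA, 2L),
  y  = c(10, 20, 30, 40, 50, 60),
  stringsAsFactors = FALSE
)

# Control total over all rows.
control <- sum(df$y)

# Plain aggregate(): rows with NA in any grouping column are omitted,
# so the aggregated total falls short of the control total.
plain <- aggregate(df$y, df[, c("x1", "x2")], sum)
plain_total <- sum(plain$x)

# Workaround: addNA() makes NA an explicit factor level, so those rows
# are grouped like any others and the totals match again.
keys <- lapply(df[, c("x1", "x2")], addNA)
fixed <- aggregate(df$y, keys, sum)
fixed_total <- sum(fixed$x)

print(c(control = control, plain = plain_total, fixed = fixed_total))
```

The cost is the one mentioned in the quoted message: the grouping columns
become factors, which takes extra memory on a large data set and is awkward
for columns that are needed as numbers elsewhere.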



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


