[R] Fast Normalize by Group

jim holtman jholtman at gmail.com
Thu Nov 29 20:13:17 CET 2012


Try the 'data.table' package.  It takes about 0.1 seconds to normalize the data.

> x <- data.frame(id = sample(10000, 100000, TRUE), value = runif(100000))
> require(data.table)
Loading required package: data.table
data.table 1.8.2  For help type: help("data.table")
> system.time({
+     x <- data.table(x)
+     newX <- x[
+         , list(value = value  # keep original value
+             , normValue = value / sum(value)
+             )
+         , by = id
+         ]
+ })
   user  system elapsed
   0.03    0.01    0.11
>
> head(newX, 20)
      id     value   normValue
 1: 8094 0.6805425 0.101140797
 2: 8094 0.3154233 0.046877543
 3: 8094 0.8998646 0.133735993
 4: 8094 0.8858863 0.131658564
 5: 8094 0.1859526 0.027635892
 6: 8094 0.4694456 0.069768023
 7: 8094 0.9302886 0.138257544
 8: 8094 0.7482040 0.111196505
 9: 8094 0.9052426 0.134535255
10: 8094 0.4650028 0.069107739
11: 8094 0.2428116 0.036086145
12: 6287 0.1979209 0.037505820
13: 6287 0.5117723 0.096980353
14: 6287 0.6425769 0.121767688
15: 6287 0.0397795 0.007538177
16: 6287 0.1255722 0.023795811
17: 6287 0.5606742 0.106247214
18: 6287 0.4818579 0.091311594
19: 6287 0.3913614 0.074162596
20: 6287 0.4622984 0.087605098
>
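
For comparison, a minimal base R sketch of the same per-group normalization, using ave() instead of data.table.  This is an alternative technique, not the code timed above, and it is not benchmarked here.  The set.seed() call and the groupSums check are additions for illustration only.

```r
# Sketch: per-group normalization in base R with ave()
# (an alternative to the data.table approach, not the code timed above).
set.seed(1)  # arbitrary seed, only so the example is reproducible
x <- data.frame(id = sample(10000, 100000, TRUE), value = runif(100000))

# ave() applies FUN within each level of the grouping variable and
# returns a vector aligned with the original rows.
x$normValue <- ave(x$value, x$id, FUN = function(v) v / sum(v))

# Sanity check: normalized values should sum to 1 within every group.
groupSums <- tapply(x$normValue, x$id, sum)
stopifnot(isTRUE(all.equal(as.vector(groupSums), rep(1, length(groupSums)))))
```

Because ave() keeps the original row order, the result can be assigned straight back as a new column, with no merge step.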


On Thu, Nov 29, 2012 at 1:55 PM, Noah Silverman <noahsilverman at ucla.edu> wrote:
> Hi,
>
> I have a very large data set (approx. 100,000 rows).
>
> The data comes from around 10,000 "groups" with about 10 entries per group.
>
> The values are in one column, the group ID is an integer in the second column.
>
> I want to normalize the values by group:
>
> for (g in unique(group)) {
>         x[group == g] <- x[group == g] / sum(x[group == g])
> }
>
> This works fine in a loop, but is slow.  Is there a faster way to do this?
>
> Thanks!



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?
Tell me what you want to do, not how you want to do it.



