[R] Unexpected behaviour of plyr::ddply

walmes . walmeszeviani at gmail.com
Wed Sep 17 05:36:24 CEST 2014


Hello R users,

I'm writing a brief tutorial of getting statistical measures by splitting
according strata and over columns. When I used plyr::ddply I got and
unexpected result, with NA/NaN for non existing cells. Below is a minimal
reproducible code with the result that I got. For comparison, the result of
aggregate is showed. Why this behaviour? What I can do to avoid it?

> require(plyr)
>
> hab <-
+     read.table("http://www.leg.ufpr.br/~walmes/data/ipea_habitacao.csv",
+                header=TRUE, sep=",", stringsAsFactors=FALSE, quote="",
+                encoding="utf-8")
>
> hab <- hab[,-ncol(hab)]
> names(hab) <- c("sig", "cod", "mun", "agua", "ener", "tel", "carro",
+                 "comp", "tot")
> hab <- transform(hab, sig=factor(sig))
> hab$siz <- cut(hab$tot, breaks=c(-Inf, 5000, Inf),
+                labels=c("P","G"))
> str(hab, ve.len=1)
'data.frame':    5596 obs. of  10 variables:
 $ sig  : Factor w/ 27 levels "AC","AL","AM",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ cod  : int  1200013 1200054 1200104 1200138 1200179 1200203 1200252
1200302 1200328 1200336 ...
 $ mun  : chr  "Acrelândia" "Assis Brasil" "Brasiléia" "Bujari" ...
 $ agua : num  21.5 27.4 26.9 17.3 13.1 ...
 $ ener : num  56.2 65.3 55.9 43.9 35.9 ...
 $ tel  : num  8.85 26.71 22.73 12.28 9.19 ...
 $ carro: num  9.3 6.03 7.47 6.49 5.73 ...
 $ comp : num  0.947 1.637 1.857 0.127 0.088 ...
 $ tot  : int  1878 807 4114 1365 1267 14368 2807 5268 740 2308 ...
 $ siz  : Factor w/ 2 levels "P","G": 1 1 1 1 1 2 1 2 1 1 ...
>
> xtabs(~siz+sig, hab)
   sig
siz  AC  AL  AM  AP  BA  CE  DF  ES  GO  MA  MG  MS  MT  PA  PB  PE  PI
PR  RJ
  P  17  72  47  13 266 103   0  43 187 152 671  52 100  80 195  97 199
310  27
  G   5  29  15   3 149  81   1  34  55  65 182  25  26  63  28  88  22
89  64
   sig
siz  RN  RO  RR  RS  SC  SE  SP  TO
  P 147  34  14 355 236  56 396 129
  G  19  18   1 112  57  19 249  10
>
> x <- ddply(hab, ~sig+siz,
+            colwise(.fun=mean,
+                    .cols=~agua+ener+tel+carro+comp, na.rm=TRUE))
> head(x)
  sig  siz     agua     ener       tel    carro      comp
1  AC    P 19.30229 51.30535 12.857118 5.395824 0.7028235
2  AC    G 28.39300 65.95740 26.322800 8.942000 2.3806000
3  AL    P 42.14337 81.20935  4.801500 7.069958 0.5332639
4  AL    G 54.03966 87.47834  9.771724 9.428586 1.3583793
5  *AL <NA>      NaN      NaN       NaN      NaN       NaN*
6  AM    P 25.66202 61.12709  5.749596 1.980362 0.7629362
>
> x <- ddply(hab, ~sig+siz,
+            colwise(.fun=sum,
+                    .cols=~agua+ener+tel+carro+comp, na.rm=TRUE))
> head(x)
  sig  siz     agua     ener     tel   carro   comp
1  AC    P  328.139  872.191 218.571  91.729 11.948
2  AC    G  141.965  329.787 131.614  44.710 11.903
3  AL    P 3034.323 5847.073 345.708 509.037 38.395
4  AL    G 1567.150 2536.872 283.380 273.429 39.393
5  *AL <NA>    0.000    0.000   0.000   0.000  0.000*
6  AM    P 1206.115 2872.973 270.231  93.077 35.858
>
> y <- aggregate(as.matrix(hab[,4:8])~sig+siz, data=hab, FUN=mean,
+                na.rm=TRUE)
> head(y)
  sig siz     agua     ener       tel     carro      comp
1  AC   P 19.30229 51.30535 12.857118  5.395824 0.7028235
2  AL   P 42.14337 81.20935  4.801500  7.069958 0.5332639
3  AM   P 25.66202 61.12709  5.749596  1.980362 0.7629362
4  AP   P 37.20362 82.04515 14.929154  5.775923 0.7922308
5  BA   P 41.23200 66.45516  4.524045 10.387203 0.8404624
6  CE   P 36.78599 78.20176  6.339990  7.768981 0.6446990
>
> sessionInfo()
R version 3.1.1 (2014-07-10)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=pt_BR.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=pt_BR.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=pt_BR.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_BR.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] plyr_1.8.1

loaded via a namespace (and not attached):
[1] compiler_3.1.1 Rcpp_0.11.2    tools_3.1.1
>

Thanks in advance.
Walmes.

==========================================================================
Walmes Marques Zeviani
LEG (Laboratório de Estatística e Geoinformação, 25.450418 S, 49.231759 W)
Departamento de Estatística - Universidade Federal do Paraná
fone: (+55) 41 3361 3573
skype: walmeszeviani
homepage: http://www.leg.ufpr.br/~walmes
linux user number: 531218
==========================================================================

	[[alternative HTML version deleted]]



More information about the R-help mailing list