[R] Problem with ddply in the plyr-package: surprising output of a date-column

William Dunlap wdunlap at tibco.com
Mon Apr 25 20:55:06 CEST 2011



Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com  

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Brian Diggs
> Sent: Monday, April 25, 2011 11:05 AM
> To: christoph.jaeckel at wi.tum.de
> Cc: r-help at r-project.org
> Subject: Re: [R] Problem with ddply in the plyr-package: 
> surprising output of a date-column
> 
> On 4/25/2011 10:19 AM, Christoph Jäckel wrote:
> > Hi all,
> >
> > I have a problem with the plyr package - more precisely with the ddply
> > function - and would be very grateful for any help. I hope the example
> > here is precise enough for someone to identify the problem. Basically,
> > in this step I want to identify observations that are identical in
> > terms of certain identifiers (ID1, ID2, ID3) and just want to save
> > those observations (in this step, without deleting any rows or
> > manipulating any data) in a separate data.frame. However, I get the
> > warning message below and the column with dates is messed up.
> > Interestingly, the value column (the type is factor here, but if you
> > change that with as.integer it doesn't make any difference) is handled
> > correctly. Any idea what I am doing wrong?
> >
> > df<- data.frame(cbind(ID1=c(1,2,2,3,3,4,4),
> >                   ID2=c('a','b','b','c','d','e','e'),
> >                   ID3=c("v1","v1","v1","v1","v2","v1","v1"),
> >                   Date=c("1985-05-1","1985-05-2","1985-05-3","1985-05-4",
> >                          "1985-05-5","1985-05-6","1985-05-7"),
> >                   Value=c(1,2,3,4,5,6,7)))
> > df[,1]<- as.character(df[,1])
> > df[,2]<- as.character(df[,2])
> > df$Date<- strptime(df$Date,"%Y-%m-%d")
> >
> > #Apparently two groups of observations share the same IDs: ID1=2 and ID1=4
> > ddply(df,.(ID1,ID2,ID3),nrow)
> > #I want to save those IDs in a separate data.frame, so the desired output is:
> > df[c(2:3,6:7),]
> >
> > #My idea: Write a custom function that only returns observations with
> > #multiple rows.
> > #Seems to work except that the Date column doesn't make any sense anymore
> > #Warning message: In output[[var]][rng]<- df[[var]]: number of items
> > #to replace is not a multiple of replacement length
> > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
> >
> > #Notice that it works perfectly if I only have one observation with
> > #multiple rows
> > ddply(df[1:6,],.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
> 
> Works for me:
> 
>  > df[c(2:3,6:7),]
>    ID1 ID2 ID3      Date Value
> 2   2   b  v1 1985-05-2     2
> 3   2   b  v1 1985-05-3     3
> 6   4   e  v1 1985-05-6     6
> 7   4   e  v1 1985-05-7     7
>  > ddply(df,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
>    ID1 ID2 ID3      Date Value
> 1   2   b  v1 1985-05-2     2
> 2   2   b  v1 1985-05-3     3
> 3   4   e  v1 1985-05-6     6
> 4   4   e  v1 1985-05-7     7
> [ ... version info elided ... ] 
> A couple of things: there was just an update of plyr to 1.5.2; maybe
> that fixes what you are seeing?  Also, your df consists of only factors.
> cbind-ing the data before turning it into a data.frame makes it a
> character matrix which gets converted to factors.
> 
>  > str(df)
> 'data.frame':   7 obs. of  5 variables:
>   $ ID1  : Factor w/ 4 levels "1","2","3","4": 1 2 2 3 3 4 4
>   $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
>   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
>   $ Date : Factor w/ 7 levels "1985-05-1","1985-05-2",..: 1 2 3 4 5 6 7
>   $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
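
The cbind() effect described in the quoted text can be seen in a
small sketch. It uses a made-up two-row data set (not the OP's data)
and only base R. Whether the resulting character columns then become
factors depends on the stringsAsFactors default, which was TRUE before
R 4.0.0 and is FALSE since.

```r
# cbind() on mixed numeric/character vectors yields a character matrix,
# so the numeric ID1 values are silently turned into strings.
m <- cbind(ID1 = c(1, 2), ID2 = c("a", "b"))
typeof(m)

# Building the data.frame from that matrix: no column is numeric any more.
via_matrix <- data.frame(m)

# Passing the vectors directly keeps each column's own type.
direct <- data.frame(ID1 = c(1, 2), ID2 = c("a", "b"))

sapply(via_matrix, class)
sapply(direct, class)
```

Dropping the cbind() call and passing the vectors straight to
data.frame() avoids the conversion.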

The OP's data.frame contained a POSIXlt (not factor) object
in the "Date" column, because the strptime() call replaced it:
  > str(df)
  'data.frame':   7 obs. of  5 variables:
   $ ID1  : chr  "1" "2" "2" "3" ...
   $ ID2  : chr  "a" "b" "b" "c" ...
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
   $ Date : POSIXlt, format: "1985-05-01" "1985-05-02" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 1 2 3 4 5 6 7
and apparently plyr's equivalent of rbind doesn't support that class.
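
For readers unsure how POSIXlt differs from the other date-time
classes, here is a minimal base-R sketch. The variable names lt and ct
are only illustrative. It shows that a POSIXlt object is a list of
broken-down components underneath, while POSIXct is a plain numeric
vector. That this difference is what trips up plyr is my guess, not
something I have checked in the plyr source.

```r
# strptime() returns POSIXlt; as.POSIXct() converts it to POSIXct.
lt <- strptime(c("1985-05-01", "1985-05-02"), "%Y-%m-%d", tz = "UTC")
ct <- as.POSIXct(lt)

class(lt)                 # "POSIXlt" "POSIXt"
is.list(unclass(lt))      # TRUE: a list of components (sec, min, hour, mday, ...)
names(unclass(lt))

class(ct)                 # "POSIXct" "POSIXt"
is.numeric(unclass(ct))   # TRUE: one number (seconds since 1970) per element
length(unclass(ct))       # 2
```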

If you want to continue using POSIXlt objects you can get your
immediate result without ddply; subscripting will do the job:
  > nDups <- with(df, ave(rep(0,nrow(df)), ID1, ID2, ID3, FUN=length))
  > print(nDups)
  [1] 1 2 2 1 1 2 2
  > df[nDups>1, ]
    ID1 ID2 ID3       Date Value
  2   2   b  v1 1985-05-02     2
  3   2   b  v1 1985-05-03     3
  6   4   e  v1 1985-05-06     6
  7   4   e  v1 1985-05-07     7
  > str(.Last.value)
  'data.frame':   4 obs. of  5 variables:
   $ ID1  : chr  "2" "2" "4" "4"
   $ ID2  : chr  "b" "b" "e" "e"
   $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1
   $ Date : POSIXlt, format: "1985-05-02" "1985-05-03" ...
   $ Value: Factor w/ 7 levels "1","2","3","4",..: 2 3 6 7
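
Another base-R route to the same rows, sketched here as an alternative
to ave(): duplicated() applied from both ends flags every row whose ID
combination occurs more than once. The data frame below rebuilds the
OP's example without cbind(); the names ids, in_dup_group and dups are
only illustrative.

```r
# Rebuild the example data, keeping each column's own type.
df <- data.frame(
  ID1   = c(1, 2, 2, 3, 3, 4, 4),
  ID2   = c("a", "b", "b", "c", "d", "e", "e"),
  ID3   = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
  Value = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)
df$Date <- strptime(c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                      "1985-05-5", "1985-05-6", "1985-05-7"),
                    "%Y-%m-%d", tz = "UTC")

# duplicated() marks second and later occurrences; with fromLast = TRUE it
# marks first and earlier ones. Their union marks every member of a group
# that has more than one row.
ids <- df[, c("ID1", "ID2", "ID3")]
in_dup_group <- duplicated(ids) | duplicated(ids, fromLast = TRUE)

dups <- df[in_dup_group, ]
print(dups)
```

Like the ave() version, this is plain subscripting, so the Date
column keeps its POSIXlt class.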

If you need plyr for other tasks you ought to use a different
class for your date data (or wait until plyr can deal with
POSIXlt objects).
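
A minimal sketch of the "different class" route, assuming the Date
class is acceptable for your data (as.POSIXct() would be the analogous
choice if you need times of day). The helper name keep_dups is made up
for this example, and the ddply call only runs if plyr is installed.
Brian's Date-based version quoted below is reported to work with ddply.

```r
# Rebuild the example data, keeping each column's own type.
df <- data.frame(
  ID1   = c(1, 2, 2, 3, 3, 4, 4),
  ID2   = c("a", "b", "b", "c", "d", "e", "e"),
  ID3   = c("v1", "v1", "v1", "v1", "v2", "v1", "v1"),
  Value = c(1, 2, 3, 4, 5, 6, 7),
  stringsAsFactors = FALSE
)

# strptime() gives POSIXlt; convert it to the atomic Date class before
# storing it in the data frame.
date_lt <- strptime(c("1985-05-1", "1985-05-2", "1985-05-3", "1985-05-4",
                      "1985-05-5", "1985-05-6", "1985-05-7"),
                    "%Y-%m-%d", tz = "UTC")
df$Date <- as.Date(date_lt)
str(df)

# Hypothetical helper: return a group only if it has more than one row.
keep_dups <- function(d) if (nrow(d) <= 1) NULL else d

# Only attempted when plyr is available.
if (requireNamespace("plyr", quietly = TRUE)) {
  print(plyr::ddply(df, c("ID1", "ID2", "ID3"), keep_dups))
}
```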


> 
> Maybe that has something to do with the odd "dates" since they are not
> really dates at all, just string representations of factor levels.
> Compare with:
> 
> DF <- data.frame(ID1=c(1,2,2,3,3,4,4),
> 	ID2=c('a','b','b','c','d','e','e'),
> 	ID3=c("v1","v1","v1","v1","v2","v1","v1"),
> 	Date=as.Date(c("1985-05-1","1985-05-2","1985-05-3",
> 		"1985-05-4","1985-05-5","1985-05-6","1985-05-7")),
> 	Value=c(1,2,3,4,5,6,7))
> str(DF)
> #'data.frame':   7 obs. of  5 variables:
> # $ ID1  : num  1 2 2 3 3 4 4
> # $ ID2  : Factor w/ 5 levels "a","b","c","d",..: 1 2 2 3 4 5 5
> # $ ID3  : Factor w/ 2 levels "v1","v2": 1 1 1 1 2 1 1
> # $ Date : Date, format: "1985-05-01" "1985-05-02" ...
> # $ Value: num  1 2 3 4 5 6 7
> 
> This version also works for me.
> 
> ddply(DF,.(ID1,ID2,ID3),function(df) if(nrow(df)<=1){NULL}else{df})
> #  ID1 ID2 ID3       Date Value
> #1   2   b  v1 1985-05-02     2
> #2   2   b  v1 1985-05-03     3
> #3   4   e  v1 1985-05-06     6
> #4   4   e  v1 1985-05-07     7
> 
> > Thanks in advance,
> >
> > Christoph
> >
> > ----------------------------------------
> >
> > Christoph Jäckel (Dipl.-Kfm.)
> >
> > ----------------------------------------
> >
> > Research Assistant
> >
> > Chair for Financial Management and Capital Markets | Lehrstuhls für
> > Finanzmanagement und Kapitalmärkte
> >
> > TUM School of Management | Technische Universität München
> >
> > Arcisstr. 21 | D-80333 München | Germany
> >
> 
> 
> -- 
> Brian S. Diggs, PhD
> Senior Research Associate, Department of Surgery
> Oregon Health & Science University
> 


