[R] dplyr/summarize does not create a true data frame

Hadley Wickham h.wickham at gmail.com
Sun Nov 23 18:34:38 CET 2014


This bug is fixed in the development version of dplyr.
Hadley
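
For anyone on a release that still shows the error, here is a minimal sketch of the as.data.frame() workaround described in the message quoted below. The small data frame and the variable names are made up for illustration; only group_by() and as.data.frame() come from the thread.

```r
library(dplyr)

# Small illustrative data frame (hypothetical values, not the poster's data)
frm <- data.frame(
  Sex    = c("Male", "Female", "Female", "Male"),
  Height = c("Short", "Short", "Tall", "Medium"),
  Value  = c(69.47, 64.61, 74.77, 73.31)
)

# group_by() layers extra classes on top of "data.frame"; rows keep their order
grouped <- frm %>% group_by(Sex, Height)
is_grouped <- inherits(grouped, "grouped_df")

# as.data.frame() strips the grouping classes, leaving a plain data frame
plain <- as.data.frame(grouped)

# Single-column extraction by position on the plain data frame
value_col <- plain[3]
```

Converting back is only needed where code expects a plain data frame; the grouping information is lost in the conversion.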

On Sunday, November 23, 2014, John Posner <john.posner at mjbiostat.com> wrote:

> Thanks to John Kane for an off-list consultation. As the following
> annotated transcript shows, it's the group_by() function that transforms a
> data frame into something else:  a "grouped_df" object that *looks*
> identical to the original data frame (e.g. the rows are in the original
> order -- *not* grouped, as arrange() would do), but does not always act
> like a data frame.
>
> > library(dplyr)
>
> > # set up data frame, and show its structure [ see below for clean copy of dput() code ]
> >
> > frm = structure(list(Id = structure(1:10, .Label = c("P01", "P02",
> + "P03", "P04", "P05", "P06", "P07", "P08", "P09", "P10"), class = "factor"),
> +     Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L), .Label = c("Female",
> +     "Male"), class = "factor"), Height = structure(c(1L, 1L,
> +     3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L), .Label = c("Short", "Medium",
> +     "Tall"), class = "factor"), Value = c(69.47, 64.61, 74.77,
> +     73.31, 64.76, 72.78, 64.64, 55.96, 60.45, 51.11)), .Names = c("Id",
> + "Sex", "Height", "Value"), row.names = c(NA, -10L), class = "data.frame")
> >
> > str(frm)
> 'data.frame':   10 obs. of  4 variables:
>  $ Id    : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
>  $ Sex   : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
>  $ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
>  $ Value : num  69.5 64.6 74.8 73.3 64.8 ...
>
> > # run group_by() on data frame, and show resulting structure
> >
> > after.group_by = frm %>% group_by(Sex, Height)
>
> > str(after.group_by)
> Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 10 obs. of  4 variables:
>  $ Id    : Factor w/ 10 levels "P01","P02","P03",..: 1 2 3 4 5 6 7 8 9 10
>  $ Sex   : Factor w/ 2 levels "Female","Male": 2 1 1 2 2 2 1 2 2 1
>  $ Height: Factor w/ 3 levels "Short","Medium",..: 1 1 3 2 1 3 1 2 1 1
>  $ Value : num  69.5 64.6 74.8 73.3 64.8 ...
>  - attr(*, "vars")=List of 2
>   ..$ : symbol Sex
>   ..$ : symbol Height
>  - attr(*, "drop")= logi TRUE
>  - attr(*, "indices")=List of 5
>   ..$ : int  1 6 9
>   ..$ : int 2
>   ..$ : int  0 4 8
>   ..$ : int  3 7
>   ..$ : int 5
>  - attr(*, "group_sizes")= int  3 1 3 2 1
>  - attr(*, "biggest_group_size")= int 3
>  - attr(*, "labels")='data.frame':      5 obs. of  2 variables:
>   ..$ Sex   : Factor w/ 2 levels "Female","Male": 1 1 2 2 2
>   ..$ Height: Factor w/ 3 levels "Short","Medium",..: 1 3 1 2 3
>   ..- attr(*, "vars")=List of 2
>   .. ..$ : symbol Sex
>   .. ..$ : symbol Height
>
> > # the two data structures *seem* to be the same ...
>
> > frm == after.group_by
>         Id  Sex Height Value
>  [1,] TRUE TRUE   TRUE  TRUE
>  [2,] TRUE TRUE   TRUE  TRUE
>  [3,] TRUE TRUE   TRUE  TRUE
>    ...etc.
>
> > # ... but they're not
>
> > frm[4]
>    Value
> 1  69.47
> 2  64.61
>    ...etc.
>
> > after.group_by[4]
> Error in eval(expr, envir, enclos) : index out of bounds
>
> > # fortunately, we can convert back to a true data frame
>
> > as.data.frame(after.group_by)[4]
>    Value
> 1  69.47
> 2  64.61
>    ...etc.
>
> ################################## dput() code below
>
> structure(list(
>     Id = structure(1:10, .Label = c("P01", "P02", "P03", "P04", "P05",
>         "P06", "P07", "P08", "P09", "P10"), class = "factor"),
>     Sex = structure(c(2L, 1L, 1L, 2L, 2L, 2L, 1L, 2L, 2L, 1L),
>         .Label = c("Female", "Male"), class = "factor"),
>     Height = structure(c(1L, 1L, 3L, 2L, 1L, 3L, 1L, 2L, 1L, 1L),
>         .Label = c("Short", "Medium", "Tall"), class = "factor"),
>     Value = c(69.47, 64.61, 74.77, 73.31, 64.76, 72.78, 64.64, 55.96,
>         60.45, 51.11)),
>     .Names = c("Id", "Sex", "Height", "Value"),
>     row.names = c(NA, -10L), class = "data.frame")
>
>
>
>
> > -----Original Message-----
> > From: John Kane [mailto:jrkrideau at inbox.com]
> > Sent: Friday, November 21, 2014 12:33 PM
> > To: John Posner; 'r-help at r-project.org'
> > Subject: RE: [R] dplyr/summarize does not create a true data frame
> >
> > Your code in creating 'frm' is not working for me and it is complicated
> > enough that I don't want to work it out. See ?dput for a better way to
> > supply data.
> > Also see:
> > https://github.com/hadley/devtools/wiki/Reproducibility
> > http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example
> >
> > That said, I don't see why 'my.output[4]' is not working.  Try something
> > like str(frm) to see what you have there and/or resubmit the data in dput
> > format
> >
> > See simple example below:
> >
> > dat1  <- data.frame(aa = sample(1:20, 100, replace = TRUE), bb = 1:100 )
> > dat1[2]
> >
> > John Kane
> > Kingston ON Canada
> >
> >
> > > -----Original Message-----
> > > From: john.posner at mjbiostat.com
> > > Sent: Fri, 21 Nov 2014 17:10:16 +0000
> > > To: r-help at r-project.org
> > > Subject: [R] dplyr/summarize does not create a true data frame
> > >
> > > I got an error when trying to extract a 1-column subset of a data
> > > frame (called "my.output") created by dplyr/summarize. The ncol()
> > > function says that my.output has 4 columns, but "my.output[4]" fails.
> > > Note that converting my.output using as.data.frame() makes for a happy
> > ending.
> > >
> > > Is this the intended behavior of dplyr?
> > >
> > > Tx,
> > > John
> > >
> > >> library(dplyr)
> > >
> > >> # set up data frame
> > >> rows = 100
> > >> repcnt = 50
> > >> sexes = c("Female", "Male")
> > >> heights = c("Med", "Short", "Tall")
> > >
> > >> frm = data.frame(
> > > +   Id = paste("P", sprintf("%04d", 1:rows), sep=""),
> > > +   Sex = sample(rep(sexes, repcnt), rows, replace=T),
> > > +   Height = sample(rep(heights, repcnt), rows, replace=T),
> > > +   V1 = round(runif(rows)*25, 2) + 50,
> > > +   V2 = round(runif(rows)*1000, 2) + 50,
> > > +   V3 = round(runif(rows)*350, 2) - 175
> > > + )
> > >>
> > >> # use dplyr/summarize to create data frame
> > >> my.output = frm %>%
> > > +   group_by(Sex, Height) %>%
> > > +   summarize(V1sum=sum(V1), V2sum=sum(V2))
> > >
> > >> # work with columns in the output data frame
> > >> ncol(my.output)
> > > [1] 4
> > >
> > >> my.output[1]
> > > Source: local data frame [6 x 1]
> > > Groups: Sex
> > >
> > >      Sex
> > > 1 Female
> > > 2 Female
> > > 3 Female
> > > 4   Male
> > > 5   Male
> > > 6   Male
> > >
> > >> my.output[4]
> > > Error in eval(expr, envir, enclos) : index out of bounds  ######## ERROR HERE
> > >
> > >> as.data.frame(my.output)[4]
> > >      V2sum
> > > 1 12427.97
> > > 2  8449.82
> > > 3  8610.97
> > > 4  7249.20
> > > 5 12616.91
> > > 6 10372.15
> > >>
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
> >
>


-- 
http://had.co.nz/
