[R] User-defined functions in dplyr

Axel Urbiz axel.urbiz at gmail.com
Tue Nov 3 02:38:53 CET 2015


Actually, the results are not the same. In the code below (see "Attempt using dplyr"), the function create_bins2 does not appear to be applied separately to each group defined by group_by. That surprises me, unless I'm misunderstanding dplyr.

### Create some data

set.seed(4)
df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels = c("model1", "model2")))

### This is the code using plyr, which I'd like to change using dplyr

create_bins <- function(x, nBins) {
  Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
  dfB <- data.frame(pred = x$pred,
                    bin = cut(x$pred, breaks = Breaks, include.lowest = TRUE))
  dfB
}

nBins = 10
res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
head(res_plyr)

### Attempt using dplyr

create_bins2 <- function (pred, nBins) {
  Breaks <- unique(quantile(pred, probs = seq(0, 1, 1/nBins)))
  bin <- cut(pred, breaks = Breaks, include.lowest = TRUE)
  bin
}

res_dplyr <- dplyr::mutate(dplyr::group_by(df, models),
                           bin = create_bins2(pred, nBins))


identical(res_plyr, as.data.frame(res_dplyr))
# [1] FALSE
#levels(res_dplyr$bin) == levels(res_plyr$bin)
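
One way to check whether the bins really are computed per group is to build a reference in base R and compare the bin labels rather than the whole data frames. Note that identical() on the two data frames can return FALSE for reasons unrelated to grouping: ddply places the grouping column (models) first, while mutate keeps the original column order. The sketch below is only an illustration under assumptions: bin_labels, ref and n_labels are names introduced here, not part of the code above, and the dplyr comparison runs only if dplyr is installed.

```r
set.seed(4)
df <- data.frame(pred = rnorm(100),
                 models = gl(2, 50, 100, labels = c("model1", "model2")))
nBins <- 10

# Hypothetical helper: same binning as create_bins2, but returning character
# labels so that groups with different break points can be combined safely.
bin_labels <- function(pred, nBins) {
  Breaks <- unique(quantile(pred, probs = seq(0, 1, 1/nBins)))
  as.character(cut(pred, breaks = Breaks, include.lowest = TRUE))
}

# Base R reference: apply the helper to each model separately, then put the
# labels back in the original row order.
ref <- unsplit(lapply(split(df$pred, df$models), bin_labels, nBins = nBins),
               df$models)

# Number of distinct bin labels within each model.
n_labels <- tapply(ref, df$models, function(z) length(unique(z)))
print(n_labels)

# Compare against a grouped mutate, only if dplyr is available.
if (requireNamespace("dplyr", quietly = TRUE)) {
  res_dplyr <- dplyr::mutate(dplyr::group_by(df, models),
                             bin = bin_labels(pred, nBins))
  print(all(res_dplyr$bin == ref))
}
```

If the final comparison prints TRUE, that would suggest the grouped mutate is binning each model separately, and that the FALSE from identical() reflects differences in layout rather than in bin assignment.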

Thanks,
Axel.



> On Oct 30, 2015, at 12:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
> 
> dplyr::mutate is probably what you want instead of dplyr::summarize:
> 
> create_bins3 <- function (xpred, nBins) 
> {
>     Breaks <- unique(quantile(xpred, probs = seq(0, 1, 1/nBins)))
>     bin <- cut(xpred, breaks = Breaks, include.lowest = TRUE)
>     bin
> }
> dplyr::group_by(df, models) %>% dplyr::mutate(Bin=create_bins3(pred,nBins))
> #Source: local data frame [100 x 3]
> #Groups: models [2]
> #
> #         pred models               Bin
> #        (dbl) (fctr)            (fctr)
> #1   0.2167549 model1     (0.167,0.577]
> #2  -0.5424926 model1   (-0.869,-0.481]
> ...
> 
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> On Fri, Oct 30, 2015 at 9:06 AM, William Dunlap <wdunlap at tibco.com> wrote:
> The error message is not very helpful and the stack trace is pretty inscrutable as well
> > dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
> Error: not a vector
> > traceback()
> 14: stop(list(message = "not a vector", call = NULL, cppstack = NULL))
> 13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
> 12: summarise_impl(.data, dots)
> 11: summarise_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
> 10: summarise_(.data, .dots = lazyeval::lazy_dots(...))
> 9: dplyr::summarize(., create_bins)
> 8: function_list[[k]](value)
> 7: withVisible(function_list[[k]](value))
> 6: freduce(value, `_function_list`)
> 5: `_fseq`(`_lhs`)
> 4: eval(expr, envir, enclos)
> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
> 1: dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
> 
> 
> The message does not mean that your function, create_bins, fails to return a
> vector: passing the sum function itself gives the same error.
> help(summarize, package="dplyr") says:
>      ...: Name-value pairs of summary functions like ‘min()’, ‘mean()’,
>           ‘max()’ etc.
> It apparently means calls to summary functions, not summary functions
> themselves.  The examples in the help file show the proper usage.
> 
> Use a call to your function and you will see that it gets further:
>    > dplyr::group_by(df, models) %>% dplyr::summarize(create_bins(pred,nBins))
>    Error: $ operator is invalid for atomic vectors
> The traceback again is not very useful, because the call information was
> stripped by dplyr (by the call=NULL in the call to stop()):  
>   > traceback()
>   14: stop(list(message = "$ operator is invalid for atomic vectors", 
>           call = NULL, cppstack = NULL))
>   13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
> However it is clear that the fault is in your function, which is expecting a
> data.frame x with a column called pred but gets pred itself.  Change x to xpred
> in the argument list and x$pred to xpred in the body of the function.
> 
> You will run into more problems because your function returns a vector
> the length of its input but summarize expects a summary function - one
> that returns a scalar for any size vector input.
> 
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
> 
> On Fri, Oct 30, 2015 at 4:04 AM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
> So in this case, "create_bins" returns a vector and I still get the same
> error.
> 
> 
> create_bins <- function(x, nBins)
> {
>   Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>   bin <- cut(x$pred, breaks = Breaks, include.lowest = TRUE)
>   bin
> }
> 
> 
> ### Using dplyr (fails)
> nBins = 10
> by_group <- dplyr::group_by(df, models)
> res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
> Error: not a vector
> 
> On Thu, Oct 29, 2015 at 8:28 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
> 
> > You are jumping the gun (your other email did get through) and you are
> > posting using HTML (which does not come through on the list). Some time
> > (re)reading the Posting Guide mentioned at the bottom of all emails on this
> > list seems to be in order.
> >
> > The error is actually quite clear. You should return a vector from your
> > function, not a data frame.
> > ---------------------------------------------------------------------------
> > Jeff Newmiller                        The     .....       .....  Go Live...
> > DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
> >                                       Live:   OO#.. Dead: OO#..  Playing
> > Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
> > /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
> > ---------------------------------------------------------------------------
> > Sent from my phone. Please excuse my brevity.
> >
> > On October 29, 2015 4:55:19 PM MST, Axel Urbiz <axel.urbiz at gmail.com> wrote:
> > >Hello,
> > >
> > >Sorry, resending this question as the prior was not sent properly.
> > >
> > >I’m using the plyr package below to add a variable named "bin" to my
> > >original data frame "df" with the user-defined function "create_bins".
> > >I'd like to get similar results using dplyr instead, but I am failing
> > >to do so.
> > >
> > >set.seed(4)
> > >df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels =
> > >c("model1", "model2")))
> > >
> > >
> > >### Using plyr (works fine)
> > >create_bins <- function(x, nBins)
> > >{
> > >  Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
> > >  dfB <-  data.frame(pred = x$pred,
> > >                    bin = cut(x$pred, breaks = Breaks, include.lowest =
> > >TRUE))
> > >  dfB
> > >}
> > >
> > >nBins = 10
> > >res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
> > >head(res_plyr)
> > >
> > >### Using dplyr (fails)
> > >
> > >by_group <- dplyr::group_by(df, models)
> > >res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
> > >Error: not a vector
> > >
> > >
> > >Any help would be much appreciated.
> > >
> > >Best,
> > >Axel.
> > >
> > >       [[alternative HTML version deleted]]
> > >
> > >______________________________________________
> > >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> >

