[R] User-defined functions in dplyr

William Dunlap wdunlap at tibco.com
Tue Nov 3 02:58:17 CET 2015


dplyr::mutate does not combine factor variables well across groups.  The
result seems to get its levels from the levels computed for the first group,
and mutate does not check whether later groups produce different levels.

> data.frame(group=rep(c("A","B","C"),each=2),
+            value=rep(c("X","Y","Z"),3:1)) %>%
+   dplyr::group_by(group) %>%
+   dplyr::mutate(fv=factor(value))
Source: local data frame [6 x 3]
Groups: group [3]

   group  value     fv
  (fctr) (fctr) (fctr)
1      A      X      X
2      A      X      X
3      B      X      X
4      B      Y     NA
5      C      Y      X
6      C      Z     NA
> levels(.Last.value$fv)
[1] "X"
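
One way to sidestep this, sketched here on the assumption that character
columns are combined across groups without any level mismatch, is to return
a character vector from the grouped mutate and build the factor once after
ungrouping.  The names d and res are only for illustration.

```r
library(dplyr)

# Same toy data as above, kept as character rather than factor
d <- data.frame(group = rep(c("A", "B", "C"), each = 2),
                value = rep(c("X", "Y", "Z"), 3:1),
                stringsAsFactors = FALSE)

res <- d %>%
  group_by(group) %>%
  # per-group factor, converted back to character before groups are combined
  mutate(fv = as.character(factor(value))) %>%
  ungroup() %>%
  # one factor built over all rows, so the levels cover every group
  mutate(fv = factor(fv))

levels(res$fv)  # "X" "Y" "Z"
```

The cost is that the level order is now the default sorted order over all
rows rather than whatever each group would have produced on its own.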



Bill Dunlap
TIBCO Software
wdunlap tibco.com

On Mon, Nov 2, 2015 at 5:38 PM, Axel Urbiz <axel.urbiz at gmail.com> wrote:

> Actually, the results are not the same. It looks like in the code below (see
> "Attempt using dplyr"), the function create_bins2 is not being applied
> separately to each "group_by" group. That is surprising to me, or I'm
> misunderstanding dplyr.
>
> ### Create some data
>
> set.seed(4)
> df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels =
> c("model1", "model2")))
>
> ### This is the code using plyr, which I'd like to change using dplyr
>
> create_bins <- function(x, nBins) {
>   Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>   dfB <- data.frame(pred = x$pred,
>                     bin = cut(x$pred, breaks = Breaks,
>                               include.lowest = TRUE))
>   dfB
> }
>
> nBins = 10
> res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
> head(res_plyr)
>
> ### Attempt using dplyr
>
> create_bins2 <- function (pred, nBins) {
>   Breaks <- unique(quantile(pred, probs = seq(0, 1, 1/nBins)))
>   bin <- cut(pred, breaks = Breaks, include.lowest = TRUE)
>   bin
> }
>
> res_dplyr <- dplyr::mutate(dplyr::group_by(df, models),
>                                           bin=create_bins2(pred, nBins))
>
>
> identical(res_plyr, as.data.frame(res_dplyr))
> [1] FALSE
> #levels(res_dplyr$bin) == levels(res_plyr$bin)
>
> Thanks,
> Axel.
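
A minimal sketch of the dplyr attempt with the bin labels returned as
character, so that each model keeps the labels computed from its own
quantiles.  The helper name create_bins_chr is made up for this example, and
the sketch assumes character labels are acceptable for the "bin" column.

```r
library(dplyr)

set.seed(4)
df <- data.frame(pred = rnorm(100),
                 models = gl(2, 50, 100, labels = c("model1", "model2")))
nBins <- 10

# Hypothetical variant of create_bins2: same breaks, but character labels
create_bins_chr <- function(pred, nBins) {
  Breaks <- unique(quantile(pred, probs = seq(0, 1, 1/nBins)))
  as.character(cut(pred, breaks = Breaks, include.lowest = TRUE))
}

res_chr <- df %>%
  group_by(models) %>%
  mutate(bin = create_bins_chr(pred, nBins)) %>%
  ungroup()

# Distinct bin labels found within each model
tapply(res_chr$bin, res_chr$models, function(b) length(unique(b)))
```

If a factor is needed afterwards, it can be built from res_chr$bin once the
data are ungrouped.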
>
>
>
> On Oct 30, 2015, at 12:19 PM, William Dunlap <wdunlap at tibco.com> wrote:
>
> dplyr::mutate is probably what you want instead of dplyr::summarize:
>
> create_bins3 <- function (xpred, nBins)
> {
>     Breaks <- unique(quantile(xpred, probs = seq(0, 1, 1/nBins)))
>     bin <- cut(xpred, breaks = Breaks, include.lowest = TRUE)
>     bin
> }
> dplyr::group_by(df, models) %>% dplyr::mutate(Bin=create_bins3(pred,nBins))
> #Source: local data frame [100 x 3]
> #Groups: models [2]
> #
> #         pred models               Bin
> #        (dbl) (fctr)            (fctr)
> #1   0.2167549 model1     (0.167,0.577]
> #2  -0.5424926 model1   (-0.869,-0.481]
> ...
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Fri, Oct 30, 2015 at 9:06 AM, William Dunlap <wdunlap at tibco.com> wrote:
>
>> The error message is not very helpful and the stack trace is pretty
>> inscrutable as well:
>> > dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
>> Error: not a vector
>> > traceback()
>> 14: stop(list(message = "not a vector", call = NULL, cppstack = NULL))
>> 13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
>> 12: summarise_impl(.data, dots)
>> 11: summarise_.tbl_df(.data, .dots = lazyeval::lazy_dots(...))
>> 10: summarise_(.data, .dots = lazyeval::lazy_dots(...))
>> 9: dplyr::summarize(., create_bins)
>> 8: function_list[[k]](value)
>> 7: withVisible(function_list[[k]](value))
>> 6: freduce(value, `_function_list`)
>> 5: `_fseq`(`_lhs`)
>> 4: eval(expr, envir, enclos)
>> 3: eval(quote(`_fseq`(`_lhs`)), env, env)
>> 2: withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
>> 1: dplyr::group_by(df, models) %>% dplyr::summarize(create_bins)
>>
>>
>> It does not mean that your function, create_bins, does not return a
>> vector: passing the sum function itself gives the same error.
>> help(summarize,package="dplyr") says:
>>      ...: Name-value pairs of summary functions like ‘min()’, ‘mean()’,
>>           ‘max()’ etc.
>> It apparently means calls to summary functions, not summary functions
>> themselves.  The examples in the help file show the proper usage.
>>
>> Use a call to your function and you will see it works better:
>>    > dplyr::group_by(df, models) %>%
>>    +   dplyr::summarize(create_bins(pred,nBins))
>>    Error: $ operator is invalid for atomic vectors
>> The traceback again is not very useful, because the call information was
>> stripped by dplyr (by the call=NULL in the call to stop()):
>>   > traceback()
>>   14: stop(list(message = "$ operator is invalid for atomic vectors",
>>           call = NULL, cppstack = NULL))
>>   13: .Call("dplyr_summarise_impl", PACKAGE = "dplyr", df, dots)
>> However, it is clear that the fault is in your function, which is
>> expecting a data.frame x with a column called pred but gets pred itself.
>> Change x to xpred in the argument list and x$pred to xpred in the body
>> of the function.
>>
>> You will run into more problems because your function returns a vector
>> the length of its input but summarize expects a summary function - one
>> that returns a scalar for any size vector input.
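
The distinction can be sketched as follows: summarize keeps one row per
group, so each expression should yield a single value per group, while
mutate keeps every row, so each expression should yield one value per row.
The column names mean_pred, n and centered are made up for this example.

```r
library(dplyr)

set.seed(4)
df <- data.frame(pred = rnorm(100),
                 models = gl(2, 50, 100, labels = c("model1", "model2")))

by_group <- group_by(df, models)

# summarize: one scalar per group, so one output row per group
per_group <- summarize(by_group, mean_pred = mean(pred), n = n())

# mutate: one value per input row, computed within each group
per_row <- mutate(by_group, centered = pred - mean(pred))

nrow(per_group)  # 2
nrow(per_row)    # 100
```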
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Fri, Oct 30, 2015 at 4:04 AM, Axel Urbiz <axel.urbiz at gmail.com> wrote:
>>
>>> So in this case, "create_bins" returns a vector and I still get the same
>>> error.
>>>
>>>
>>> create_bins <- function(x, nBins)
>>> {
>>>   Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>>>   bin <- cut(x$pred, breaks = Breaks, include.lowest = TRUE)
>>>   bin
>>> }
>>>
>>>
>>> ### Using dplyr (fails)
>>> nBins = 10
>>> by_group <- dplyr::group_by(df, models)
>>> res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
>>> Error: not a vector
>>>
>>> On Thu, Oct 29, 2015 at 8:28 PM, Jeff Newmiller <jdnewmil at dcn.davis.ca.us> wrote:
>>>
>>> > You are jumping the gun (your other email did get through) and you are
>>> > posting using HTML (which does not come through on the list). Some time
>>> > (re)reading the Posting Guide mentioned at the bottom of all emails on
>>> > this list seems to be in order.
>>> >
>>> > The error is actually quite clear. You should return a vector from your
>>> > function, not a data frame.
>>> >
>>> > On October 29, 2015 4:55:19 PM MST, Axel Urbiz <axel.urbiz at gmail.com> wrote:
>>> > >Hello,
>>> > >
>>> > >Sorry, resending this question as the prior was not sent properly.
>>> > >
>>> > >I’m using the plyr package below to add a variable named "bin" to my
>>> > >original data frame "df" with the user-defined function "create_bins".
>>> > >I'd like to get similar results using dplyr instead, but am failing to
>>> > >do so.
>>> > >
>>> > >set.seed(4)
>>> > >df <- data.frame(pred = rnorm(100), models = gl(2, 50, 100, labels =
>>> > >c("model1", "model2")))
>>> > >
>>> > >
>>> > >### Using plyr (works fine)
>>> > >create_bins <- function(x, nBins)
>>> > >{
>>> > >  Breaks <- unique(quantile(x$pred, probs = seq(0, 1, 1/nBins)))
>>> > >  dfB <- data.frame(pred = x$pred,
>>> > >                    bin = cut(x$pred, breaks = Breaks,
>>> > >                              include.lowest = TRUE))
>>> > >  dfB
>>> > >}
>>> > >
>>> > >nBins = 10
>>> > >res_plyr <- plyr::ddply(df, plyr::.(models), create_bins, nBins)
>>> > >head(res_plyr)
>>> > >
>>> > >### Using dplyr (fails)
>>> > >
>>> > >by_group <- dplyr::group_by(df, models)
>>> > >res_dplyr <- dplyr::summarize(by_group, create_bins, nBins)
>>> > >Error: not a vector
>>> > >
>>> > >
>>> > >Any help would be much appreciated.
>>> > >
>>> > >Best,
>>> > >Axel.
>>> > >
>>> > >       [[alternative HTML version deleted]]
>>> > >
>>> > >______________________________________________
>>> > >R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> > >https://stat.ethz.ch/mailman/listinfo/r-help
>>> > >PLEASE do read the posting guide
>>> > >http://www.R-project.org/posting-guide.html
>>> <http://www.r-project.org/posting-guide.html>
>>> > >and provide commented, minimal, self-contained, reproducible code.
>>> >
>>> >
