[R] Using and abusing %>% (was Re: Why can't I access this type?)

Patrick Connolly p_connolly at slingshot.co.nz
Sat Mar 28 08:48:52 CET 2015


On Fri, 27-Mar-2015 at 03:27PM +0100, Henric Winell wrote:

|> On 2015-03-26 07:48, Patrick Connolly wrote:
|> 
|> >On Wed, 25-Mar-2015 at 03:14PM +0100, Henric Winell wrote:
|> >
|> >...
|> >
|> >|> Well...  Opinions may perhaps differ, but apart from '%>%' being
|> >|> butt-ugly it's also fairly slow:
|> >
|> >Beauty, it is said, is in the eye of the beholder.  I'm impressed by
|> >the way using %>% reduces or eliminates complicated nested brackets.
|> 
|> I didn't dispute whether '%>%' may be useful -- I just pointed out

Likewise, I didn't dispute that it is slower than other ways; I was
disputing the claim that it is ugly.

|> that it is slow.  However, it is only part of the problem:
|> 'filter()' and 'select()', although aesthetically pleasing, also
|> seem to be slow:

So not 'butt ugly' like '%>%'?

|> 
....

|> > mb <- microbenchmark(
|> +     f1(), f2(), f3(), f4(), f5(), f6(),
|> +     times = 1000L
|> + )
|> > print(mb, signif = 3L)
|> Unit: microseconds
|>  expr min   lq      mean median   uq  max neval   cld
|>  f1() 115  124  134.8812    129  134 1500  1000 a
|>  f2() 128  141  147.4694    145  151 1520  1000 a
|>  f3() 303  328  344.3175    338  348 1740  1000  b
|>  f4() 458  494  518.0830    510  523 1890  1000   c
|>  f5() 806  848  887.7270    875  894 3510  1000    d
|>  f6() 971 1010 1056.5659   1040 1060 3110  1000     e
|> 
|> So, using '%>%', but leaving 'filter()' and 'select()' out of the
|> equation, as in 'f4()' is only half as bad as the "full" 'dplyr'
|> idiom in 'f6()'.  In this case, since we're talking microseconds,
|> the speed-up is negligible but that *is* beside the point.

Agreed that the more 'dplyr' is used, the slower things get, but I
don't agree that that's an issue except in packages, which should be
optimized.  The lack of speed won't stop me using it, any more than
I'll stop using dataframes just because matrices are much faster.
The OP's example can be done using matrix syntax:

state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE] 


which is more than an order of magnitude faster than subscripting a
dataframe.  See No. 4 in the benchmark below:

  microbenchmark(## 1. using subset()
        subset(all.states, all.states$Frost > 150, select = c("state","Frost")),
        ## 2. standard R indexing
        all.states[all.states$Frost > 150, c("state","Frost")],
        ## 3. leave out redundant 'state' column
        all.states[all.states$Frost > 150, "Frost", drop = FALSE],
        ## 4. avoid using 'slow' dataframes altogether
        state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE],
        ## 5. easy, slow way without square brackets or quote marks
        all.states %>% filter(Frost > 150) %>% select(state, Frost),
        times = 1000L
        )

Unit: microseconds
                                                                      expr      min        lq       mean    median        uq      max neval cld
  subset(all.states, all.states$Frost > 150, select = c("state", "Frost"))  223.960  229.9290  236.16557  232.4060  241.4165  291.083  1000   c
                   all.states[all.states$Frost > 150, c("state", "Frost")]  177.187  182.6075  203.04666  185.1475  194.4815 7259.760  1000   c
                 all.states[all.states$Frost > 150, "Frost", drop = FALSE]  125.281  130.4835  135.83826  132.6985  141.7375  210.576  1000  b
              state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]    6.442   10.3860   10.61733   11.0405   11.4855   25.077  1000 a
               all.states %>% filter(Frost > 150) %>% select(state, Frost) 1416.592 1437.7015 1562.91898 1447.5695 1473.4440 9394.071  1000   d
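For anyone wanting to rerun the benchmark: 'all.states' isn't defined
in this excerpt.  A plausible construction (my assumption, not taken
from the thread) from the built-in 'state.x77' matrix is:

```r
## 'all.states': the built-in matrix as a dataframe, with the state
## names duplicated into a 'state' column (assumed construction)
all.states <- data.frame(state = rownames(state.x77), state.x77)

## Sanity check: the matrix and dataframe subsets select the same states
m <- state.x77[state.x77[, "Frost"] > 150, "Frost", drop = FALSE]
d <- all.states[all.states$Frost > 150, "Frost", drop = FALSE]
identical(rownames(m), rownames(d))   # TRUE
```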

[...]

|> 
|> >In this tiny example it's not obvious but it's very clear if the
|> >objective is to sort the dataframe by three or four columns and
|> >various lots of aggregation then returning a largish number of
|> >consecutive columns, omitting the rest.  It's very easy to see what's
|> >going on without the need for intermediate objects.
|> 
|> Why are you opposed to using intermediate objects?  In this case,

I'm not opposed to intermediate objects nor to dogs.  It's just easier
to keep things tidy without either.


|> as can be seen from 'f3()', it will also have the benefit of being
|> faster than either '%>%' or the "full" 'dplyr' idiom.
|> 
|> >|> [...]
|> >
|> >It's no surprise that instructing a computer in something closer to
|> >human language is an order of magnitude slower.
|> 
|> Certainly not true, at least for compiled languages.  In any case,
|> judging from off-list correspondence, it definitely came as a
|> surprise to some R users...
|> 
|> Given that '%>%' is so heavily marketed through 'dplyr', where the
|> latter is said to provide "blazing fast performance for in-memory
|> data by writing key pieces in C++" and "a fast, consistent tool for
|> working with data frame like objects, both in memory and out of
|> memory", I don't think it's far-fetched to expect that it should be
|> more performant than base R.
|> 

I've never come across 'marketing' of free software.  Evidently that's
a looser use of the word.

...


|> >I spend 3 or 4 orders of magnitude more time writing code than running it.
|> 
|> You and me both.  But that doesn't mean speed is of no or little importance.

I never claimed it was.  Tardiness hasn't yet become an issue for me.
When it does, I'll revert to the old ways.

|> 
|> >It's much more important to me to be able to read and modify than
|> > it is to have it run at optimum speed.
|> 
|> Good for you.  But surely, if this is your goal, nothing beats
|> intermediate objects.  

Nothing except chaining, that is.  I went 16 years without it and now
find it amazing how useful it is.  As they say: You're never too old
to learn.


|>  And like I said, it may still be faster than the 'dplyr' idiom.
|> 
|> >|> Of course, this doesn't matter for interactive one-off use.  But
|> >|> lately I've seen examples of the '%>%' operator creeping into
|> >|> functions in packages.
|> >
|> >That could indicate that %>% is seductively easy to use.  It's
|> >probably true that there are places where it should be done the hard
|> >way.
|> 
|> We all know how easy it is to write ugly and sluggish code in R.
|> But 'foo[i,j]' is neither ugly nor sluggish and certainly not "the
|> hard way."

I meant to put a ':-)' in there.  Such adjectives as 'easy' and 'hard'
are relative.  There's little difference in difficulty at each step,
but integrating the steps and revising them later are considerably
easier using the so-called "'dplyr' idiom" -- especially when each
link in the chain is on a separate line.
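To illustrate the point about one link per line (the dataframe here is
my own stand-in, built from the built-in 'state.x77' matrix):

```r
library(dplyr)

## Stand-in for the thread's 'all.states' (assumed construction)
states <- data.frame(state = rownames(state.x77), state.x77)

## Nested form: read from the inside out
cold1 <- select(filter(states, Frost > 150), state, Frost)

## Chained form, one link per line: read top to bottom, and easy
## to insert, delete, or reorder steps when revising later
cold2 <- states %>%
    filter(Frost > 150) %>%
    select(state, Frost)

identical(cold1, cold2)   # TRUE
```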

|> 
|> >|>  However, it would be nice to see a fast pipe operator as part of
|> >|> base R.
|> 
|> Heck, it doesn't even have to be fast as long as it's a bit more
|> elegant than '%>%'.

IMHO, %>% fits in nicely with %/%, %%, and %in%.  Elegance, like
beauty, is in the eye of the beholder.
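As an aside, nothing stops one from defining a bare-bones pipe in base
R today.  A minimal sketch (the name '%>>%' is made up here, and it
handles only the plain f(x) case -- no '.' placeholder, unlike
magrittr's '%>%'):

```r
## A deliberately minimal forward pipe: lhs %>>% f(args) evaluates
## f(lhs, args).  No placeholder support; for illustration only.
`%>>%` <- function(lhs, rhs) {
    rhs <- substitute(rhs)
    if (is.call(rhs)) {
        ## Splice the left-hand value in as the first argument
        eval(as.call(c(rhs[[1L]], lhs, as.list(rhs)[-1L])), parent.frame())
    } else {
        ## Bare function name: just apply it
        eval(as.call(list(rhs, lhs)), parent.frame())
    }
}

c(3, 1, 2) %>>% sort() %>>% rev()   # [1] 3 2 1
```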

|> 
|> 
|> Henric Winell
|> 
|> 
|> 
|> >
|> >|>
|> >|>
|> >|> Henric Winell
|> >|>
|> >

-- 
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.   
   ___    Patrick Connolly   
 {~._.~}                   Great minds discuss ideas    
 _( Y )_  	         Average minds discuss events 
(:_~*~_:)                  Small minds discuss people  
 (_)-(_)  	                      ..... Eleanor Roosevelt
	  
~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.~.


