[Rd] the pipe |> and line breaks in pipelines

Wed Dec 9 21:51:01 CET 2020

   I definitely support the idea that if this kind of trickery is going to
happen, it should be confined to some particular IDE/environment or some
particular submission protocol. I don't want it to happen in my ESS
session, please ... I'd rather deal with the parentheses.

On 12/9/20 3:45 PM, Timothy Goodman wrote:
> Regarding special treatment for |>, isn't it getting special treatment
> anyway, because it's implemented as a syntax transformation from x |> f(y)
> to f(x, y), rather than as an operator?
> 
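The rewrite described above can be checked from the console. A minimal sketch, assuming R 4.1.0 or later (where |> is available); x, y and f are placeholder names that are never evaluated:

```r
# Assumes R >= 4.1.0, where the native pipe is available.
# quote() returns the parsed call without evaluating it, so the
# placeholder names x, y and f do not need to exist.
piped  <- quote(x |> f(y))
direct <- quote(f(x, y))

# The parser has already rewritten the pipe into an ordinary call,
# so the two quoted expressions can be compared directly.
print(piped)
identical(piped, direct)

# By contrast, a %>% pipeline is kept as a call to the function `%>%`.
magrittr_style <- quote(x %>% f(y))
as.character(magrittr_style[[1]])
```

Since parsing alone is enough here, neither magrittr nor any function named f has to be available.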
> That said, the point about wanting a block of code submitted line-by-line
> to work the same as a block of code submitted all at once is a fair one.
> Maybe the better solution would be if there were a way to say "Submit the
> selected code as a single expression, ignoring line-breaks".  Then I could
> run any number of lines with pipes at the start and no special character at
> the end, and have it treated as a single pipeline.  I suppose that'd need
> to be a feature offered by the environment (RStudio's R Notebooks in my
> case).  I could wrap my pipelines in parentheses (to make the "pipes at
> start of line" syntax valid R code), and then could use the hypothetical
> "submit selected code ignoring line-breaks" feature when running just the
> first part of the pipeline -- i.e., selecting full lines, but starting
> after the opening paren so as not to need to insert a closing paren.
> 
> - Tim
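One way an environment could approximate the "submit as a single expression" feature suggested above is to wrap the selection in parentheses before parsing, since R's parser does not treat a line break inside ( ) as the end of an expression. The helper name below is hypothetical (it is not an existing RStudio or ESS feature), and the example assumes R 4.1.0 or later for |>:

```r
# Hypothetical helper: not part of R, RStudio, or ESS.
submit_as_single_expression <- function(code, envir = parent.frame()) {
  # The newline before ")" keeps a trailing comment in `code` from
  # swallowing the closing parenthesis.
  wrapped <- paste0("(", code, "\n)")
  exprs <- parse(text = wrapped)
  eval(exprs[[1]], envir)
}

# A selection with pipes at the start of each line (invalid on its own).
selection <- "c(1, 4, 9)
  |> sqrt()
  |> sum()"

result <- submit_as_single_expression(selection)
print(result)
```

This is only a sketch: it fails on a selection that contains several statements, and the added parentheses make an otherwise invisible result visible.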
> 
> On Wed, Dec 9, 2020 at 12:12 PM Duncan Murdoch <murdoch.duncan using gmail.com>
> wrote:
> 
>> On 09/12/2020 2:33 p.m., Timothy Goodman wrote:
>>> If I type my_data_frame_1 and press Enter (or Ctrl+Enter to execute the
>>> command in the Notebook environment I'm using) I certainly *would*
>>> expect R to treat it as a complete statement.
>>>
>>> But what I'm talking about is a different case, where I highlight a
>>> multi-line statement in my notebook:
>>>
>>>       my_data_frame_1
>>>           |> filter(some_conditions_1)
>>>
>>> and then press Ctrl+Enter.
>>
>> I don't think I'd like it if parsing changed between passing one line at
>> a time and passing a block of lines.  I'd like to be able to highlight a
>> few lines and pass those, then type one, then highlight some more and
>> pass those:  and have it act as though I just passed the whole combined
>> block, or typed everything one line at a time.
>>
>>
>>> Or, I suppose the equivalent would be to run
>>> an R script containing those two lines of code, or to run a multi-line
>>> statement like that from the console (which in RStudio I can do by
>>> pressing Shift+Enter between the lines.)
>>>
>>> In those cases, R could either (1) Give an error message [the current
>>> behavior], or (2) understand that the first line is meant to be piped to
>>> the second.  The second option would be significantly more useful, and
>>> is almost certainly what the user intended.
>>>
>>> (For what it's worth, there are some languages, such as JavaScript, that
>>> consider the first token of the next line when determining if the
>>> previous line was complete.  JavaScript's rules around this are overly
>>> complicated, but a rule like "a pipe following a line break is treated
>>> as continuing the previous line" would be much simpler.  And while it
>>> might be objectionable to treat the operator %>% differently from other
>>> operators, the addition of |>, which isn't truly an operator at all,
>>> seems like the right time to consider it.)
>>
>> I think this would be hard to implement with R's current parser, but
>> possible.  I think it could be done by distinguishing between EOL
>> markers within a block of text and "end of block" marks.  If it applied
>> only to the |> operator it would be *really* ugly.
>>
>> My strongest objection to it is the one at the top, though.  If I have a
>> block of lines sitting in my editor that I just finished executing, with
>> the cursor pointing at the next line, I'd like to know that it didn't
>> matter whether the lines were passed one at a time, as a block, or some
>> combination of those.
>>
>> Duncan Murdoch
>>
>>>
>>> -Tim
>>>
>>> On Wed, Dec 9, 2020 at 3:12 AM Duncan Murdoch
>>> <murdoch.duncan using gmail.com> wrote:
>>>
>>>      The requirement for operators at the end of the line comes from the
>>>      interactive nature of R.  If you type
>>>
>>>            my_data_frame_1
>>>
>>>      how could R know that you are not done, and are planning to type the
>>>      rest of the expression
>>>
>>>              %>% filter(some_conditions_1)
>>>              ...
>>>
>>>      before it should consider the expression complete?  The way languages
>>>      like C do this is by requiring a statement terminator at the end.  You
>>>      can also do it by wrapping the entire thing in parentheses ().
>>>
>>>      However, be careful: Don't use braces:  they don't work.  And parens
>>>      have the side effect of removing invisibility from the result (which
>>>      is a design flaw or bonus, depending on your point of view).  So I
>>>      actually wouldn't advise this workaround.
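Both caveats can be checked directly. A small sketch: parses() is a hypothetical helper defined here, and parse() only checks syntax, so magrittr does not need to be loaded and the names are never evaluated:

```r
# Hypothetical helper: TRUE if `code` is syntactically valid R.
parses <- function(code) {
  !inherits(try(parse(text = code), silent = TRUE), "try-error")
}

# A two-line pipeline whose second line starts with the pipe.
leading_pipe <- "my_data_frame_1\n  %>% filter(some_conditions_1)"
in_parens <- paste0("(\n", leading_pipe, "\n)")
in_braces <- paste0("{\n", leading_pipe, "\n}")

parses(leading_pipe)  # bare selection
parses(in_parens)     # wrapped in parentheses
parses(in_braces)     # wrapped in braces

# Visibility of an invisible value, without and with parentheses.
withVisible(invisible(1))$visible
withVisible((invisible(1)))$visible
```

Only the parenthesized version parses, and the last two lines return FALSE and TRUE respectively, which is the loss of invisibility mentioned above.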
>>>
>>>      Duncan Murdoch
>>>
>>>
>>>      On 09/12/2020 12:45 a.m., Timothy Goodman wrote:
>>>       > Hi,
>>>       >
>>>       > I'm a data scientist who routinely uses R in my day-to-day
>>>       > work, for tasks such as cleaning and transforming data,
>>>       > exploratory data analysis, etc.  This includes frequent use of
>>>       > the pipe operator from the magrittr and dplyr libraries, %>%.
>>>       > So, I was pleased to hear about the recent work on a native
>>>       > pipe operator, |>.
>>>       >
>>>       > This seems like a good time to bring up the main pain point I
>>>       > encounter when using pipes in R, and some suggestions on what
>>>       > could be done about it.  The issue is that the pipe operator
>>>       > can't be placed at the start of a line of code (except in
>>>       > parentheses).  That's no different than any binary operator in
>>>       > R, but I find it's a source of difficulty for the pipe because
>>>       > of how pipes are often used.
>>>       >
>>>       > [I'm assuming here that my usage is fairly typical of a lot of
>>>       > users; at any rate, I don't think I'm *too* unusual.]
>>>       >
>>>       > === Why this is a problem ===
>>>       >
>>>       > It's very common (for me, and I suspect for many users of
>>>       > dplyr) to write multi-step pipelines and put each step on its
>>>       > own line for readability.  Something like this:
>>>       >
>>>       >    ### Example 1 ###
>>>       >    my_data_frame_1 %>%
>>>       >      filter(some_conditions_1) %>%
>>>       >      inner_join(my_data_frame_2, by = some_columns_1) %>%
>>>       >      group_by(some_columns_2) %>%
>>>       >      summarize(some_aggregate_functions_1) %>%
>>>       >      filter(some_conditions_2) %>%
>>>       >      left_join(my_data_frame_3, by = some_columns_3) %>%
>>>       >      group_by(some_columns_4) %>%
>>>       >      summarize(some_aggregate_functions_2) %>%
>>>       >      arrange(some_columns_5)
>>>       >
>>>       > [I guess some might consider this an overly long pipeline; for
>>>       > me it's pretty typical.  I *could* split it up by assigning
>>>       > intermediate results to variables, but much of the value I get
>>>       > from the pipe is that it lets my code communicate which results
>>>       > are temporary, and which will be used again later.  Assigning
>>>       > variables for single-use results would remove that
>>>       > expressiveness.]
>>>       >
>>>       > I would prefer (for reasons I'll explain) to be able to write
>>>       > the above example like this, which isn't valid R:
>>>       >
>>>       >    ### Example 2 (not valid R) ###
>>>       >    my_data_frame_1
>>>       >      %>% filter(some_conditions_1)
>>>       >      %>% inner_join(my_data_frame_2, by = some_columns_1)
>>>       >      %>% group_by(some_columns_2)
>>>       >      %>% summarize(some_aggregate_functions_1)
>>>       >      %>% filter(some_conditions_2)
>>>       >      %>% left_join(my_data_frame_3, by = some_columns_3)
>>>       >      %>% group_by(some_columns_4)
>>>       >      %>% summarize(some_aggregate_functions_2)
>>>       >      %>% arrange(some_columns_5)
>>>       >
>>>       > One (minor) advantage is obvious: It lets you easily line up
>>>       > the pipes, which means that you can see at a glance that the
>>>       > whole block is a single pipeline, and you'd immediately notice
>>>       > if you inadvertently omitted a pipe, which otherwise can lead
>>>       > to confusing output.  [It's also aesthetically pleasing,
>>>       > especially when %>% is replaced with |>, but that's
>>>       > subjective.]
>>>       >
>>>       > But the bigger issue happens when I want to re-run just *part*
>>>       > of the pipeline.  I do this often when debugging: if the output
>>>       > of the pipeline seems wrong, I re-run the first few steps and
>>>       > check the output, then include a little more and re-run again,
>>>       > etc., until I locate my mistake.  Working in an interactive
>>>       > notebook environment, this involves using the cursor to select
>>>       > just the part of the code I want to re-run.
>>>       >
>>>       > It's fast and easy to select *entire* lines of code, but
>>>       > unfortunately with the pipes placed at the end of the line I
>>>       > must instead select everything *except* the last three
>>>       > characters of the line (the last two characters for the new
>>>       > pipe).  Then when I want to re-run the same partial pipeline
>>>       > with the next line of code included, I can't just press
>>>       > SHIFT+Down to select it as I otherwise would, but instead must
>>>       > move the cursor horizontally to a position three characters
>>>       > before the end of *that* line (which is generally different due
>>>       > to varying line lengths).  And so forth each time I want to
>>>       > include an additional line.
>>>       >
>>>       > Moreover, with the staggered positions of the pipes at the end
>>>       > of each line, it's very easy to accidentally select the final
>>>       > pipe on a line, and then sit there for a moment wondering if
>>>       > the environment has stopped responding before realizing it's
>>>       > just waiting for further input (i.e., for the right-hand side).
>>>       > These small delays and disruptions add up over the course of a
>>>       > day.
>>>       >
>>>       > This desire to select and re-run the first part of a pipeline
>>>       > is also the reason why it doesn't suffice to achieve syntax
>>>       > like my "Example 2" by wrapping the entire pipeline in
>>>       > parentheses.  That's of no use if I want to re-run a selection
>>>       > that doesn't include the final close-paren.
>>>       >
>>>       > === Possible Solutions ===
>>>       >
>>>       > I can think of two, but maybe there are others.  The first
>>>       > would make "Example 2" into valid code, and the second would
>>>       > allow you to run a selection that included a trailing pipe.
>>>       >
>>>       >    Solution 1: Add a special case to how R is parsed, so if the
>>>       >    first (non-whitespace) token after an end-line is a pipe,
>>>       >    that pipe gets moved to before the end-line.
>>>       >      - Argument for: This lets you write code like example 2,
>>>       >        which addresses the pain point around re-running part of
>>>       >        a pipeline, and has advantages for readability.  Also,
>>>       >        since starting a line with a pipe operator is currently
>>>       >        invalid, the change wouldn't break any working code.
>>>       >      - Argument against: It would make the behavior of %>%
>>>       >        inconsistent with that of other binary operators in R.
>>>       >        (However, this objection might not apply to the new
>>>       >        pipe, |>, which I understand is being implemented as a
>>>       >        syntax transformation rather than a binary operator.)
>>>       >
>>>       >    Solution 2: Ignore the pipe operator if it occurs as the
>>>       >    final token of the code being executed.
>>>       >      - Argument for: This would mean the user could select and
>>>       >        re-run the first few lines of a longer pipeline
>>>       >        (selecting *entire* lines), avoiding the difficulties
>>>       >        described above.
>>>       >      - Argument against: This means that %>% would be valid
>>>       >        even if it occurred without a right-hand side, which is
>>>       >        inconsistent with other operators in R.  (But, as above,
>>>       >        this objection might not apply to |>.)  Also, this
>>>       >        solution still doesn't enable the syntax of "Example 2",
>>>       >        with its readability benefit.
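Short of changing the parser, both solutions could be approximated by preprocessing the selected text before it is submitted. A rough sketch with two hypothetical helpers (neither exists in R or in any IDE); the regular expressions are deliberately naive and would also rewrite pipes that appear inside strings or comments, and the evaluated examples assume R 4.1.0 or later for |>:

```r
# Hypothetical helpers operating on the selected code as one string.

# Solution 1: move a pipe that starts a line to the end of the previous
# line, keeping the indentation of the line it came from.
move_leading_pipes <- function(code) {
  gsub("[[:space:]]*\n([[:space:]]*)(%>%|\\|>)[[:space:]]*",
       " \\2\n\\1", code, perl = TRUE)
}

# Solution 2: drop a pipe that is the final token of the selection.
strip_trailing_pipe <- function(code) {
  sub("[[:space:]]*(%>%|\\|>)[[:space:]]*$", "", code, perl = TRUE)
}

# "Example 2" style: pipes at the start of each line.
example_2 <- "c(1, 4, 9)\n  |> sqrt()\n  |> sum()"
rewritten <- move_leading_pipes(example_2)
cat(rewritten, "\n")
total <- eval(parse(text = rewritten))

# Whole lines selected from an "Example 1" style pipeline,
# including the trailing pipe of the last selected line.
partial <- "c(1, 4, 9) |>\n  sqrt() |>"
roots <- eval(parse(text = strip_trailing_pipe(partial)))
```

Because the rewriting happens before the text reaches R, line-by-line and block submission of ordinary code would still parse identically, which is the property Duncan's objection is about.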
>>>       >
>>>       > Thanks for reading this and considering it.
>>>       >
>>>       > - Tim Goodman
>>>       >
>>>       >       [[alternative HTML version deleted]]
>>>       >
>>>       > ______________________________________________
>>>       > R-devel using r-project.org mailing list
>>>       > https://stat.ethz.ch/mailman/listinfo/r-devel
>>>       >
>>>
>>
>>