[Rd] Mapping parse tree elements to tokens

Wed Jul 29 22:10:50 CEST 2015

I agree that we don't want to depend on implementation details. Some
sort of abstraction that is higher resolution than srcrefs would be
nice. Right now, it would be inconvenient to use srcrefs to get the
exact column range of a symbol, for example, but an IDE needs that to
highlight the symbol. Maybe looking at how other parsers represent
this information would be helpful.
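
As a rough illustration of the column-range point, here is a minimal
sketch of what the parse data table already provides for terminal
tokens. It assumes keep.source = TRUE is passed to parse() so the data
is retained outside an interactive session; the variable names are only
illustrative.

```r
# Sketch: exact column ranges of symbols, taken from the parse data table.
code <- "x + (y + 1)"
p <- parse(text = code, keep.source = TRUE)  # keep.source so parse data is kept
pd <- utils::getParseData(p)

# Keep only the SYMBOL tokens and the columns that locate them.
syms <- pd[pd$token == "SYMBOL", c("line1", "col1", "line2", "col2", "text")]
print(syms)
```

Each row gives the start and end position of one symbol, which is the
range an IDE would highlight.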

On Wed, Jul 29, 2015 at 12:15 PM,  <luke-tierney at uiowa.edu> wrote:
> Both codetools and compiler get by without this. codetools uses source
> refs to generate messages; I don't recall if compiler does but it
> could easily do so. I would be wary about committing to this sort of
> implementation-specific stuff -- we might want to go to completely
> different parser technology at some point, which would be harder if we
> committed to these sorts of details.
>
> Best,
>
> luke
>
> On Wed, 29 Jul 2015, Michael Lawrence wrote:
>
>> I have two use cases in mind:
>>
>> 1) Code indexing/searching, where the table gets me almost all of the
>> way there, except I ask for all of the text (including the calls) and
>> then parse that, because it's nice to get back an actual code object
>> when you are searching code (in addition to where the code lives). The
>> extra parsing step is just a minor inconvenience.
>>
>> 2) Code analysis, which I'm pretty sure is also Jim's use case, where
>> the analysis is implemented most easily as a parse tree traversal,
>> while you also want to point back to the original source location.
>> Here's where one would want a reference from parse node to location.
>>
>> So neither of those involves code evaluation at first glance, though I
>> guess one could use some sort of evaluation during analysis.
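
The reparsing step from use case 1 might look roughly like the sketch
below. It assumes includeText = TRUE is used so that non-terminal rows
carry their source text, and that the top-level expressions are the
"expr" rows with parent 0; the example code string is made up.

```r
# Sketch: recover code objects for top-level expressions by reparsing
# their text from the parse data table, keeping the location alongside.
code <- "x + (y + 1)\nlog(z)"
p <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(p, includeText = TRUE)

# Top-level expressions have parent 0 in the table.
top <- pd[pd$parent == 0 & pd$token == "expr", ]

# The extra parsing step: turn each row's text back into a language object.
objs <- lapply(top$text, function(txt) parse(text = txt, keep.source = FALSE)[[1]])

# Location of each expression in the original source.
locs <- top[, c("line1", "col1", "line2", "col2")]
```

An "expr" column in the table would make the lapply() step unnecessary.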
>>
>> On Wed, Jul 29, 2015 at 11:47 AM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>>
>>> On 29/07/2015 2:30 PM, Michael Lawrence wrote:
>>>>
>>>>
>>>> Probably need a generic tree based on "ParseNode" objects that
>>>> associate the line information with the symbol (for leaf nodes). As
>>>> Duncan notes, it should be possible to gather that from the table.
>>>>
>>>> But it would be nice if there was an "expr" column in the parse data
>>>> table in addition to "text". It would contain the parsed object.
>>>> Otherwise, to use the table, one is often reparsing the text, which
>>>> just seems redundant and inconvenient.
>>>
>>>
>>>
>>> Can you (both Jim and Michael) describe the uses you might have for this?
>>> There are lots of possible changes that could make this information
>>> available:
>>>
>>>  - attach to each item in the parse tree, as the parser package did.
>>> (Bad idea for general use which is why I dropped it, but it could be
>>> done as a special option to parse, if you aren't planning to evaluate
>>> the expression.)
>>>  - give the index into the parse tree of each item, i.e. c(1,1), c(1,2),
>>> c(1,3) in the example below, or just 1,2,3 along with a function to
>>> reconstruct the full path.
>>>  - give a copy of the branch of the parse tree, as Michael suggests.
>>>
>>> etc.  Which is best for your purposes?
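
For the second option, the index paths can be enumerated from the parse
tree side with a short recursive walk. This is only a sketch:
index_paths() is a hypothetical helper, not an existing function, and it
does not handle calls with missing arguments such as x[, 1].

```r
# Sketch: list every node of a parse tree with its index path, so that
# p[[path]] retrieves the node again.
index_paths <- function(x, path = integer()) {
  out <- list(list(path = path, code = x))
  if (is.call(x)) {
    # Recurse into the function position and each argument.
    for (i in seq_along(x)) {
      out <- c(out, index_paths(x[[i]], c(path, i)))
    }
  }
  out
}

p <- parse(text = "x + (y + 1)", keep.source = TRUE)
nodes <- index_paths(p[[1]], 1L)

# Show each path next to the deparsed code it points at.
for (n in nodes) {
  cat(paste(n$path, collapse = ","), ":", paste(deparse(n$code), collapse = ""), "\n")
}
```

What this does not give is the link from each path to an id in the parse
data table, which is the missing piece.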
>>>
>>> Duncan Murdoch
>>>
>>>>
>>>> Michael
>>>>
>>>> On Wed, Jul 29, 2015 at 9:43 AM, Duncan Murdoch
>>>> <murdoch.duncan at gmail.com> wrote:
>>>> > On 29/07/2015 12:13 PM, Jim Hester wrote:
>>>> >>
>>>> >> I would like to map the parsed tokens obtained from
>>>> >> utils::getParseData() to the parse tree and elements obtained by
>>>> >> base::parse().
>>>> >>
>>>> >> It looks like back when this code was in the parser package the
>>>> >> parse() function annotated the elements in the tree with their id,
>>>> >> which would allow you to perform this mapping.  However when the
>>>> >> code was included in R this functionality was removed.
>>>> >
>>>> >
>>>> > Yes, not all elements of the parse tree can legally have attributes
>>>> > attached.
>>>> >>
>>>> >>
>>>> >> ?getParseData states
>>>> >>    The ‘id’ values are not attached to the elements of the parse
>>>> >>            tree, they are only retained in the table returned by
>>>> >>            ‘getParseData’.
>>>> >>
>>>> >> Is there another way you can map between the getParseData() tokens
>>>> >> and elements of the parse tree that makes this additional
>>>> >> annotation unnecessary?  Or is this simply not possible?
>>>> >
>>>> >
>>>> > I think you can't get to it, though you can get close by looking at
>>>> > the id & parent values in the table.  For example,
>>>> >
>>>> >  code <- "x + (y + 1)"
>>>> >  p <- parse(text=code)
>>>> >
>>>> > getParseData(p)
>>>> >    line1 col1 line2 col2 id parent     token terminal text
>>>> > 15     1    1     1   11 15      0      expr    FALSE
>>>> > 1      1    1     1    1  1      3    SYMBOL     TRUE    x
>>>> > 3      1    1     1    1  3     15      expr    FALSE
>>>> > 2      1    3     1    3  2     15       '+'     TRUE    +
>>>> > 13     1    5     1   11 13     15      expr    FALSE
>>>> > 4      1    5     1    5  4     13       '('     TRUE    (
>>>> > 11     1    6     1   10 11     13      expr    FALSE
>>>> > 5      1    6     1    6  5      7    SYMBOL     TRUE    y
>>>> > 7      1    6     1    6  7     11      expr    FALSE
>>>> > 6      1    8     1    8  6     11       '+'     TRUE    +
>>>> > 8      1   10     1   10  8      9 NUM_CONST     TRUE    1
>>>> > 9      1   10     1   10  9     11      expr    FALSE
>>>> > 10     1   11     1   11 10     13       ')'     TRUE    )
>>>> >
>>>> >
>>>> > Now p is an expression, with the parse tree in p[[1]].  From the
>>>> > table, we can see that the root node has id 15, and 3 nodes have
>>>> > that as a parent.  Those would be p[[c(1,1)]], p[[c(1,2)]],
>>>> > p[[c(1,3)]].  The tricky part is the re-ordering:  those correspond
>>>> > to `+`, x, and (y+1) respectively, not the order they appear in the
>>>> > original source or in the table.  Generally the function call
>>>> > appears first in the parse tree, but I'm not sure you could
>>>> > always recognize which is the function call by looking at the table.
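
The re-ordering described here can be seen directly by putting the two
orderings side by side. A minimal sketch, assuming keep.source = TRUE so
the parse data is available; the variable names are only illustrative.

```r
# Sketch: children of the root in table (source) order versus parse tree order.
code <- "x + (y + 1)"
p <- parse(text = code, keep.source = TRUE)
pd <- utils::getParseData(p)

# The root node is the row whose parent is 0; its children point back at it.
root_id <- pd$id[pd$parent == 0]
children <- pd[pd$parent == root_id, c("id", "token", "col1", "col2")]
children <- children[order(children$col1), ]  # source order

# Parse tree order: the function comes first, then its arguments.
tree_order <- vapply(as.list(p[[1]]),
                     function(e) paste(deparse(e), collapse = ""),
                     character(1))

print(children$token)
print(tree_order)
```

In the table the operator sits between its operands, while in the tree
it comes first, so matching by position alone does not work.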
>>>> >
>>>> > Duncan Murdoch
>>>> >
>>>> > ______________________________________________
>>>> > R-devel at r-project.org mailing list
>>>> > https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>>
>>
>
>
> --
> Luke Tierney
> Ralph E. Wareham Professor of Mathematical Sciences
> University of Iowa                  Phone:             319-335-3386
> Department of Statistics and        Fax:               319-335-3017
>    Actuarial Science
> 241 Schaeffer Hall                  email:   luke-tierney at uiowa.edu
> Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu