[Rd] Why does the lexical analyzer drop comments?

Romain Francois romain.francois at dbmail.com
Tue Mar 31 14:26:27 CEST 2009


Hi,

Thank you for this (inspired) trick. I am currently extracting the parser 
from R (i.e. the gram.y file) and building a custom parser that uses the 
same grammar but structures the output differently, in a form more 
suitable for what the syntax highlighter will need.

You will find the project here: 
http://r-forge.r-project.org/projects/highlight/
Feel free to "request to join" the project if you feel you can make 
useful contributions.

At the moment, I am concentrating my efforts deep down in the parser 
code, but there are other challenges:
- once the expressions are parsed, we will need something that 
investigates them for evidence about function calls, to get an idea of 
where each function is defined (by the user, in a package, ...). This is 
tricky, and unless the code is actually evaluated, some errors will be 
made.
- once the evidence is collected, other functions (renderers) will have 
the task of rendering it using HTML, LaTeX, RTF, ANSI escape codes, ... 
The idea here is to design the system so that other packages can 
implement custom renderers to format the evidence in their own markup 
language.
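
To make the first challenge concrete, here is a rough sketch of the kind 
of heuristic I have in mind. Everything below is illustrative only: the 
function names (function_origin, render_evidence) are hypothetical and 
not part of highlight.

```r
# Hypothetical sketch, not highlight code.
# Guess where a function named 'name' is defined by searching the
# search path; this is exactly the kind of heuristic that can be
# wrong when the code is not actually evaluated.
function_origin <- function(name) {
  hits <- utils::find(name)  # e.g. "package:stats" or ".GlobalEnv"
  if (length(hits) == 0) "unknown" else hits[1]
}

# A hypothetical S3 generic for the renderer layer: each markup
# language (html, latex, rtf, ansi, ...) would supply a method.
render_evidence <- function(evidence, renderer, ...) {
  UseMethod("render_evidence", renderer)
}
```

In a fresh session, function_origin("rnorm") would usually report 
"package:stats", while a function the user has just defined would report 
".GlobalEnv"; a name defined in a not-yet-evaluated block would come 
back "unknown", which is the error mode mentioned above.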

Romain

Yihui Xie wrote:
> Hi Romain,
>
> I've been thinking for quite a long time on how to keep comments when
> parsing R code and finally got a trick with inspiration from one of my
> friends, i.e. to mask the comments in special assignments to "cheat" R
> parser
>
> # keep.comment: whether to keep the comments or not
> # keep.blank.line: preserve blank lines or not?
> # begin.comment and end.comment: special identifiers that mark the original
> #     comments as 'begin.comment = "#[ comments ]end.comment"'
> #     and these marks will be removed after the modified code is parsed
> tidy.source <- function(source = "clipboard", keep.comment = TRUE,
>     keep.blank.line = FALSE, begin.comment, end.comment, ...) {
>     # parse and deparse the code
>     tidy.block = function(block.text) {
>         exprs = parse(text = block.text)
>         n = length(exprs)
>         res = character(n)
>         for (i in 1:n) {
>             dep = paste(deparse(exprs[i]), collapse = "\n")
>             res[i] = substring(dep, 12, nchar(dep) - 1)
>         }
>         return(res)
>     }
>     text.lines = readLines(source, warn = FALSE)
>     if (keep.comment) {
>         # identifier for comments
>         identifier = function() paste(sample(LETTERS), collapse = "")
>         if (missing(begin.comment))
>             begin.comment = identifier()
>         if (missing(end.comment))
>             end.comment = identifier()
>         # remove leading and trailing white spaces
>         text.lines = gsub("^[[:space:]]+|[[:space:]]+$", "",
>             text.lines)
>         # make sure the identifiers are not in the code
>         # or the original code might be modified
>         while (length(grep(sprintf("%s|%s", begin.comment, end.comment),
>             text.lines))) {
>             begin.comment = identifier()
>             end.comment = identifier()
>         }
>         head.comment = substring(text.lines, 1, 1) == "#"
>         # add identifiers to comment lines to cheat R parser
>         if (any(head.comment)) {
>             text.lines[head.comment] = gsub("\"", "\'",
>                 text.lines[head.comment])
>             text.lines[head.comment] = sprintf("%s=\"%s%s\"",
>                 begin.comment, text.lines[head.comment], end.comment)
>         }
>         # keep blank lines?
>         blank.line = text.lines == ""
>         if (any(blank.line) & keep.blank.line)
>             text.lines[blank.line] = sprintf("%s=\"%s\"", begin.comment,
>                 end.comment)
>         text.tidy = tidy.block(text.lines)
>         # remove the identifiers
>         text.tidy = gsub(sprintf("%s = \"|%s\"", begin.comment,
>             end.comment), "", text.tidy)
>     }
>     else {
>         text.tidy = tidy.block(text.lines)
>     }
>     cat(paste(text.tidy, collapse = "\n"), "\n", ...)
>     invisible(text.tidy)
> }
>
> The above function can deal with comments that occupy whole lines, e.g.
>
> f = tempfile()
> writeLines('
>   # rotation of the word "Animation"
> # in a loop; change the angle and color
> # step by step
> for (i in 1:360) {
> # redraw the plot again and again
> plot(1,ann=FALSE,type="n",axes=FALSE)
> # rotate; use rainbow() colors
> text(1,1,"Animation",srt=i,col=rainbow(360)[i],cex=7*i/360)
> # pause for a while
> Sys.sleep(0.01)}
> ', f)
>
> Then parse the code file 'f':
>
>> tidy.source(f)
> # rotation of the word 'Animation'
> # in a loop; change the angle and color
> # step by step
> for (i in 1:360) {
>     # redraw the plot again and again
>     plot(1, ann = FALSE, type = "n", axes = FALSE)
>     # rotate; use rainbow() colors
>     text(1, 1, "Animation", srt = i, col = rainbow(360)[i], cex = 7 *
>         i/360)
>     # pause for a while
>     Sys.sleep(0.01)
> }
>
> Of course this function has some limitations: it does not support
> inline comments or comments inside incomplete code lines.
> Peter's example
>
> f #here
> ( #here
> a #here (possibly)
> = #here
> 1 #this one belongs to the argument, though
> ) #but here as well
>
> will be parsed as
>
> f
> (a = 1)
>
> I'm quite interested in syntax highlighting of R code and saw your
> previous discussions in other posts (with Jose Quesada, etc.). I'd
> like to do something for your package if I can be of any help.
>
> Regards,
> Yihui
> --
> Yihui Xie <xieyihui at gmail.com>
> Phone: +86-(0)10-82509086 Fax: +86-(0)10-82509086
> Mobile: +86-15810805877
> Homepage: http://www.yihui.name
> School of Statistics, Room 1037, Mingde Main Building,
> Renmin University of China, Beijing, 100872, China
>
>
>
> 2009/3/21  <romain.francois at dbmail.com>:
>   
>> It happens in the token function in gram.c:
>>
>>     c = SkipSpace();
>>     if (c == '#') c = SkipComment();
>>
>> and then SkipComment goes like that:
>>
>> static int SkipComment(void)
>> {
>>     int c;
>>     while ((c = xxgetc()) != '\n' && c != R_EOF) ;
>>     if (c == R_EOF) EndOfFile = 2;
>>     return c;
>> }
>>
>> which effectively drops comments.
>>
>> Would it be possible to keep the information somewhere?
>>
>> The source code says this:
>>
>>  *  The function yylex() scans the input, breaking it into
>>  *  tokens which are then passed to the parser.  The lexical
>>  *  analyser maintains a symbol table (in a very messy fashion).
>>
>> so my question is: could we use this symbol table to keep track of, say, COMMENT tokens?
>>
>> Why would I even care about that? I'm writing a package that will
>> perform syntax highlighting of R source code based on the output of the
>> parser, and it seems a waste to drop the comments.
>>
>> And also, when you print a function to the R console, you don't get the comments, and some of them might be useful to the user.
>>
>> Am I mad if I contemplate looking into this?
>>
>> Romain
>>
>> --
>> Romain Francois
>> Independent R Consultant
>> +33(0) 6 28 91 30 30
>> http://romainfrancois.blog.free.fr
>>


-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr


