The parsermd package parses R Markdown and Quarto documents into an Abstract Syntax Tree (AST) representation. This vignette introduces the different types of AST nodes and their properties, helping you understand how parsermd represents document structure.
rmd_ast
The rmd_ast object serves as the container for all parsed document nodes. It holds a linear sequence of nodes representing different document elements, where each node type corresponds to a specific R Markdown or Quarto construct (headings, code chunks, text, etc.).
Important: The AST represents documents as a linear sequence of nodes, not a nested tree structure. This means that structural elements like fenced divs are represented as separate opening and closing nodes in the sequence, rather than as nodes with children.
The default print method for rmd_ast objects (flat = FALSE) presents an implicit tree structure based on heading levels. This provides a hierarchical view that reflects the document’s logical organization, where content is grouped under headings based on their level.
Properties:
nodes: A list containing all the parsed nodes in document order
Example:
Raw text that would be parsed:
---
title: "Example Document"
---
# Introduction
This is some text.
```{r}
x <- 1:5
mean(x)
```
This would create an rmd_ast object containing:
rmd_yaml node with the title
rmd_heading node with “Introduction”
rmd_markdown node with “This is some text.”
rmd_chunk node with the R code
Programmatic creation:
ast = rmd_ast(list(
  rmd_yaml(list(title = "Example Document")),
  rmd_heading(name = "Introduction", level = 1L),
  rmd_markdown(lines = "This is some text."),
  rmd_chunk(
    engine = "r",
    code = c("x <- 1:5", "mean(x)")
  )
))
Hierarchical view (flat = FALSE):
print(ast, flat = FALSE)
#> ├── YAML [1 field]
#> └── Heading [h1] - Introduction
#>     ├── Markdown [1 line]
#>     └── Chunk [r, 2 lines] -
Linear view (flat = TRUE):
print(ast, flat = TRUE)
#> ├── YAML [1 field]
#> ├── Heading [h1] - Introduction
#> ├── Markdown [1 line]
#> └── Chunk [r, 2 lines] -
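Because the AST is a flat sequence, the nodes property can be walked like any other R list. The following is a minimal sketch, assuming parsermd is installed and that nodes behaves as an ordinary list; the exact class names printed depend on how the package registers its S7 classes, so no output is shown.

```r
library(parsermd)

# Rebuild the example AST from the constructors shown above
ast = rmd_ast(list(
  rmd_yaml(list(title = "Example Document")),
  rmd_heading(name = "Introduction", level = 1L),
  rmd_markdown(lines = "This is some text."),
  rmd_chunk(
    engine = "r",
    code = c("x <- 1:5", "mean(x)")
  )
))

# Number of top-level nodes in the linear sequence
n_nodes = length(ast@nodes)

# First class name of each node, in document order
node_classes = vapply(ast@nodes, function(node) class(node)[1], character(1))
node_classes
```

The heading-based tree in the default print method is only a view; the underlying storage stays a single list.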
parsermd uses the S7 object system for all AST node types, which provides a modern, robust class system.
Key S7 Features in parsermd:
AST node types inherit from the rmd_node class
Properties are accessed with the @ syntax (e.g., node@content)
Property Access:
# Create a heading node
heading = rmd_heading(name = "Section Title", level = 2L)

# Access properties with @
heading@name
#> [1] "Section Title"
heading@level
#> [1] 2
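The @ operator also works on the left-hand side of an assignment for S7 objects in general. The sketch below assumes parsermd leaves these properties writable and that the new values pass whatever validation the class defines; treat it as an illustration rather than documented behavior.

```r
library(parsermd)

heading = rmd_heading(name = "Section Title", level = 2L)

# Assumption: properties can be reassigned with `@<-`, as with ordinary S7 objects
heading@level = 3L
heading@name = "Renamed Section"

heading@level
heading@name
```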
rmd_yaml
The rmd_yaml node represents YAML front matter at the beginning of documents.
Properties:
yaml: List containing the parsed YAML content
Example:
Raw text that would be parsed:
---
title: "My Document"
author: "John Doe"
date: "2023-01-01"
---
Programmatic creation:
yaml_node = rmd_yaml(list(
  title = "My Document",
  author = "John Doe",
  date = "2023-01-01"
))
yaml_node
#> <rmd_yaml>
#> @ yaml:List of 3
#> .. $ title : chr "My Document"
#> .. $ author: chr "John Doe"
#> .. $ date : chr "2023-01-01"
rmd_heading
The rmd_heading node represents section headings in markdown.
Properties:
name: Character string containing the heading text
level: Integer from 1-6 indicating the heading level (# = 1, ## = 2, etc.)
Example:
Raw text that would be parsed:
# Introduction
Programmatic creation:
heading_node = rmd_heading(
  name = "Introduction",
  level = 1L
)
heading_node
#> <rmd_heading>
#> @ name : chr "Introduction"
#> @ level: int 1
rmd_markdown
The rmd_markdown node represents plain markdown text content.
Properties:
lines: Character vector containing the markdown text lines
Example:
Raw text that would be parsed:
This is a paragraph.
With multiple lines.
Programmatic creation:
markdown_node = rmd_markdown(
  lines = c("This is a paragraph.", "With multiple lines.")
)
markdown_node
#> <rmd_markdown>
#> @ lines: chr [1:2] "This is a paragraph." "With multiple lines."
rmd_chunk
The rmd_chunk node represents executable code chunks with options and metadata.
Properties:
engine: The code engine (default: “r”)
label: Optional chunk name/label
options: List of chunk options containing both traditional and YAML options
code: Character vector containing the code lines
indent: Indentation string
n_ticks: Number of backticks used (default: 3)
Chunk Option Formats:
Chunks support two option formats that can be used independently or together:
Traditional format: Options specified in the chunk header after the engine and label, for example {r chunk-label, eval=TRUE, echo=FALSE}
YAML format: Options specified as YAML comments within the chunk
```{r chunk-label}
#| eval: true
#| echo: false
```
Option Conflict Resolution:
When the same option is specified in both formats, YAML options take precedence over traditional options. A warning is emitted when conflicts occur:
{r eval=TRUE}
#| eval: false
In this case, eval: false (YAML) wins over eval=TRUE (traditional), and the parser emits: “YAML options override traditional options for: eval”
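The precedence rule can be sketched in base R. The helper merge_chunk_options() below is hypothetical and is not part of parsermd; it only illustrates "YAML wins, with a warning on conflicts" and makes no claim about how the parser implements it internally.

```r
# Hypothetical helper for illustration only; not a parsermd function
merge_chunk_options = function(traditional, yaml) {
  # Option names that appear in both formats
  conflicts = intersect(names(traditional), names(yaml))
  if (length(conflicts) > 0) {
    warning(
      "YAML options override traditional options for: ",
      paste(conflicts, collapse = ", "),
      call. = FALSE
    )
  }
  # modifyList() overwrites same-named entries of `traditional` with those from `yaml`
  modifyList(traditional, yaml)
}

# Conflict on `eval`; the warning is suppressed here to keep the output quiet
merged = suppressWarnings(
  merge_chunk_options(
    traditional = list(eval = "TRUE"),
    yaml = list(eval = FALSE, message = FALSE)
  )
)
str(merged)
```

In the merged list, eval carries the YAML value (a logical FALSE) and message is added alongside it.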
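Type Handling:
Traditional options: stored as character strings (e.g., "TRUE", "5")
YAML options: stored with their proper types (e.g., TRUE, 5L, 3.14)
Examples: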
Traditional format chunk:
```{r example, eval=TRUE, echo=FALSE}
x <- 1:10
mean(x)
```
YAML format chunk:
```{r example}
#| eval: true
#| echo: false
x <- 1:10
mean(x)
```
Mixed format chunk (with conflict):
```{r example, eval=TRUE}
#| eval: false
#| message: false
x <- 1:10
mean(x)
```
In this case, eval: false (YAML) overrides eval=TRUE (traditional).
Programmatic creation:
# Traditional-style options
chunk_node_traditional = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = "TRUE", echo = "FALSE"),
  code = c("x <- 1:10", "mean(x)")
)

# YAML-style options with proper types
chunk_node_yaml = rmd_chunk(
  engine = "r",
  label = "example",
  options = list(eval = TRUE, echo = FALSE),
  code = c("x <- 1:10", "mean(x)")
)
chunk_node_yaml
#> <rmd_chunk>
#> @ engine : chr "r"
#> @ label : chr "example"
#> @ options:List of 2
#> .. $ eval: logi TRUE
#> .. $ echo: logi FALSE
#> @ code : chr [1:2] "x <- 1:10" "mean(x)"
#> @ indent : chr ""
#> @ n_ticks: int 3
rmd_raw_chunk
The rmd_raw_chunk node represents raw output chunks for specific formats.
Properties:
format: The output format (e.g., “html”, “latex”)
code: Character vector containing the raw content
indent: Indentation string
n_ticks: Number of backticks used
Example:
Raw text that would be parsed:
```{=html}
<div class='custom'>
<p>Custom HTML content</p>
</div>
```
Programmatic creation:
raw_chunk_node = rmd_raw_chunk(
  format = "html",
  code = c(
    "<div class='custom'>",
    " <p>Custom HTML content</p>",
    "</div>"
  )
)
raw_chunk_node
#> <rmd_raw_chunk>
#> @ format : chr "html"
#> @ code : chr [1:3] "<div class='custom'>" " <p>Custom HTML content</p>" ...
#> @ indent : chr ""
#> @ n_ticks: int 3
rmd_code_block
The rmd_code_block node represents non-executable fenced code blocks.
Properties:
id: Optional HTML ID attribute
classes: Character vector of CSS classes (e.g., language names like “python”, “r”)
attr: Named character vector for key-value attributes (e.g., c(style="color:blue"))
code: Character vector containing the code lines
indent: Indentation string
n_ticks: Number of backticks used
Example:
Raw text that would be parsed:
```python
def hello():
    print('Hello, World!')
```
Programmatic creation:
code_block_node = rmd_code_block(
  classes = c("python"),
  code = c(
    "def hello():",
    " print('Hello, World!')"
  )
)
code_block_node
#> <rmd_code_block>
#> @ id : chr(0)
#> @ classes: chr "python"
#> @ attr : chr(0)
#> @ code : chr [1:2] "def hello():" " print('Hello, World!')"
#> @ indent : chr ""
#> @ n_ticks: int 3
rmd_code_block_literal
The rmd_code_block_literal node represents code blocks with literal attribute capture using the {{...}} syntax. This format preserves the raw attribute content exactly as written, making it ideal for displaying code chunk examples.
Properties:
attr: Raw attribute string (exactly as written between {{ and }})
code: Character vector containing the code lines
indent: Indentation string
n_ticks: Number of backticks used
Example:
Raw text that would be parsed:
{{r, echo=TRUE, eval=FALSE}}
x <- 1:10
mean(x)
Programmatic creation:
code_block_literal_node = rmd_code_block_literal(
  attr = "r, echo=TRUE, eval=FALSE",
  code = c(
    "x <- 1:10",
    "mean(x)"
  )
)
code_block_literal_node
#> <rmd_code_block_literal>
#> @ attr : chr "r, echo=TRUE, eval=FALSE"
#> @ code : chr [1:2] "x <- 1:10" "mean(x)"
#> @ indent : chr ""
#> @ n_ticks: int 3
Nested Braces Support:
The literal format can handle nested braces in attributes: {{r, code='function() { return(1) }'}}
This captures the attribute as: "r, code='function() { return(1) }'"
rmd_fenced_div_open & rmd_fenced_div_close
Fenced divs are represented as pairs of nodes in the linear AST sequence. The rmd_fenced_div_open node marks the beginning of a fenced div block, and the rmd_fenced_div_close node marks the end. Any content between these nodes is considered to be inside the fenced div.
rmd_fenced_div_open Properties:
id: Optional HTML ID attribute
classes: Character vector of CSS classes
attr: Named character vector for key-value attributes
rmd_fenced_div_close Properties: None (just a marker)
Example:
Raw text that would be parsed:
::: {.warning #important}
This content is inside the fenced div.
More content here.
:::
This would create a sequence of nodes:
1. rmd_fenced_div_open with attributes
2. rmd_markdown with “This content is inside the fenced div.”
3. rmd_markdown with “More content here.”
4. rmd_fenced_div_close
Programmatic creation:
# Create the opening node
fenced_div_open_node = rmd_fenced_div_open(
  classes = c(".warning"),
  attr = c(id = "important")
)

# Create the closing node
fenced_div_close_node = rmd_fenced_div_close()

# These would typically be combined with content nodes in an rmd_ast
ast_with_div = rmd_ast(list(
  fenced_div_open_node,
  rmd_markdown(
    lines = "This content is inside the fenced div."
  ),
  rmd_markdown(
    lines = "More content here."
  ),
  fenced_div_close_node
))
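Since open and close markers are siblings in the flat sequence, working out which nodes sit inside a div means tracking nesting depth while walking the list. The sketch below is one way to do that. It assumes that rmd_fenced_div_open and rmd_fenced_div_close are exported S7 class objects that S7::S7_inherits() accepts, which this vignette does not state explicitly.

```r
library(parsermd)

ast_with_div = rmd_ast(list(
  rmd_fenced_div_open(classes = c(".warning"), attr = c(id = "important")),
  rmd_markdown(lines = "This content is inside the fenced div."),
  rmd_markdown(lines = "More content here."),
  rmd_fenced_div_close()
))

# Walk the linear sequence and record the div nesting depth after each node
depth = 0L
depths = integer(length(ast_with_div@nodes))
for (i in seq_along(ast_with_div@nodes)) {
  node = ast_with_div@nodes[[i]]
  # Assumption: the constructors double as S7 class objects
  if (S7::S7_inherits(node, rmd_fenced_div_open)) depth = depth + 1L
  if (S7::S7_inherits(node, rmd_fenced_div_close)) depth = depth - 1L
  depths[i] = depth
}
depths
```

A depth greater than zero marks a node as being inside at least one open div; the closing marker brings the count back down.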
The following classes represent elements that can be extracted from AST nodes through secondary parsing, rather than being direct nodes in the AST structure. These elements are found within markdown text and code content.
rmd_inline_code
The rmd_inline_code class represents inline code expressions extracted from markdown text.
Properties:
engine: The code engine (empty string for static code)
code: The inline code content
braced: Whether the code uses braced syntax
start: Starting position in the source text
length: Length of the inline code
Example:
Raw text containing inline code:
The result is `r 2 + 2`.
Programmatic creation:
# Create directly
inline_code_obj = rmd_inline_code(
  engine = "r",
  code = "2 + 2",
  braced = FALSE
)
inline_code_obj
#> rmd_inline_code[-1,-1] `r 2 + 2`
rmd_shortcode
The rmd_shortcode class represents Quarto shortcode function calls extracted from markdown content.
Properties:
func: The shortcode function name
args: Character vector of arguments
start: Starting position in the source text
length: Length of the shortcode
Example:
Raw text containing a shortcode:
{{< embed type=video src=example.mp4 >}}
Programmatic creation:
# Create directly
shortcode_obj = rmd_shortcode(
  func = "embed",
  args = c("type=video", "src=example.mp4")
)
shortcode_obj
#> rmd_shortcode[-1,-1] {{< embed type=video src=example.mp4 >}}
rmd_span
The rmd_span class represents inline span elements with attributes extracted from markdown text.
Properties:
text: The text content of the span
id: Optional HTML ID (must start with ‘#’ if present)
classes: Character vector of CSS classes (must start with ‘.’ if present)
attr: Named character vector of additional attributes
Example:
Raw text containing a span:
[Important text]{.highlight #key}
Programmatic creation:
# Create directly
span_obj = rmd_span(
  text = "Important text",
  id = c("#key"),
  classes = c(".highlight")
)
span_obj
#> rmd_span [Important text]{#key .highlight}
These utility functions extract the above elements from AST nodes:
rmd_extract_inline_code(): Extract inline code from text
rmd_extract_shortcodes(): Extract shortcodes from text
rmd_extract_spans(): Extract spans from text
rmd_has_inline_code(): Check if text contains inline code
rmd_has_shortcode(): Check if text contains shortcodes
rmd_has_span(): Check if text contains spans