[Rd] Suggestion: Custom filename patterns for non-Sweave vignettes

Henrik Bengtsson hb at biostat.ucsf.edu
Sat Feb 16 22:55:09 CET 2013


Hi,

As noted at the end, all comments below are meant with R 3.x.0 (x > 0) in mind.


On Fri, Feb 15, 2013 at 11:30 AM, Duncan Murdoch
<murdoch.duncan at gmail.com> wrote:
> On 13-02-15 1:53 PM, Henrik Bengtsson wrote:
>>
>> Hi Duncan,
>>
>> thank you for your prompt reply.
>>
>>
>> On Fri, Feb 15, 2013 at 1:15 AM, Duncan Murdoch
>> <murdoch.duncan at gmail.com> wrote:
>>>
>>> There are several reasons I decided against that:
>>>
>>>    - two packages may request overlapping patterns, making it much
>>> messier to do the matching, checking etc, since the matching would
>>> have to depend on the package being processed.
>>
>>
>> So, isn't that somewhat already taken care of by the 'VignetteBuilder'
>> field in DESCRIPTION?  It specifies additional builders in addition to
>> the default/builtin Sweave builder.
>
>
> No, it specifies additional packages besides utils. Packages may specify
> multiple engines.

I think we're on the same page here - by "builders" I meant "packages
that provide engines for building vignettes".

> For example, knitr can handle Sweave-like knitr
> vignettes, and markdown-based vignettes.  Yihui chose to use the same engine
> for both, but it might make more sense to specify different engines.

Just to add a tiny FYI related to this comment; RSP markup is
independent of the output format, so in that case it makes sense to
have a single engine regardless of output format.

>
> So a user might say they want a knitr vignette and a .html.rsp vignette.
> But perhaps in the meantime, Yihui added an engine that can handle .rsp
> files.  So the user would have to list both packages, and there would be an
> ambiguity as to which one should be run.  You might say that's the user's
> problem, but they wouldn't complain to themselves, they'd complain to me, so
> it's my problem.

As I understand it, the current rule is that R takes an .Rnw / .Rmd
file, scans its content for \VignetteEngine{<engine>} to infer the
vignette engine, and then applies that engine to the source file.  If
no \VignetteEngine{} is found, the default is to use Sweave (as
before).  The exact same strategy can be applied when supporting
custom filename patterns, with the default being to give an error (or
alternatively to silently skip the file) if no \VignetteEngine{} is
found (*).  This would remove any ambiguity between an R.rsp and a
knitr 'rsp' engine, just as it does for *.Rnw currently.

(*) Ideally, I'd like the default to be inferred from the file's
content type, which in turn could be guessed from the filename
extension and possibly some content-type markup (e.g.
\VignetteContentType{...}), but I'm willing to step back from that.
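
To make the rule above concrete, here is a minimal sketch of how the
engine could be inferred from the file content, with an error when no
declaration is found.  The function name inferVignetteEngine() and the
engine name "somePkg::rsp" are made up for illustration; this is not
how R implements it.

```r
# Hypothetical helper (not part of R): return the engine named in the
# first \VignetteEngine{...} declaration found in 'file'.
inferVignetteEngine <- function(file, default = NULL) {
  lines <- readLines(file, warn = FALSE)
  pattern <- "\\\\VignetteEngine\\{([^}]*)\\}"
  # Keep only the matched substrings, one per line that has a match
  hits <- regmatches(lines, regexpr(pattern, lines))
  if (length(hits) > 0L) {
    return(trimws(sub(pattern, "\\1", hits[1L])))
  }
  # No declaration: fall back to 'default', or fail if there is none
  if (is.null(default)) {
    stop("No \\VignetteEngine{} declaration found in ", file)
  }
  default
}

# A source file with a custom extension that declares its engine
rsp <- tempfile(fileext = ".html.rsp")
writeLines(c("%\\VignetteEngine{somePkg::rsp}", "<p>Hello</p>"), rsp)
engine <- inferVignetteEngine(rsp)

# An auxiliary file without a declaration is skipped
bib <- tempfile(fileext = ".bib")
writeLines("@article{key, title = {A title}}", bib)
res <- tryCatch(inferVignetteEngine(bib), error = function(e) "skipped")
```

With default = "utils::Sweave" the same helper would mimic the current
fallback for *.Rnw files.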


>
> It would be possible to design all of this to work:  the engine could check
> the file and say "oops, that's not my kind of .rsp file, try the next
> engine".  I just don't think it's worth it.  I certainly don't have time to
> design and program it or even to check your offered patch before feature
> freeze.  I can make small tweaks, but big changes that need lots of testing
> aren't going to happen.

I definitely hear you and I fully understand.


>
>
>
>> Conflicts would only happen if a
>> package developer (e.g. PkgA) includes a pattern that either (A)
>> overrides the builtin "[.][RrSs](nw|tex)$" / "[.]Rmd$" patterns, or
>> (B) specifies two builders with the same patterns.  First of all, there
>> are not that many builder packages, so this is something that could be
>> negotiated among those to minimize conflicts.  Second, case (A) can be
>> protected against by not allowing builder packages (e.g. knitr, rsp,
>> ...) to add/register those patterns (tricky but possible to test for)
>
>
> I don't think it's feasible to check for overlap in regular expression
> patterns.

Here I was only thinking of testing for overlaps with
"[.][RrSs](nw|tex)$" / "[.]Rmd$", which can be done as:

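illegalPattern <- function(pattern) {
  # Extensions reserved by the built-in patterns: Rnw, rnw, ..., stex, Rmd
  exts <- c(outer(c("R", "r", "S", "s"), c("nw", "tex"), FUN=paste, sep=""),
            "Rmd")
  files <- paste(".", exts, sep="")
  # TRUE if the custom pattern matches any of the reserved extensions
  any(regexpr(pattern, files) != -1L)
}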

But yes, checking for overlapping patterns in general would be very hard.
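
A weaker, sample-based check is easy to sketch, though: instead of
comparing two regular expressions directly, test both against a set of
candidate filenames.  The function name overlapOnSamples() and the
sample filenames are made up, and the check can only find overlaps
that happen to be among the samples, so it is a heuristic and not a
proof of disjointness.

```r
# Hypothetical helper: return the sample filenames matched by both patterns
overlapOnSamples <- function(patternA, patternB, samples) {
  samples[grepl(patternA, samples) & grepl(patternB, samples)]
}

samples <- c("intro.Rnw", "intro.Rmd", "intro.html.rsp",
             "intro.tex.rsp", "refs.bib")

# Two engines that both claim *.rsp files overlap on the samples
both <- overlapOnSamples("[.]rsp$", "[.](html|tex)[.]rsp$", samples)

# A *.rsp pattern and the built-in *.Rmd pattern do not
none <- overlapOnSamples("[.]rsp$", "[.]Rmd$", samples)
```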


>
>
>> (but only default to them if that is what they wish to use).  For case
>> (B), the developer of package PkgA has the power to avoid conflicts.
>> One could also imagine the ordering of packages listed in
>> 'VignetteBuilder' would provide a prioritization.
>
>
> Sure, but it would be confusing to get an error from knitr when you didn't
> know knitr was handling .rsp.

See above reply on \VignetteEngine{}.

>
>
>> BTW, case (A) is basically what the new design is already providing;
>> all builder packages use the same patterns.
>>
>> So, from a package building point of view, I don't see how this would
>> make it messier.  I can see that when checking a package it is harder
>> to validate matches between input and output formats (is that done?).
>> Let me know if I am simplifying things too much - then I'll read up on
>> the 'R CMD *' source code.
>>
>>>
>>>    - one package may request a pattern that another package uses for
>>> auxiliary files, e.g. .bib.  If a user has both types of vignette it
>>> would just be a mess.

Again, see the \VignetteEngine{} markup comment above.  That would
work as a protection and it can be completely ignored outside the R
vignette building mechanism.  If the *.bib file does not have a
\VignetteEngine{}, then it could be skipped.


>>
>>
>> I see your concern, but is there really a significant risk for this?
>
>
> If you look through CRAN, you'll see packages that do very weird things.  If
> it's legal, someone will try it.
>
>
>
>> And if it would occur, (i) it would be contained to PkgA, (ii) the
>> developer of package PkgA would quickly detect it, and (iii) the
>> "badly behaving" builder package would rather soon be flagged as doing
>> something bad (and its developer would be informed and so on).
>>
>>>
>>>    - the extension is also used to determine the output format.  We only
>>> support LaTeX (which will be converted to PDF) and HTML output.  It would
>>> be reasonable to support direct PDF output, but I don't think any other
>>> output formats should be supported.
>>
>>
>> Yes, supporting PDF output makes sense.  One may also consider
>> generation of plain *.txt files (think README.txt and similar).  As I
>> see it, the restrictions on supported *output* formats are given by
>> what the R help system wishes to support (which is basically *.pdf and
>> *.html documents).  It's clear that the decision on what to support is
>> up to the maintainer of the R system (i.e. R core).
>>
>> When it comes to input/source files for generating those output files,
>> it's harder to argue for restrictions.  As I understand it, the new
>> support for non-Sweave vignettes is moving away from such restriction,
>> which is great.  Despite the restrictions on file extension, it is
>> possible to "hijack" (my words) any of the supported extension for
>> whatever reason you want, as long as you produce a *.pdf or *.html
>> document in the end.  More below...
>
>
> The issue is that the supplier of a custom input extension would also need
> to specify what kind of output it produced, so R knows how to handle it.
> That makes it more complicated, harder to test, etc.

This led me to dig into the tools/R/Vignettes.R source code.  Couldn't
R instead check the extension of the output file, which could be
returned by the weave function (not sure if a weave function is
required to return anything, but it could be part of the design)?
This would be a more generic solution and also support the case where
an engine takes an *.Rnw file and either outputs a *.tex file or
decides to go all the way and generate a *.pdf.

I understand that this would require passing these output files
"dynamically" to tools:::.build_vignette_index().  See also comment
below.
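
A minimal sketch of that idea, assuming the weave function returns the
path of the file it produced.  Both outputTypeOf() and toyWeave() are
made-up names for illustration; neither exists in R, and toyWeave()
only computes a filename without weaving anything.

```r
# Hypothetical: classify a weave result by the extension of the returned file
outputTypeOf <- function(outfile) {
  ext <- tolower(tools::file_ext(outfile))
  switch(ext,
    tex  = "latex",  # still has to be compiled to PDF
    pdf  = "pdf",    # the engine went all the way
    html = "html",
    stop("Unsupported vignette output: ", outfile)
  )
}

# Toy stand-in for an engine's weave function: returns the output filename
toyWeave <- function(file, toPDF = FALSE) {
  base <- sub("[.][RrSs](nw|tex)$", "", file)
  paste0(base, if (toPDF) ".pdf" else ".tex")
}

viaTex <- outputTypeOf(toyWeave("report.Rnw"))
direct <- outputTypeOf(toyWeave("report.Rnw", toPDF = TRUE))
```

With this design R would not need to know up front which output type a
given input extension maps to.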


>
>>
>>>
>>> I understand that forcing you to use .Rmd instead of .html.rsp may look
>>> unsightly, but I think the extensions need to be fixed, not customizable.
>>
>>
>> I still find it unfortunate that the R system opens up for processing
>> any type of input files but enforces those to have certain filename
>> extensions.
>>
>> As a real example, today Sweave and knitr both use *.Rnw.  This means
>> that if I send someone a standalone *.Rnw file, they will not be able
>> to tell how to compile it without further instructions from me or by
>> inspecting the content type, or by trial and error.  I believe that
>> makes reproducible research a bit more tedious.  With unique filename
>> extensions, life is easier.  It's easy to imagine that if other
>> builder packages (e.g. R.rsp, brew, ...) also start using *.Rnw,
>> things are not going to become better.  The current "rules" are
>> pushing things in that direction.  To take an extreme stand, it's a
>> little bit like using *.txt for all your C, C++, Erlang, Fortran,
>> Simula, ... code, because at the end of the day they all compile to
>> binaries anyway.
>
>
> I agree to some extent, but if sending someone an Rnw causes problems, then
> don't do that.  Rename it before you send it.  Or rename it to Rnw when you
> put it in the vignettes directory.

I'm not worried about my own behavior; I'm worried about third parties
that I don't know of.  The only way for me (the R.rsp maintainer) to
protect against this (and avoid getting support emails) is to
silently/secretly have the RSP engine support *.Rnw/*.Rmd extensions
as well.  Not ideal, but I can live with it.


>
>
>>
>> One may argue that the Rnw/Rtex/Rmd extensions only apply to the R
>> package vignettes and you can still use other extensions when you work
>> with standalone vignette source files.  That's of course also
>> unfortunate, because that will add additional confusion, e.g "You can
>> find the vignette in my package, but by the way you should really
>> rename it because ...".  The exact same source file will have
>> different extensions depending on context.  (In my own case, I found
>> *.tex.rsp, *.html.rsp, *.md.rsp, *.Rnw.rsp, ... to be much less
>> ambiguous and I prefer not to introduce ambiguity in mapping those to
>> *.Rnw/*.Rtex/*.Rmd.)
>
>
> You could map them to .tex.Rnw, .html.Rmd, .Rnw.Rnw, and your engine could
> do what it does now with *.rsp files.

When looking into this idea of basically using *.Rnw and *.Rmd (or
possibly .tex.rsp.Rnw, .html.rsp.Rmd, .Rnw.rsp.Rnw) as "triggers", I
had a look at the tools/R/Vignettes.R source code and discovered:

vignette_output <- function(filenames) {
   outfiles <- sub("\\.[RrSs](nw|tex)$", ".pdf", filenames)
   sub("\\.Rmd$", ".html", outfiles)
}

which gives:

> files <- c(".tex.Rnw", ".html.Rmd", ".Rnw.Rnw")
> vignette_output(files)
[1] ".tex.pdf"   ".html.html" ".Rnw.pdf"

> files <- c(".tex.rsp.Rnw", ".html.rsp.Rmd", ".Rnw.rsp.Rnw")
> vignette_output(files)
[1] ".tex.rsp.pdf"   ".html.rsp.html" ".Rnw.rsp.pdf"

Next, looking at tools:::.build_vignette_index(), these are also the
filenames (if they exist, otherwise empty) that will be listed as PDF
(or HTML) output in the vignette index.

Unfortunately, R.rsp operates by taking an *.<ext>.rsp file and
dropping the last filename extension to arrive at a *.<ext> file.
(This can actually be done recursively, and there are options to run
postprocessors that continue processing the output file if its
content/extension is recognized, e.g. *.tex.rsp will generate *.tex,
which will be compiled to a PDF, and *.Rnw.rsp will generate *.Rnw,
which will be passed to Sweave/knitr to generate *.tex, which will be
compiled to a PDF.)  This means that the RSP engine would need to
preserve those "intermediate" filename extensions to fit the R
vignette setup.  Again, I could probably find ad hoc workarounds for
this too.
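
The mismatch can be sketched as follows.  vignette_output() is copied
from above; rspFinalOutput() is a made-up, much simplified model of
the RSP naming scheme (drop ".rsp", then treat *.tex as compiled to
*.pdf) and not the actual R.rsp code.

```r
# Copied from tools/R/Vignettes.R as quoted above
vignette_output <- function(filenames) {
  outfiles <- sub("\\.[RrSs](nw|tex)$", ".pdf", filenames)
  sub("\\.Rmd$", ".html", outfiles)
}

# Hypothetical model of the RSP naming: *.<ext>.rsp -> *.<ext>, *.tex -> *.pdf
rspFinalOutput <- function(filename) {
  out <- sub("[.]rsp$", "", filename)
  sub("[.]tex$", ".pdf", out)
}

trigger <- "report.tex.rsp.Rnw"

# What R expects to find in the vignette index
expected <- vignette_output(trigger)

# What the RSP naming scheme would naturally produce
natural <- rspFinalOutput(sub("[.]Rnw$", "", trigger))

identical(expected, natural)  # FALSE
```

Here 'expected' is "report.tex.rsp.pdf" whereas 'natural' is
"report.pdf", so the engine would have to rename its output.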


BTW, I don't think the requirement that the input and output files for
vignettes should have matching filenames after dropping the filename
extension is explicitly documented.  ?tools::vignetteEngine could be
read that way, but only if you already know about the requirement.
(Also, deep down ?RweaveLatex mentions a default behavior, but that's
not the same as the requirement.)  If this requirement is intended,
then I would suggest clarifying ?tools::vignetteEngine from:

<quote>If the filename being processed has one of the Sweave
extensions (i.e. matching the regular expression ".[RrSs](nw|tex)$"),
the weave function should produce a ‘.tex’ file in the same
directory.</quote>

to:

<update>If the filename being processed has one of the Sweave
extensions (i.e. matching the regular expression ".[RrSs](nw|tex)$"),
the weave function should produce a file in the same directory with
the filename extension replaced by ".tex".</update>


While on the subject of ?tools::vignetteEngine: its 'Description' is
not clear on whether it is the packages with vignettes (e.g. PkgA) or
the builder packages (e.g. knitr, R.rsp, ...) that should call this
function:

<quote>Vignettes are normally processed by Sweave, but package writers
may choose to use a different engine (e.g. knitr or noweb). This
function is used by those packages to register their engines, and
internally by R to retrieve them.</quote>

The problematic word is "those".  Maybe the following is better:

<update>Vignettes are normally processed by Sweave, but package
writers may choose to use a different engine (e.g. knitr or noweb).
Packages (e.g. knitr) that provide vignette engines should register
those engines in their .onLoad() function so that R can retrieve them
internally.  [See Section 'Non-Sweave vignettes' in 'Writing R
Extensions' for further details.]</update>
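
For illustration, here is roughly what such a registration could look
like in a builder package.  This is only a sketch: the engine name
"demo", the package name "PkgEngine" and the bodies of the weave and
tangle functions are made up, and I am assuming that
tools::vignetteEngine() accepts the 'name', 'weave', 'tangle' and
'package' arguments.

```r
# Sketch of a builder package's .onLoad(); the engine does no real work
.onLoad <- function(libname, pkgname) {
  tools::vignetteEngine(
    name = "demo",
    # Placeholder weave: copy the source to a .tex file next to it
    weave = function(file, ...) {
      out <- sub("[.][RrSs](nw|tex)$", ".tex", file)
      file.copy(file, out, overwrite = TRUE)
      out
    },
    # Placeholder tangle: create an empty .R file next to the source
    tangle = function(file, ...) {
      out <- sub("[.][RrSs](nw|tex)$", ".R", file)
      file.create(out)
      out
    },
    package = pkgname
  )
}

# Outside a real package, the hook can be invoked by hand to try it out
.onLoad(NULL, "PkgEngine")
eng <- tools::vignetteEngine("demo", package = "PkgEngine")
```

A package with vignettes (e.g. PkgA) would not call this function at
all; it would only list the builder package in 'VignetteBuilder'.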



>
>>
>> Finally, the supported extensions are basically *.Rnw, *.Rtex and
>> *.Rmd.  To break those down, "*nw" originates from 'Noweb'
>> [http://wikipedia.org/wiki/Noweb], "*tex" from TeX
>> [http://wikipedia.org/wiki/latex] and "*md" from Markdown
>> [wikipedia.org/wiki/Markdown].  The "R*" part indicates that there is
>> some additional markup format to those file formats.  But in the end
>> of the day, they indicate that the source files should be
>> markup-embedded files containing some flavor of Noweb, TeX or
>> Markdown.  I find it weird to use those also for, say, formats such as
>> HTML, reStructuredText, AsciiDoc, MediaWiki, Org-Mode etc.
>
>
> That's the etymology, not the meaning.

Mmmkay... if so, then the following passage should be corrected in WRE
Section 'Non-Sweave vignettes':

<quote>R recognizes non-Sweave vignettes using the same extensions as
for Sweave vignettes; in addition, the extension .Rmd (standing for “R
markdown”) is supported.</quote>


>
>>
>>
>> To summarize, I really appreciate the move to a built-in support for
>> non-Sweave vignettes (without using custom Makefiles), but I find that
>> the supported filename extensions have not been brought along in this
>> move.
>>
>
> There's always time to argue for R 3.1.0.

I'm glad to see those words.  I understand the R 3.0.0 deadline, so
consider what I have said thus far as arguments for R 3.x.0 (x > 0).


Thanks again for your replies.  They do clarify your design
strategy and constraints, which helps me going forward.

Henrik


>
> Duncan Murdoch
>


