[Rd] gettext(msgid, domain="R") doesn't work for some 'msgid':s

Martin Maechler m@ech|er @end|ng |rom @t@t@m@th@ethz@ch
Fri Nov 5 18:23:56 CET 2021

>>>>> Martin Maechler 
>>>>>     on Fri, 5 Nov 2021 17:55:24 +0100 writes:

>>>>> Tomas Kalibera 
>>>>>     on Fri, 5 Nov 2021 16:15:19 +0100 writes:

    >> On 11/5/21 4:12 PM, Duncan Murdoch wrote:
    >>> On 05/11/2021 10:51 a.m., Henrik Bengtsson wrote:
    >>>> I'm trying to reuse some of the translations available in base R by 
    >>>> using:
    >>>>    gettext(msgid, domain="R")
    >>>> This works great for most 'msgid's, e.g.
    >>>> $ LANGUAGE=de Rscript -e 'gettext("cannot get working directory", 
    >>>> domain="R")'
    >>>> [1] "kann das Arbeitsverzeichnis nicht ermitteln"
    >>>> However, it does not work for all.  For instance,
    >>>> $ LANGUAGE=de Rscript -e 'gettext("Execution halted\n", domain="R")'
    >>>> [1] "Execution halted\n"
    >>>> This despite that 'msgid' existing in:
    >>>> $ grep -C 2 -F 'Execution halted\n' src/library/base/po/de.po
    >>>> #: src/main/main.c:342
    >>>> msgid "Execution halted\n"
    >>>> msgstr "Ausführung angehalten\n"
    >>>> It could be that the trailing newline causes problems, because the
    >>>> same happens also for:
    >>>> $ LANGUAGE=de Rscript --vanilla -e 'gettext("error during cleanup\n",
    >>>> domain="R")'
    >>>> [1] "error during cleanup\n"
    >>>> Is this meant to work, and if so, how do I get it to work, or is it a 
    >>>> bug?
    >>> I don't know the solution, but I think the cause is different than you 
    >>> think, because I also have the problem with other strings not 
    >>> including "\n":
    >>> $ LANGUAGE=de Rscript -e 'gettext("malformed version string", 
    >>> domain="R")'
    >>> [1] "malformed version string"

    > You need domain="R-base" for the  "malformed version string"

    >> I can reproduce Henrik's report and the problem there is that the 
    >> trailing \n is stripped by R before doing the lookup, in do_gettext

    >>             /* strip leading and trailing white spaces and
    >>                add back after translation */
    >>             for(p = tmp;
    >>                 *p && (*p == ' ' || *p == '\t' || *p == '\n');
    >>                 p++, ihead++) ;

    >> But, calling dgettext with the trailing \n does translate correctly for me.

    >> I'd leave to translation experts how this should work (e.g. whether the 
    >> .po files should have trailing newlines).

    > Thanks a lot, Tomas.
    > This is "interesting" .. and I think an R bug  one way or the
    > other (and I also note that Henrik's guess was also right on !).

    > We have the following:

    > - New translation *.po source files are to be made from the original *.pot  files.

    > In our case it's our code that produces  R.pot and R-base.pot
    > (and more for the non-base packages, and more e.g. for
    > Recommended packages 'Matrix' and 'cluster' I maintain).

    > And notably the R.pot (from all the "base" C error/warn/.. messages)
    > contains tons of msgid strings of the form  ".......\n"
    > i.e., ending in \n.
    > From that automatically the translator's  *.po files should also
    > end in \n.

    > Additionally, the GNU gettext FAQ has
    > (here :   https://www.gnu.org/software/gettext/FAQ.html#newline )

    > ------------------------------------------------
    > Q: What does this mean: “'msgid' and 'msgstr' entries do not both end with '\n'”

    > A: It means that when the original string ends in a newline, your translation must also end in a newline. And if the original string does not end in a newline, then your translation should likewise not have a newline at the end.
    > ------------------------------------------------
    > From all that I'd conclude that we (R base code) are the source
    > of the problem.
    > Given the above FAQ, it seems common in other projects also to
    > have such trailing \n  and so we should really change the C code
    > you cite above.

    > On the other hand, this is from almost the very beginning of
    > when Brian added translation to R,
    > ------------------------------------------------------------------------
    > r32938 | ripley | 2005-01-30 20:24:04 +0100 (Sun, 30 Jan 2005) | 2 lines

    > include \n in whitespace ignored for R-level gettext
    > ------------------------------------------------------------------------

    > I think this has been because simultaneously we had started to
    > emphasize to useRs  they should *not* end message/format strings
    > in stop() / warning()  with a newline, but rather stop() and
    > warning() would *add* the newline(s) themselves.

    > Still, currently we have a few such cases in  R-base.pot,
    > but just these few and maybe they really are "in error", in the
    > sense we could drop the ending '\n' (and do the same in all the *.po files!),
    > and newlines would be appended later {{not just by RStudio which
    > graciously adds final newlines in its R console, even for say
    > cat("abc") }}

    > However, this is quite different for all the message strings from C, as
    > used there in  error() or warn() e.g., and so in   R.pot
    > we see many many msg strings ending in "\n" (which must then
    > also be in the *.po files).

    > My current conclusion is we should try simplifying the
    > do_gettext() code and *not* remove and re-add the '\n' (nor the
    > '\t' I think ...)

After such a change, I indeed  do see

$ LANGUAGE=de bin/Rscript --vanilla -e 'gettext("Execution halted\n", domain="R")'
[1] "Ausführung angehalten\n"
$ LANGUAGE=de bin/Rscript --vanilla -e 'message("Execution halted\n", domain="R")'
Ausführung angehalten

$ LANGUAGE=de bin/Rscript --vanilla -e 'warning("Execution halted\n", domain="R")'
Ausführung angehalten

(note the extra newline after the German translation!)
whereas before, not only did using  gettext() directly not work,
but messages passed to warning() or message()  {with or without trailing \n}
were never translated either.

... and my simple  #ifdef .. #endif change around the head/tail
save and restore seems to pass make check-devel ...

so I will be looking into dropping all those "head" and "tail" add
and remove parts in do_gettext() as they really seem to do harm given the current
translation databases which indeed *are* full of final '\n' in
`msgid` and corresponding translated `msgstr` ....
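
To make the failure mode concrete, here is a minimal sketch in C of the
"strip whitespace, look up, add whitespace back" pattern. It is a toy
model under stated assumptions, not the actual do_gettext() source: the
two-entry catalog and the names toy_lookup() and strip_lookup() are
hypothetical, and the exact-match lookup only imitates how dgettext()
returns the msgid unchanged when it finds no translation.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical two-entry catalog standing in for a compiled .po file. */
struct entry { const char *msgid; const char *msgstr; };
static const struct entry catalog[] = {
    { "Execution halted\n", "Ausführung angehalten\n" },
    { "cannot get working directory",
      "kann das Arbeitsverzeichnis nicht ermitteln" },
};

/* Exact-match lookup; returns the msgid itself when there is no
   translation, which is also what dgettext() does. */
static const char *toy_lookup(const char *msgid)
{
    for (size_t i = 0; i < sizeof catalog / sizeof catalog[0]; i++)
        if (strcmp(catalog[i].msgid, msgid) == 0)
            return catalog[i].msgstr;
    return msgid;
}

static int is_ws(char c) { return c == ' ' || c == '\t' || c == '\n'; }

/* Strip leading and trailing whitespace, look up only the core of the
   message, then add the stripped whitespace back around the result. */
static void strip_lookup(const char *msg, char *out, size_t outlen)
{
    size_t len = strlen(msg), head = 0, tail = 0;
    char core[256];

    while (head < len && is_ws(msg[head])) head++;
    while (tail < len - head && is_ws(msg[len - 1 - tail])) tail++;

    size_t corelen = len - head - tail;
    if (corelen >= sizeof core) corelen = sizeof core - 1; /* truncate */
    memcpy(core, msg + head, corelen);
    core[corelen] = '\0';

    snprintf(out, outlen, "%.*s%s%s",
             (int) head, msg, toy_lookup(core), msg + len - tail);
}
```

In this sketch, strip_lookup("Execution halted\n", ...) searches for
"Execution halted" without the newline, finds no such msgid, and hands
back the English text, whereas toy_lookup("Execution halted\n") matches
the catalog entry directly. A msgid stored without surrounding
whitespace, such as "cannot get working directory", is translated
either way.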

So, no need for a bugzilla PR nor a patch, please.
Maybe further examples which add something interesting in
addition to the ones we have here.

Thank you again, Henrik, Duncan, and Tomas!


