[Rd] Encoding errors in Rd files

Wed Jul 25 10:04:04 CEST 2012

On 12-07-25 3:24 AM, steven mosher wrote:
> Thanks, Dr. Ripley.
>
>   When I read the instructions
> " If the DESCRIPTION file is not entirely in ASCII it should contain an ‘
> Encoding’ field specifying an encoding. This is used as the encoding of the
> DESCRIPTION file itself and of the R and NAMESPACE files, and as the
> default encoding of .Rd files.  "
>
> I assumed that I should specify an encoding of UTF-8 in the DESCRIPTION
> file to handle the specific Rd files that were having problems.
>
> After I removed the Encoding field (UTF-8) I had added to the DESCRIPTION
> file, the following error no longer occurred:
>
> "  Package inputenc Error: Keyboard character used is undefined
> (inputenc)                in inputencoding `utf8'. "
>
> However, the Rd files still had errors.
>
> Looking at the LaTeX versions of the Rd files, I was able to spot which
> characters in the Rd files were offending. In my normal editor they were
> simply not showing up, so I had no idea what characters were causing the
> problem.
>
> The only thing I am left with is the entries in the data frame. The package
> came with some predefined .rda files in its data subdirectory. Using
> Encoding() on the column of the data frame that contains the following
> items
>
>   checking data for non-ASCII characters ... WARNING
>     Warning: found non-ASCII string(s)
>     'Tourbihre de la Rivihre-aux-Feu' in object 'modpoll'
>     'Lac ` la Fourche' in object 'modpoll'
>     'Lac ` la Loutre' in object 'modpoll'
>     'Lac Kinogami' in object 'modpoll'
>
> I get "unknown" for all the items. So, if I understand you, I should take
> this data frame and change the encoding to UTF-8.

The issue is almost certainly with accented characters in the text.  For 
example, the 5th letter in Rivière is an e with a grave accent.  It is 
displayed as "h" in your error message, because R was not told how to 
interpret the way it is stored in the source file, or was told something 
that turned out to be incorrect.
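
As a rough illustration, here is a small sketch in R. It assumes the original bytes were latin1, which the check output does not confirm. Under that assumption the e-grave is the single byte 0xE8, and dropping its high bit leaves 0x68, which is "h". The same arithmetic turns an a-grave (0xE0) into a backquote. The string below is a made-up stand-in built from raw bytes, not taken from the package.

```r
# Hypothetical stand-in: "Riviere" with an e-grave, stored as latin1 bytes.
x <- rawToChar(as.raw(c(0x52, 0x69, 0x76, 0x69, 0xe8, 0x72, 0x65)))

Encoding(x)    # "unknown": nothing tells R how to interpret the bytes
validUTF8(x)   # FALSE: 0xE8 followed by "r" is not a valid UTF-8 sequence

# Dropping the high bit of 0xE8 gives 0x68, the ASCII code for "h".
stripped <- rawToChar(as.raw(bitwAnd(0xE8L, 0x7FL)))

# Declare what the bytes are (an assumption), then convert to UTF-8.
Encoding(x) <- "latin1"
y <- enc2utf8(x)
Encoding(y)    # "UTF-8"
charToRaw(y)   # the e-grave is now the two bytes c3 a8
```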

You need to change the source file so that it is stored in the UTF-8 
encoding.  That means you should read the file into an editor that 
displays it correctly (and that's sometimes hard when you don't know the 
original encoding; you may need to do some manual editing), then save it 
again, specifying that it should be saved using the UTF-8 encoding.  How 
you do that depends on your editor.

Then when you tell R that it is encoded in UTF-8, R will read it 
properly and won't complain.
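
If you would rather do the conversion from R than from an editor, something like the following sketch may work. The helper name reencode_file() and the file names are hypothetical, and from = "latin1" is only a guess about the original encoding; check the result by eye before trusting it.

```r
# Sketch: re-encode a text file (for example an Rd file) to UTF-8.
# reencode_file() is a hypothetical helper; 'from' is an assumed source encoding.
reencode_file <- function(infile, outfile, from = "latin1") {
  lines <- readLines(infile, warn = FALSE)          # bytes are read as-is
  utf8 <- iconv(lines, from = from, to = "UTF-8")   # NA where conversion fails
  if (anyNA(utf8)) stop("some lines could not be converted from ", from)
  writeLines(utf8, outfile, useBytes = TRUE)        # write the UTF-8 bytes unchanged
  invisible(utf8)
}

# Demonstration on a temporary file containing "Lac", an a-grave, and "la" in latin1.
src <- tempfile(fileext = ".Rd")
dst <- tempfile(fileext = ".Rd")
writeBin(as.raw(c(0x4c, 0x61, 0x63, 0x20, 0xe0, 0x20, 0x6c, 0x61, 0x0a)), src)
reencode_file(src, dst)
out <- readLines(dst, encoding = "UTF-8", warn = FALSE)
validUTF8(out)   # TRUE if the output file now holds valid UTF-8
```

Writing to a separate output file keeps the original around in case the guessed encoding turns out to be wrong.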

The tools::showNonASCIIfile() function can help to find characters that 
may need fixing.  R can recognize when things are not ASCII (those bytes 
have the high bit set), but it will be up to you to figure out what 
encoding was actually used.  For French, latin1 is a good guess but it 
is not necessarily right.
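
For the strings inside a data frame, a similar approach can be sketched in R. The data frame below is a made-up stand-in for the real 'modpoll' object, the column name 'site' is hypothetical, and latin1 is again only a guess. tools::showNonASCII() is the character-vector counterpart of tools::showNonASCIIfile().

```r
# Hypothetical stand-in for a character column of 'modpoll', built from raw
# latin1 bytes so the example does not depend on how this message is encoded.
loutre <- rawToChar(as.raw(c(0x4c, 0x61, 0x63, 0x20, 0xe0, 0x20, 0x6c, 0x61,
                             0x20, 0x4c, 0x6f, 0x75, 0x74, 0x72, 0x65)))
df <- data.frame(site = c("Plain ASCII name", loutre), stringsAsFactors = FALSE)

# Print the offending entries, with each non-ASCII byte shown as <xx>.
tools::showNonASCII(df$site)

# Flag non-ASCII entries: conversion to ASCII fails (gives NA) for them.
non_ascii <- is.na(iconv(df$site, from = "latin1", to = "ASCII"))

# Convert the flagged entries under the latin1 guess; iconv() marks them as UTF-8.
df$site[non_ascii] <- iconv(df$site[non_ascii], from = "latin1", to = "UTF-8")
Encoding(df$site)   # pure-ASCII entries stay "unknown", converted ones are "UTF-8"

# After checking the result by eye, the object would be saved again with save().
```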

Duncan Murdoch

>
> Sorry for being so dense
>
> On Tue, Jul 24, 2012 at 1:46 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>
>> On 24/07/2012 21:08, steven mosher wrote:
>>
>>> Well, I'm working on a project trying to bring an old package, last
>>> published for R 1.9, back to life.
>>> I'm almost there, but I am getting killed by an encoding error in the Rd
>>> files.
>>>
>>> After reading the manual, I decided to try UTF-8.  Mostly because I could
>>> spell it. ha.
>>>
>>> That got me a bit closer but I still have these warnings
>>>
>>> * checking data for non-ASCII characters ... WARNING
>>>     Warning: found non-ASCII string(s)
>>>     'Tourbihre de la Rivihre-aux-Feu' in object 'modpoll'
>>>     'Lac ` la Fourche' in object 'modpoll'
>>>     'Lac ` la Loutre' in object 'modpoll'
>>>     'Lac Kinogami' in object 'modpoll'
>>>
>>
>> How to handle those is in 'Writing R Extensions': basically convert to
>> UTF-8 and mark them as UTF-8.
>>
>>
>>> * checking data for ASCII and uncompressed saves ... OK
>>> * checking examples ... OK
>>> * checking PDF version of manual ... WARNING
>>> LaTeX errors when creating PDF version.
>>> This typically indicates Rd problems.
>>> LaTeX errors found:
>>>    ! Package inputenc Error: Keyboard character used is undefined
>>> (inputenc)                in inputencoding `utf8'.
>>>
>>> I'll keep searching the help list archives for a clue, but if somebody
>>> could point me at educational material, it's really time
>>> that I learn this aspect.
>>>
>>
>> Without the actual file we can do little.  The message means that
>> something in the manual inputs (and it could be the DESCRIPTION file or an
>> Rd file) contains a character not known to LaTeX.  Most likely it is simply
>> not a UTF-8 character, but it could also be outside LaTeX's gamut.
>>
>> Normally the LaTeX log (which is in the check output) is more revealing:
>> you can also try this part alone with R CMD Rd2pdf (and R CMD Rd2pdf
>> --no-description often points the finger at the DESCRIPTION file).
>>
>>
>>
>>> I've read http://developer.r-project.org/Encodings_and_R.html
>>>
>>> How do I figure out which encoding to use with the error seen above?
>>>
>>
>> Assuming this is not something esoteric, UTF-8 is the most comprehensive
>> choice, but LaTeX's UTF-8 coverage (and that of the fonts used) is heavily
>> biased to Western European scripts.  So for example for Lithuanian you may
>> want to choose something else (Latin-7?).
>>
>>
>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>
>
>
>
> ______________________________________________
> R-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
>