[Rd] NEWS item for bugfix in normalizePath and file.exists?

Tomas Kalibera tomas.kalibera at gmail.com
Wed Apr 28 18:40:42 CEST 2021


On 4/28/21 6:20 PM, Toby Hocking wrote:
> +1 for Martin's proposal, that makes sense to me too.
> About Tomas' idea to immediately stop with an error when the user tries to
> create a string which is invalid in its declared encoding, that sounds
> great. I'm just wondering if that would break my application. My package is
> running an example during a check, in which the unicode/emoji is read into
> R using readLines from a file under inst/extdata, so presumably it should
> work as long as readLines handles the encoding correctly and/or the locale
> during package check is changed to something more reasonable on windows?

Once we have UTF-8 as the native encoding on Windows, things like this
should work reliably. That should already be the case with the
experimental UCRT builds.
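
For illustration, a minimal sketch of how one could check from within R
whether the running session already uses UTF-8 as its native encoding.
The variable name is made up here, and the result depends on the build
and the locale in effect:

```r
# l10n_info() reports properties of the current locale; its "UTF-8"
# element is TRUE when the native encoding of the session is UTF-8.
native_is_utf8 <- isTRUE(l10n_info()[["UTF-8"]])

if (native_is_utf8) {
  message("Native encoding is UTF-8.")
} else {
  message("Native encoding is not UTF-8; non-ASCII file names may need translation.")
}
```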

Even in the MSVCRT/official builds, things like this can work on Windows
in some cases, depending on whether they trigger translation to the
native encoding. E.g. readLines() with the encoding="UTF-8" argument
would produce strings flagged as UTF-8, so ones that can be translated
to UTF-16LE if they are valid. Some file operations on Windows work with
UTF-8 path names and avoid translation to the native encoding, but not
all do, and instead of investing effort into fixing more of them I think
we should invest in switching to UTF-8 as the native encoding.
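
A minimal sketch of the readLines() case, assuming a temporary file as a
stand-in for the one under inst/extdata (the file content is made up for
illustration):

```r
# Hypothetical stand-in for a file under inst/extdata: one line of UTF-8
# text ("caf" plus e-acute, then a newline), written as raw bytes so that
# no translation happens on the way out.
f <- tempfile(fileext = ".txt")
writeBin(as.raw(c(0x63, 0x61, 0x66, 0xc3, 0xa9, 0x0a)), f)

# encoding = "UTF-8" does not re-encode the input; it only marks the
# resulting strings as being in UTF-8.
x <- readLines(f, encoding = "UTF-8")

Encoding(x)   # declared encoding of the string
validUTF8(x)  # whether the bytes really are valid UTF-8

unlink(f)
```

Whether a later file operation on such a string works still depends on
whether that operation translates it to the native encoding.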

Actually, by using emoji you may also trigger bugs in places where
supplementary characters are not supported on Windows. This remains
relevant after the switch to UTF-8 as the native encoding, so it needs
to be fixed fully; there have been some improvements recently.
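
To illustrate what "supplementary" means here, a small sketch using the
first character of the file name in your example (bytes \360\237\247\222).
The variable names are only for illustration:

```r
# Build the character from its four UTF-8 bytes and declare the encoding.
emoji <- rawToChar(as.raw(c(0xf0, 0x9f, 0xa7, 0x92)))
Encoding(emoji) <- "UTF-8"

# Unicode code point of the character.
cp <- utf8ToInt(emoji)

# Supplementary characters lie above the Basic Multilingual Plane, so
# they do not fit into a single 16-bit UTF-16 code unit.
is_supplementary <- cp > 0xFFFF

# In UTF-16 such a character is represented by a surrogate pair.
offset <- cp - 0x10000
high <- 0xD800 + offset %/% 1024  # high (leading) surrogate
low  <- 0xDC00 + offset %% 1024   # low (trailing) surrogate
```

Code that assumes one 16-bit unit per character is where such characters
tend to break.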

Tomas

>
> On Wed, Apr 28, 2021 at 9:04 AM Tomas Kalibera <tomas.kalibera using gmail.com>
> wrote:
>
>> On 4/28/21 5:22 PM, Martin Maechler wrote:
>>>>>>>> Toby Hocking
>>>>>>>>       on Wed, 28 Apr 2021 07:21:05 -0700 writes:
>>>       > Hi Tomas, thanks for the thoughtful reply. That makes sense about
>> the
>>>       > problems with C locale on windows. Actually I did not choose to
>> use C
>>>       > locale, but instead it was invoked automatically during a package
>> check.
>>>       > To be clear, I do NOT have a file with that name, but I do want
>> file.exists
>>>       > to return a reasonable value, FALSE (with no error). If that
>> behavior is
>>>       > unspecified, then should I use something like
>> tryCatch(file.exists(x),
>>>       > error=function(e)FALSE) instead of assuming that file.exists will
>> always
>>>       > return a logical vector without error? For my particular
>> application that
>>>       > work-around should probably be sufficient, but one may imagine a
>> situation
>>>       > where you want to do
>>>
>>>       > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
>>>       > \360\237\247\222\360\237\217\274\n|
>> \360\237\247\222\360\237\217\275\n|
>>>       > \360\237\247\222\360\237\217\276\n|
>> \360\237\247\222\360\237\217\277\n"
>>>       > Encoding(x) <- "unknown"
>>>       > Sys.setlocale(locale="C")
>>>       > f <- tempfile()
>>>       > cat("", file = f)
>>>       > two <- c(x, f)
>>>       > file.exists(two)
>>>
>>>       > and in that case the correct response from R, in my opinion,
>> would be
>>>       > c(FALSE, TRUE) -- not an error.
>>>       > Toby
>>>
>>> Indeed, thanks a lot to Tomas!
>>>
>>> # A remark
>>> We *could* -- and according to my taste should -- try to have
>> file.exists()
>>> return a logical vector in almost all cases, namely, e.g., still give an
>>> error for file.exists(pi) :
>>> Notably  if  `c(...)`  {for the  `...`  arguments of file.exists() }
>>> is a character vector, always return a logical vector of the same
>>> length, *and* we could notably make use of the fact that R's
>>> logical type is not binary but ternary, and hence that return
>>> value could contain values from {TRUE, NA, FALSE}  and interpret NA
>>> as "don't know" in all cases where the corresponding string in
>>> the input had an Encoding(.) that was "fishy" in some sense
>>> given the "context" (OS, locale, OS_version, ICU-presence, ...).
>>>
>>> In particular, when the underlying code sees encoding-translation issues
>>> for a string,  NA  would be returned instead of an error.
>> Yes, I agree with Toby and you that there is benefit in allowing
>> per-element, vectorized use of file.exists(), and indeed that is the case
>> now: we just fall back to FALSE. NA might be better in case of an error
>> that prevents the function from deciding whether the file exists or not
>> (e.g. an invalid name in a form that makes it clear such a file cannot exist
>> might be a different case...).
>>
>> But, the only way to get a translation error is by passing a string to
>> file.exists() which is invalid in its declared encoding (or which is in
>> "C" encoding). I would hope that we could get to the point where such a
>> situation is prevented (we only allow creation of strings that can be
>> translated to Unicode). If we get there, the example would fail with an
>> error (though, right, already before getting to file.exists()).
>>
>> My point that I would not write tests of this behavior stands. One
>> should not use such file names, and after the change Toby reported from
>> ERROR to FALSE, Martin's proposal would change to NA, mine eventually to
>> ERROR, etc. So it is best for now to leave it unspecified and not
>> trigger it, I think.
>>
>> Tomas
>>
>>> Martin
>>>
>>>       > On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera <
>> tomas.kalibera using gmail.com>
>>>       > wrote:
>>>
>>>       >> Hi Toby,
>>>       >>
>>>       >> a defensive, portable approach would be to use only file names
>> regarded
>>>       >> portable by POSIX, so characters including ASCII letters, digits,
>>>       >> underscore, dot, hyphen (but hyphen should not be the first
>> character).
>>>       >> That would always work on all systems and this is what I would
>> use.
>>>       >>
>>>       >> Individual operating systems and file systems and their
>> configurations
>>>       >> differ in which additional characters they support and how. On
>> some,
>>>       >> file names are just sequences of bytes, on some, they have to be
>> valid
>>>       >> strings in certain encoding (and then with certain exceptions).
>>>       >>
>>>       >> On Windows, file names are at the lowest level in UTF-16LE
>> encoding (and
>>>       >> admitting unpaired surrogates for historical reasons). R stores
>> strings
>>>       >> in other encodings (UTF-8, native, Latin-1), so file names have
>> to be
>>>       >> translated to/from UTF-16LE, either directly by R or by Windows.
>>>       >>
>>>       >> But, there is no way to convert (non-ASCII) strings in "C"
>> encoding to
>>>       >> UTF-16LE, so the examples cannot be made to work on Windows.
>>>       >>
>>>       >> When the translation is left to Windows, it assumes the
>> non-UTF-16LE
>>>       >> strings are in the Active Code Page encoding (shown as "system
>> encoding"
>>>       >> in sessionInfo() in R, Latin-1 in your example) instead of the
>> current C
>>>       >> library encoding ("C" in your example). So, file names coming
>> from
>>>       >> Windows will be either the bytes of their UTF-16LE
>> representation or the
>>>       >> bytes of their Latin-1 representation, but which one is subject
>> to the
>>>       >> implementation details, so the result is really unusable.
>>>       >>
>>>       >> I would say using "C" as encoding in R is not a good idea, and
>>>       >> particularly not on Windows.
>>>       >>
>>>       >> I would say that what happens with such file names in "C"
>> encoding is
>>>       >> unspecified behavior, which is subject to change at any time
>> without
>>>       >> notice, and that both the R 4.0.5 and R-devel behavior you are
>> observing
>>>       >> are acceptable. I don't think it should be mentioned in the NEWS.
>>>       >> Personally, I would prefer some stricter checks of strings
>> validity and
>>>       >> perhaps disallowing the "C" encoding in R, so yet another
>> behavior where
>>>       >> it would be clearer that this cannot really work, but that would
>> require
>>>       >> more thought and effort.
>>>       >>
>>>       >> Best
>>>       >> Tomas
>>>       >>
>>>       >>
>>>       >> On 4/27/21 9:53 PM, Toby Hocking wrote:
>>>       >>
>>>       >> > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be
>> fixed in
>>>       >> > R-devel already. I checked on
>>>       >> > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and
>> there is no
>>>       >> > mention of these changes, so I'm wondering if they are
>> intentional? If
>>>       >> so,
>>>       >> > could someone please add a mention of the bugfix in the NEWS?
>>>       >> >
>>>       >> > The problem involves file.exists, on windows, when a
>> long/strange input
>>>       >> > file name Encoding is unknown, in C locale. I expected that
>> FALSE should
>>>       >> be
>>>       >> > returned (and it is on R-devel), but I got an error in
>> R-4.0.5. Code to
>>>       >> > reproduce is:
>>>       >> >
>>>       >> > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
>>>       >> > \360\237\247\222\360\237\217\274\n|
>> \360\237\247\222\360\237\217\275\n|
>>>       >> > \360\237\247\222\360\237\217\276\n|
>> \360\237\247\222\360\237\217\277\n"
>>>       >> > Encoding(x) <- "unknown"
>>>       >> > Sys.setlocale(locale="C")
>>>       >> > sessionInfo()
>>>       >> > file.exists(x)
>>>       >> >
>>>       >> > Output I got from R-4.0.5 was
>>>       >> >
>>>       >> >> sessionInfo()
>>>       >> > R version 4.0.5 (2021-03-31)
>>>       >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>       >> > Running under: Windows 10 x64 (build 19042)
>>>       >> >
>>>       >> > Matrix products: default
>>>       >> >
>>>       >> > locale:
>>>       >> > [1] C
>>>       >> > system code page: 1252
>>>       >> >
>>>       >> > attached base packages:
>>>       >> > [1] stats     graphics  grDevices utils     datasets  methods
>>   base
>>>       >> >
>>>       >> > loaded via a namespace (and not attached):
>>>       >> > [1] compiler_4.0.5
>>>       >> >> file.exists(x)
>>>       >> > Error in file.exists(x) : file name conversion problem -- name
>> too long?
>>>       >> > Execution halted
>>>       >> >
>>>       >> > Output I got from R-devel was
>>>       >> >
>>>       >> >> sessionInfo()
>>>       >> > R Under development (unstable) (2021-04-26 r80229)
>>>       >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
>>>       >> > Running under: Windows 10 x64 (build 19042)
>>>       >> >
>>>       >> > Matrix products: default
>>>       >> >
>>>       >> > locale:
>>>       >> > [1] C
>>>       >> >
>>>       >> > attached base packages:
>>>       >> > [1] stats     graphics  grDevices utils     datasets  methods
>>   base
>>>       >> >
>>>       >> > loaded via a namespace (and not attached):
>>>       >> > [1] compiler_4.2.0
>>>       >> >> file.exists(x)
>>>       >> > [1] FALSE
>>>       >> >
>>>       >> > I also observed similar results when using normalizePath
>> instead of
>>>       >> > file.exists (error in R-4.0.5, no error in R-devel).
>>>       >> >
>>>       >> >> normalizePath(x) #R-4.0.5
>>>       >> > Error in path.expand(path) : unable to translate 'p'
>>>       >> > | p'p;
>>>       >> > | p'p<
>>>       >> > | p'p=
>>>       >> > | p'p>
>>>       >> > | p'p<bf>
>>>       >> > ' to UTF-8
>>>       >> > Calls: normalizePath -> path.expand
>>>       >> > Execution halted
>>>       >> >
>>>       >> >> normalizePath(x) #R-devel
>>>       >> > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n|
>>>       >> > \360\237\247\222\360\237\217\273\n|
>> \360\237\247\222\360\237\217\274\n|
>>>       >> > \360\237\247\222\360\237\217\275\n|
>> \360\237\247\222\360\237\217\276\n|
>>>       >> > \360\237\247\222\360\237\217\277\n"
>>>       >> > Warning message:
>>>       >> > In normalizePath(path.expand(path), winslash, mustWork) :
>> path[1]="🧒
>>>       >> > | 🧒🏻
>>>       >> > | 🧒🏼
>>>       >> > | 🧒🏽
>>>       >> > | 🧒🏾
>>>       >> > | 🧒🏿
>>>       >> > ": The filename, directory name, or volume label syntax is
>> incorrect
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


