[Rd] NEWS item for bugfix in normalizePath and file.exists?

Toby Hocking tdhock5 @end|ng |rom gm@||@com
Wed Apr 28 18:20:52 CEST 2021


+1 for Martin's proposal, that makes sense to me too.
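
To make that concrete, here is a rough user-level sketch of the ternary
idea. It is not how file.exists() is or would be implemented in R itself.
The name file_exists_na is hypothetical, and the rule used for a "fishy"
string (declared UTF-8 but not valid UTF-8, or any error from
file.exists) is only an assumption for illustration.

```r
# Hypothetical wrapper: one logical value per input string, NA for "don't know".
file_exists_na <- function(paths) {
  stopifnot(is.character(paths))
  one <- function(p) {
    # Assumption for this sketch: a string declared as UTF-8 whose bytes are
    # not valid UTF-8 counts as "fishy", so the answer is "don't know".
    if (Encoding(p) == "UTF-8" && !validUTF8(p)) return(NA)
    # Any error raised by file.exists() is also mapped to "don't know".
    tryCatch(file.exists(p), error = function(e) NA)
  }
  vapply(paths, one, logical(1), USE.NAMES = FALSE)
}

f <- tempfile()
cat("", file = f)                              # an existing (empty) file
missing <- file.path(tempdir(), "no-such-file")
bad <- rawToChar(as.raw(0xff))                 # a single byte that is never valid UTF-8
Encoding(bad) <- "UTF-8"                       # declare it UTF-8 anyway
res <- file_exists_na(c(f, missing, bad))
unlink(f)
```

Here res has one element per input string, and only the third one is NA.
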
About Tomas' idea to immediately stop with an error when the user tries to
create a string which is invalid in its declared encoding, that sounds
great. I'm just wondering if that would break my application. My package
runs an example during a check, in which the Unicode/emoji text is read into
R using readLines from a file under inst/extdata, so presumably it should
work as long as readLines handles the encoding correctly and/or the locale
during package check is changed to something more reasonable on Windows?
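
For reference, this is roughly the kind of read I have in mind. It is only
a sketch: the temporary file is a stand-in for the real file under
inst/extdata, it assumes the file is stored as UTF-8, and I have not
verified how it behaves under the check locale on Windows.

```r
# Stand-in for a file under inst/extdata, written as raw UTF-8 bytes so that
# creating it does not depend on the current locale.
f <- tempfile(fileext = ".txt")
emoji.bytes <- as.raw(c(0xf0, 0x9f, 0xa7, 0x92, 0x0a))  # "\360\237\247\222" plus a newline
writeBin(emoji.bytes, f)

# Declare the encoding on read, so that non-ASCII strings are marked as UTF-8
# instead of being left in the native (possibly "C") encoding.
x <- readLines(f, encoding = "UTF-8")
enc <- Encoding(x)   # declared encoding of each element
ok <- validUTF8(x)   # TRUE where the bytes are valid UTF-8
unlink(f)
```

If enc is "UTF-8" and ok is TRUE, the string should be valid in its declared
encoding, which is the condition Tomas' stricter check would test.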

On Wed, Apr 28, 2021 at 9:04 AM Tomas Kalibera <tomas.kalibera using gmail.com>
wrote:

>
> On 4/28/21 5:22 PM, Martin Maechler wrote:
> >>>>>> Toby Hocking
> >>>>>>      on Wed, 28 Apr 2021 07:21:05 -0700 writes:
> >      > Hi Tomas, thanks for the thoughtful reply. That makes sense about the
> >      > problems with C locale on windows. Actually I did not choose to use C
> >      > locale, but instead it was invoked automatically during a package check.
> >      > To be clear, I do NOT have a file with that name, but I do want file.exists
> >      > to return a reasonable value, FALSE (with no error). If that behavior is
> >      > unspecified, then should I use something like tryCatch(file.exists(x),
> >      > error=function(e)FALSE) instead of assuming that file.exists will always
> >      > return a logical vector without error? For my particular application that
> >      > work-around should probably be sufficient, but one may imagine a situation
> >      > where you want to do
> >
> >      > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
> >      > \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
> >      > \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> >      > Encoding(x) <- "unknown"
> >      > Sys.setlocale(locale="C")
> >      > f <- tempfile()
> >      > cat("", file = f)
> >      > two <- c(x, f)
> >      > file.exists(two)
> >
> >      > and in that case the correct response from R, in my opinion, would be
> >      > c(FALSE, TRUE) -- not an error.
> >      > Toby
> >
> > Indeed, thanks a lot to Tomas!
> >
> > # A remark
> > We *could* -- and according to my taste should -- try to have file.exists()
> > return a logical vector in almost all cases, namely, e.g., still give an
> > error for file.exists(pi) :
> > Notably  if  `c(...)`  {for the  `...`  arguments of file.exists() }
> > is a character vector, always return a logical vector of the same
> > length, *and* we could notably make use of the fact that R's
> > logical type is not binary but ternary, and hence that return
> > value could contain values from {TRUE, NA, FALSE}  and interpret NA
> > as "don't know" in all cases where the corresponding string in
> > the input had an Encoding(.) that was "fishy" in some sense
> > given the "context" (OS, locale, OS_version, ICU-presence, ...).
> >
> > In particular, when the underlying code sees encoding-translation issues
> > for a string,  NA  would be returned instead of an error.
>
> Yes, I agree with Toby and you that there is benefit in allowing
> per-element, vectorized use of file.exists(), and well it is the case
> now, we just fall back to FALSE. NA might be better in case of an error
> that prevents the function from deciding whether the file exists or not
> (e.g. an invalid name in a form that makes it clear such a file cannot exist
> might be a different case...).
>
> But, the only way to get a translation error is by passing a string to
> file.exists() which is invalid in its declared encoding (or which is in
> "C" encoding). I would hope that we could get to the point where such a
> situation is prevented (we only allow creation of strings that can be
> translated to Unicode). If we get there, the example would fail with an
> error (yet, right, before getting to file.exists()).
>
> My point that I would not write tests of this behavior stands. One
> should not use such file names, and after the change Toby reported from
> ERROR to FALSE, Martin's proposal would change to NA, mine eventually to
> ERROR, etc. So it is best for now to leave it unspecified and not
> trigger it, I think.
>
> Tomas
>
> >
> > Martin
> >
> >      > On Wed, Apr 28, 2021 at 3:10 AM Tomas Kalibera <tomas.kalibera using gmail.com>
> >      > wrote:
> >
> >      >> Hi Toby,
> >      >>
> >      >> a defensive, portable approach would be to use only file names regarded
> >      >> portable by POSIX, so characters including ASCII letters, digits,
> >      >> underscore, dot, hyphen (but hyphen should not be the first character).
> >      >> That would always work on all systems and this is what I would use.
> >      >>
> >      >> Individual operating systems and file systems and their configurations
> >      >> differ in which additional characters they support and how. On some,
> >      >> file names are just sequences of bytes, on some, they have to be valid
> >      >> strings in certain encoding (and then with certain exceptions).
> >      >>
> >      >> On Windows, file names are at the lowest level in UTF-16LE encoding (and
> >      >> admitting unpaired surrogates for historical reasons). R stores strings
> >      >> in other encodings (UTF-8, native, Latin-1), so file names have to be
> >      >> translated to/from UTF-16LE, either directly by R or by Windows.
> >      >>
> >      >> But, there is no way to convert (non-ASCII) strings in "C" encoding to
> >      >> UTF-16LE, so the examples cannot be made to work on Windows.
> >      >>
> >      >> When the translation is left on Windows, it assumes the non-UTF-16LE
> >      >> strings are in the Active Code Page encoding (shown as "system encoding"
> >      >> in sessionInfo() in R, Latin-1 in your example) instead of the current C
> >      >> library encoding ("C" in your example). So, file names coming from
> >      >> Windows will be either the bytes of their UTF-16LE representation or the
> >      >> bytes of their Latin-1 representation, but which one is subject to the
> >      >> implementation details, so the result is really unusable.
> >      >>
> >      >> I would say using "C" as encoding in R is not a good idea, and
> >      >> particularly not on Windows.
> >      >>
> >      >> I would say that what happens with such file names in "C" encoding is
> >      >> unspecified behavior, which is subject to change at any time without
> >      >> notice, and that both the R 4.0.5 and R-devel behavior you are observing
> >      >> are acceptable. I don't think it should be mentioned in the NEWS.
> >      >> Personally, I would prefer some stricter checks of strings validity and
> >      >> perhaps disallowing the "C" encoding in R, so yet another behavior where
> >      >> it would be clearer that this cannot really work, but that would require
> >      >> more thought and effort.
> >      >>
> >      >> Best
> >      >> Tomas
> >      >>
> >      >>
> >      >> On 4/27/21 9:53 PM, Toby Hocking wrote:
> >      >>
> >      >> > Hi all, Today I noticed bug(s?) in R-4.0.5, which seem to be fixed in
> >      >> > R-devel already. I checked on
> >      >> > https://developer.r-project.org/blosxom.cgi/R-devel/NEWS and there is no
> >      >> > mention of these changes, so I'm wondering if they are intentional? If so,
> >      >> > could someone please add a mention of the bugfix in the NEWS?
> >      >> >
> >      >> > The problem involves file.exists, on windows, when a long/strange input
> >      >> > file name Encoding is unknown, in C locale. I expected that FALSE should be
> >      >> > returned (and it is on R-devel), but I got an error in R-4.0.5. Code to
> >      >> > reproduce is:
> >      >> >
> >      >> > x <- "\360\237\247\222\n| \360\237\247\222\360\237\217\273\n|
> >      >> > \360\237\247\222\360\237\217\274\n| \360\237\247\222\360\237\217\275\n|
> >      >> > \360\237\247\222\360\237\217\276\n| \360\237\247\222\360\237\217\277\n"
> >      >> > Encoding(x) <- "unknown"
> >      >> > Sys.setlocale(locale="C")
> >      >> > sessionInfo()
> >      >> > file.exists(x)
> >      >> >
> >      >> > Output I got from R-4.0.5 was
> >      >> >
> >      >> >> sessionInfo()
> >      >> > R version 4.0.5 (2021-03-31)
> >      >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> >      >> > Running under: Windows 10 x64 (build 19042)
> >      >> >
> >      >> > Matrix products: default
> >      >> >
> >      >> > locale:
> >      >> > [1] C
> >      >> > system code page: 1252
> >      >> >
> >      >> > attached base packages:
> >      >> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >      >> >
> >      >> > loaded via a namespace (and not attached):
> >      >> > [1] compiler_4.0.5
> >      >> >> file.exists(x)
> >      >> > Error in file.exists(x) : file name conversion problem -- name too long?
> >      >> > Execution halted
> >      >> >
> >      >> > Output I got from R-devel was
> >      >> >
> >      >> >> sessionInfo()
> >      >> > R Under development (unstable) (2021-04-26 r80229)
> >      >> > Platform: x86_64-w64-mingw32/x64 (64-bit)
> >      >> > Running under: Windows 10 x64 (build 19042)
> >      >> >
> >      >> > Matrix products: default
> >      >> >
> >      >> > locale:
> >      >> > [1] C
> >      >> >
> >      >> > attached base packages:
> >      >> > [1] stats     graphics  grDevices utils     datasets  methods   base
> >      >> >
> >      >> > loaded via a namespace (and not attached):
> >      >> > [1] compiler_4.2.0
> >      >> >> file.exists(x)
> >      >> > [1] FALSE
> >      >> >
> >      >> > I also observed similar results when using normalizePath instead of
> >      >> > file.exists (error in R-4.0.5, no error in R-devel).
> >      >> >
> >      >> >> normalizePath(x) #R-4.0.5
> >      >> > Error in path.expand(path) : unable to translate 'p'
> >      >> > | p'p;
> >      >> > | p'p<
> >      >> > | p'p=
> >      >> > | p'p>
> >      >> > | p'p<bf>
> >      >> > ' to UTF-8
> >      >> > Calls: normalizePath -> path.expand
> >      >> > Execution halted
> >      >> >
> >      >> >> normalizePath(x) #R-devel
> >      >> > [1] "C:\\Users\\th798\\R\\\360\237\247\222\n|
> >      >> > \360\237\247\222\360\237\217\273\n| \360\237\247\222\360\237\217\274\n|
> >      >> > \360\237\247\222\360\237\217\275\n| \360\237\247\222\360\237\217\276\n|
> >      >> > \360\237\247\222\360\237\217\277\n"
> >      >> > Warning message:
> >      >> > In normalizePath(path.expand(path), winslash, mustWork) : path[1]="🧒
> >      >> > | 🧒🏻
> >      >> > | 🧒🏼
> >      >> > | 🧒🏽
> >      >> > | 🧒🏾
> >      >> > | 🧒🏿
> >      >> > ": The filename, directory name, or volume label syntax is incorrect
> >
>
