[Rd]  R-4.3 version list.files function could not work correctly in chinese
    Tomas Kalibera 
       
    Tue Aug 15 08:38:11 CEST 2023
    
    
  
On 8/13/23 13:16, Ivan Krylov wrote:
> Found it! Looks like a buffer length problem. This isn't limited to
> Chinese, just more likely to happen when a character takes three bytes
> to represent in UTF-8. (Any filename containing characters which take
> more than one byte to represent in UTF-8 may fail.)
>
> If a directory contains a file with a sufficiently long name,
> FindNextFile() fails with ERROR_MORE_DATA (0xEA, 234), making
> R_readdir() return NULL, stopping list_files() prematurely:
>
> # everything seems to work fine...
>
> list.files("测试文件")
> # [1] "测试中文-non-utf8-ЪЪЪЪЪ
> 测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文.txt"
> # [2] "测试中文-non-utf8-ЪЪЪЪЪ.txt"
> # [3] "测试中文-utf-8.txt"
>
> # now create a file with an even longer name
>
> list.files("测试文件")
> # [1] "测试中文-non-utf8-ЪЪЪЪЪ
> 测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文测试中文.txt"
>
> # the files are still there, but not visible to list.files():

Thanks, Ivan. Could you please turn this into a complete, minimal
reproducible example, ideally using only ASCII characters (if that is
enough to trigger the problem)? Failing that, any reproducible example
would do. I will have a look later today.

>
> system("cmd /c dir /s *.txt")
> #  Volume in drive C has no label.
> #  Volume Serial Number is A85A-AA74
> #
> #  Directory of C:\R\R-4.3.1\bin\x64\????
> #
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????
> ????????????????????????????????????????????????????.txt
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????
> ????????????????????????????????????????????????????????????????????????????????????????????????????????.txt
> # 08/12/2023  07:57 AM                22 ????-non-utf8-?????.txt
> # 08/12/2023  07:56 AM                18 ????-utf-8.txt
> # 4 File(s)             84 bytes
> #
> #       Total Files Listed:
> #                4 File(s)             84 bytes
> #                0 Dir(s)  29,281,538,048 bytes free
> # [1] 0
>
> Increasing the path length limits [*] doesn't help, since it's the
> filename length limit that we're bumping against. While both
> WIN32_FIND_DATAA and WIN32_FIND_DATAW contain fixed-size buffers, a
> valid filename may take more than MAX_PATH bytes to represent in UTF-8
> while still being under the limit of MAX_PATH wide characters. This may
> mean having to rewrite list_files in terms of R_wopendir()/R_wreaddir()
> for Windows. As a workaround, we may use the short filename (which
> sometimes may not exist, alas) when FindNextFile() fails with
> ERROR_MORE_DATA.

I admit I didn't follow your analysis. However, I rewrote this code
for R 4.3 to support long paths (when enabled in the system); see
https://blog.r-project.org/2023/03/07/path-length-limit-on-windows/index.html
for details. As this was reported to be a regression in 4.3, it is entirely
possible that the rewrite introduced it (though it is a bit surprising that
our testing did not catch it earlier), so it would be a great help if I
could have the example and debug it.
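
To make the request concrete, here is a rough sketch of the kind of
example meant. It assumes the byte-length explanation quoted above is
right, i.e. that a file name well under MAX_PATH (260) wide characters
can still need more than 260 bytes in UTF-8. The directory prefix, the
file names and the choice of 100 repetitions are made up for
illustration, and the file system part has not been checked against
R 4.3.1 on Windows.

```r
# Hypothetical file name: 100 copies of one CJK character plus ".txt".
# The character U+6D4B takes 3 bytes in UTF-8.
name <- paste0(strrep("\u6d4b", 100), ".txt")

n_chars  <- nchar(name, type = "chars")            # 104 characters
n_bytes  <- nchar(enc2utf8(name), type = "bytes")  # 304 bytes in UTF-8
max_path <- 260L                                   # MAX_PATH in the Windows headers

cat("characters:", n_chars, " UTF-8 bytes:", n_bytes, "\n")
cat("under MAX_PATH as characters:", n_chars < max_path, "\n")
cat("under MAX_PATH as UTF-8 bytes:", n_bytes < max_path, "\n")

# Only touch the file system on Windows; other systems often cap a
# single file name at 255 bytes and would refuse to create the file.
if (.Platform$OS.type == "windows") {
  d <- tempfile("longname")   # hypothetical scratch directory
  dir.create(d)
  file.create(file.path(d, "short.txt"))
  file.create(file.path(d, name))
  # If the analysis holds, the listing may stop early here,
  # even though the long-named file exists on disk.
  print(list.files(d))
  print(file.exists(file.path(d, name)))
  unlink(d, recursive = TRUE)
}
```

The first part only does the arithmetic and runs anywhere; the guarded
part is what would need to be run on an affected Windows machine.
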
Thanks,
Tomas
    
    