[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

Tue Jan 25 20:14:56 CET 2011

On Jan 25, 2011, at 5:49 AM, Karl Ove Hufthammer wrote:

> Matthew Dowle wrote:
> 
>> I'm not sure, but note the difference in locale between
>> Linux (UTF-8) and Windows (non UTF-8). As far as I
>> understand it R much prefers UTF-8, which Windows doesn't
>> natively support. Otherwise you could just change your
>> Windows locale to a UTF-8 locale to make R happier.
>> 
> [...]
>> 
>> If anybody knows a way to trick R on Linux into thinking it has
>> an encoding similar to Windows then I may be able to take a
>> look if I can reproduce the problem in Linux.
> 
> Changing the locale to an ISO 8859-1 locale, i.e.:
> 
> export LC_ALL="en_US.ISO-8859-1"
> export LANG="en_US.ISO-8859-1"
> 
> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII 
> factor as it is on the ASCII factor.
> 

Strange - are you sure you get the right locale names? Make sure it's listed in locale -a. The above works on my Mac, but on my Linux system I have to use LANG=en_US.iso88591, and it *is* replicable, albeit with a much smaller hit:

> benchmark( table(x.num), table(x.fac.ascii), table(x.fac.nascii), table(unclass(x.fac.nascii)), replications=20 )
                          test replications elapsed relative user.self sys.self user.child sys.child
4 table(unclass(x.fac.nascii))           20   1.028 2.269316     1.020    0.004          0         0
2           table(x.fac.ascii)           20   0.453 1.000000     0.452    0.004          0         0
3          table(x.fac.nascii)           20   2.683 5.922737     2.684    0.000          0         0
1                 table(x.num)           20   1.028 2.269316     1.020    0.008          0         0
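
Since a misspelled locale name may be silently ignored, it can help to confirm from inside R which locale the session actually ended up with before comparing timings. A minimal sketch using only base R (the variable names ctype and info are just for illustration):

```r
# Report the character-type locale that R is actually running under.
ctype <- Sys.getlocale("LC_CTYPE")

# l10n_info() says whether the session is multibyte, UTF-8 or Latin-1.
info <- l10n_info()

cat("LC_CTYPE :", ctype, "\n")
cat("UTF-8    :", info[["UTF-8"]], "\n")
cat("Multibyte:", info[["MBCS"]], "\n")
cat("Latin-1  :", info[["Latin-1"]], "\n")
```

If the UTF-8 entry is still TRUE after exporting an ISO 8859-1 locale, the requested locale was probably not picked up.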

The main reason is that table() calls factor(), which calls as.character(), which means 10^5 character conversions - a bad idea in that case. Why the penalty is so much higher on Windows I can't answer at the moment, as I'm not on a machine with a Windows VM.
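
To see the conversion in isolation, here is a rough sketch. The data construction is an assumption (the original test vectors are not shown in this message); only the names x.fac.ascii and x.fac.nascii are taken from the benchmark above. It shows that as.character() on a factor produces one string per element, while the integer codes are available without touching any strings:

```r
set.seed(42)
n <- 1e5                                        # assumed size, matching the 10^5 above

# Assumed test data: three ASCII and three non-ASCII levels.
lev.ascii  <- c("a", "b", "c")
lev.nascii <- c("\u00e6", "\u00f8", "\u00e5")   # ae, o-slash, a-ring
codes <- sample(3L, n, replace = TRUE)
x.fac.ascii  <- factor(lev.ascii[codes],  levels = lev.ascii)
x.fac.nascii <- factor(lev.nascii[codes], levels = lev.nascii)

# as.character() expands the factor to n strings ...
chr <- as.character(x.fac.nascii)
length(chr)

# ... which are just the levels indexed by the integer codes.
same <- identical(chr, levels(x.fac.nascii)[as.integer(x.fac.nascii)])

# Rough timings; the numbers depend on platform and locale.
system.time(for (i in 1:20) as.character(x.fac.nascii))
system.time(for (i in 1:20) as.integer(x.fac.nascii))
```

Any later step that has to match those n strings back against the levels then pays the per-string comparison cost, which is where the encoding presumably comes into play.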

FWIW if you care about speed you should use tabulate() instead - it's much faster and incurs no penalty:

>  benchmark( tabulate(x.num), tabulate(x.fac.ascii), tabulate(x.fac.nascii), tabulate(unclass(x.fac.nascii)), replications=20 )
                             test replications elapsed relative user.self sys.self user.child sys.child
4 tabulate(unclass(x.fac.nascii))           20   0.027 1.421053     0.024        0          0         0
2           tabulate(x.fac.ascii)           20   0.023 1.210526     0.024        0          0         0
3          tabulate(x.fac.nascii)           20   0.024 1.263158     0.020        0          0         0
1                 tabulate(x.num)           20   0.019 1.000000     0.020        0          0         0
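
Note that tabulate() returns a bare integer vector without names. If the level labels are needed, they can be attached afterwards. A minimal sketch with a small made-up factor (x and counts are illustrative names, not from the benchmark):

```r
# Small made-up factor for illustration.
x <- factor(c("b", "a", "b", "c", "b"))

# Count the integer codes directly; nbins makes sure every level gets a
# slot even if it does not occur in the data.
counts <- tabulate(x, nbins = nlevels(x))

# Attach the level labels once, after counting.
names(counts) <- levels(x)
counts

# Same counts as table(), just without the "table" class and dimnames.
agrees <- all(counts == as.vector(table(x)))
```

This way the labels are touched only once per level rather than once per element.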

Cheers,
Simon