[Rd] match function causing bad performance when using table function on factors with multibyte characters on Windows

Karl Ove Hufthammer karl at huftis.org
Wed Jan 26 09:31:43 CET 2011

Simon Urbanek wrote:

>> I could *not* reproduce it; that is, ‘table’ is as fast on the non-ASCII
>> factor as it is on the ASCII factor.
> Strange - are you sure you get the right locale names? Make sure it's
> listed in locale -a.

Yes, I managed to reproduce it now, using a locale listed in ‘locale -a’.
There is a performance hit, though *much* smaller than on Windows.

> FWIW if you care about speed you should use tabulate() instead - it's much
> faster and incurs no penalty:

Yes, that's the solution I ended up using:

res = tabulate(x, nbins=nlevels(x)) # nbins needed for levels that don’t occur
names(res) = levels(x)

(Though I’m not sure it’s *guaranteed* that factors are internally stored in a
way that makes this work, i.e., as the integer codes 1, 2, ... for levels 1, 2, ...)
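
A quick sanity check of that assumption, as a minimal sketch rather than a proof: the factor below is made-up example data (non-ASCII levels written as \u escapes, plus one level that never occurs), and the comparison against table() only shows that the two approaches agree for this input. As far as I can tell from ?factor, the integer codes are an index into levels(x), which is what the tabulate() approach relies on.

```r
# Made-up example data: non-ASCII levels, with one level ("\u00e9") unused
x <- factor(c("\u00e6", "\u00f8", "\u00e6", "\u00e5", "\u00f8", "\u00e6"),
            levels = c("\u00e6", "\u00f8", "\u00e5", "\u00e9"))

# The integer codes index into levels(x); mapping them back recovers the values
codes <- as.integer(x)
stopifnot(identical(levels(x)[codes], as.character(x)))

# Count by integer code; nbins keeps a zero entry for the unused level
res <- tabulate(x, nbins = nlevels(x))
names(res) <- levels(x)

# Compare with table() on the same factor
ref <- table(x)
stopifnot(all(res == as.vector(ref)),
          identical(names(res), names(ref)))
```

Note that tabulate() silently ignores NA values, so a factor containing NAs would need separate handling if those are to be counted.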

Anyway, do you think it’s worth trying to change the ‘table’ function the way I
outlined in my first post¹? This should eliminate the performance hit on all
platforms. However, it will introduce a performance hit (CPU and memory use)
if the elements of ‘exclude’ make up a large part of the factor(s).

¹ http://permalink.gmane.org/gmane.comp.lang.r.devel/26576
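
To illustrate the kind of trade-off mentioned above, here is a rough sketch of a "count everything, then drop the excluded levels" approach. This is only my guess at the general shape of such a change, not the code from the linked post; the helper name count_then_drop and the example data are hypothetical. The cost is visible in the structure: every element is counted, including those belonging to levels that are thrown away afterwards.

```r
# Hypothetical helper (an assumption for illustration, not the proposed patch):
# count all levels by integer code first, then drop the excluded ones
count_then_drop <- function(f, exclude = character(0)) {
  stopifnot(is.factor(f))
  counts <- tabulate(f, nbins = nlevels(f))  # counts every element, excluded or not
  names(counts) <- levels(f)
  counts[!(names(counts) %in% exclude)]      # discard the excluded levels afterwards
}

# Made-up example data
f <- factor(c("a", "b", "a", "c", "b", "a"))
r <- count_then_drop(f, exclude = "b")
print(r)
```

If most elements belong to excluded levels, most of the counting work (and the full-length counts vector) is wasted, which is the performance hit described above.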

Karl Ove Hufthammer
