[R] Opinion: Why I find factors convenient to use

Rui Barradas ruipbarradas at sapo.pt
Sat Aug 18 00:38:33 CEST 2012


Hello,

Em 17-08-2012 20:27, Bert Gunter escreveu:
> ... so it may be just the way object.size() counts in the two cases, right?

Or maybe it is the way character vectors and factors are stored.
On 64-bit Windows 7 or Ubuntu 12.04, 80k for the character vector looks like 
8 bytes * 1e4 for the pointers plus room for the strings themselves, and 40k 
for the factor looks more like 32-bit ints (4 bytes) * 1e4 in consecutive 
memory locations. I confess to being too lazy to go check the sources, but if 
this is the case then it's another point in favor of factors: they are indeed 
more efficient memory-wise.
And 64-bit OSs will only become more widely used; processors aren't 
getting worse.
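
To make the arithmetic above concrete, here is a minimal sketch that compares 
the reported sizes with the raw pointer and integer storage. The seed and the 
variable names are my own choices, and the exact byte counts will depend on 
the R version and platform, so no particular totals are claimed.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
n <- 1e4
x <- sample(c("small", "medium", "large"), n, replace = TRUE)
y <- factor(x)

size_x <- as.numeric(object.size(x))
size_y <- as.numeric(object.size(y))

# One pointer per element for the character vector;
# 8 bytes on a 64-bit build, 4 bytes on a 32-bit build.
ptr_bytes <- .Machine$sizeof.pointer

# A factor stores one integer code (4 bytes) per element plus its levels.
cat("character:", size_x, "bytes; pointers alone:", ptr_bytes * n, "\n")
cat("factor:   ", size_y, "bytes; integer codes alone:", 4 * n, "\n")
```

Whatever is left over after subtracting the pointers or the integer codes is 
header overhead plus the strings (or the levels attribute) themselves.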

There is also the statistical side of it. Factors are the natural way of 
coding nominal or categorical variables. The small/medium/large example 
is a good one. Or seasons: we like to see Fall or Autumn after Spring 
and Summer, not before. (Btw, does anyone know why M/F?) And this has 
nothing to do with the usefulness of characters; I like persons' names 
to be names, alphabetic.
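
A small sketch of the seasons point, with made-up data: by default factor() 
sorts the levels, while passing levels= explicitly keeps the order we 
actually want in tables and plots.

```r
# Hypothetical data, just for illustration
seasons <- c("Summer", "Autumn", "Winter", "Spring", "Summer")

# Default: levels are the sorted unique values
f_default <- factor(seasons)
levels(f_default)

# Explicit levels: Autumn comes after Spring and Summer, as it should
f_seasons <- factor(seasons,
                    levels = c("Spring", "Summer", "Autumn", "Winter"))
levels(f_seasons)

# table() follows the level order, not the alphabet
table(f_seasons)
```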

I've also made a simple check. Apparently, character vectors are kept as 
a vector of pointers plus a set of unique strings. If we change one of 
the strings, even to something shorter that occupies fewer bytes, 
object.size will report an increase in size. Try x[1] <- "a" and see the 
new size of x. It's bigger, and the number of pointers to strings is the 
same.
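
The check can be sketched like this. The seed is arbitrary, and the reading 
that the increase comes from one extra unique string being counted is my 
guess from the behaviour, not something I verified in the sources.

```r
set.seed(1)                      # arbitrary seed
x <- sample(c("small", "medium", "large"), 1e4, replace = TRUE)

n_before    <- length(x)
uniq_before <- length(unique(x))
size_before <- as.numeric(object.size(x))

x[1] <- "a"                      # shorter than any of the original strings

n_after    <- length(x)
uniq_after <- length(unique(x))
size_after <- as.numeric(object.size(x))

# Same number of elements (pointers), one more distinct string
c(n_before = n_before, n_after = n_after)
c(uniq_before = uniq_before, uniq_after = uniq_after)
size_after - size_before         # difference in reported bytes
```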

For 32- and 64-bit Windows 7 and for 64-bit Ubuntu 12.04, R was:
 > R.version
[...]
version.string R version 2.15.1 (2012-06-22)
nickname       Roasted Marshmallows

Rui Barradas
>
> -- Bert
>
> On Fri, Aug 17, 2012 at 11:42 AM, Peter Langfelder
> <peter.langfelder at gmail.com> wrote:
>> On Fri, Aug 17, 2012 at 11:34 AM, Rui Barradas <ruipbarradas at sapo.pt> wrote:
>>> Hello,
>>>
>>> No, factors may use less memory. System dependent?
>> I think it's a 32-bit vs. 64-bit distinction - I get Rui's results on
>> 64-bit Windows and Linux installation, but Bert's result on a 32-bit
>> Linux machine.
>>
>> Peter
>>
>>>> x <-sample(c("small","medium","large"),1e4,rep=TRUE)
>>>> y <- factor(x)
>>>> object.size(x)
>>> 80184 bytes
>>>> object.size(y)
>>> 40576 bytes
>
>