[R] numerical accuracy, dumb question

Sat Aug 14 19:01:59 CEST 2004

On Sat, 2004-08-14 at 08:42, Tony Plate wrote:
> At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
> >Part of that decision may depend upon how big the dataset is and what is
> >intended to be done with the ID's:
> >
> > > object.size(1011001001001)
> >[1] 36
> >
> > > object.size("1011001001001")
> >[1] 52
> >
> > > object.size(factor("1011001001001"))
> >[1] 244
> >
> >
> >They will by default, as Andy indicates, be read and stored as doubles.
> >They are too large for integers, at least on my system:
> >
> > > .Machine$integer.max
> >[1] 2147483647
> >
> >Converting to a character might make sense, with only a minimal memory
> >penalty. However, using a factor results in a notable memory penalty, if
> >the attributes of a factor are not needed.
> 
> That depends on how long the vectors are.  The memory overhead for factors 
> is per vector, with only 4 bytes used for each additional element (if the 
> level already appears).  The memory overhead for character data is per 
> element -- there is no amortization for repeated values.
> 
>  > object.size(factor("1011001001001"))
> [1] 244
>  > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))
> [1] 308
>  > # bytes per element in factor, for length 4:
>  > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),1)))/4
> [1] 77
>  > # bytes per element in factor, for length 1000:
>  > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250)))/1000
> [1] 4.292
>  > # bytes per element in character data, for length 1000:
>  > object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"),250))))/1000
> [1] 20.028
> 
> So, for long vectors with relatively few different values, storage as 
> factors is far more memory efficient (this is because the character data is 
> stored only once per level, and each element is stored as a 4-byte 
> integer).  (The above was done on Windows 2000).
> 
> -- Tony Plate

Good point, Tony. I was making the perhaps incorrect assumption that
the IDs were unique, or nearly so. As it turns out, though, even that
assumption only partly determines how much memory is required.

What is interesting (and presumably I need to do some more reading on
how R stores objects internally) is that the incremental amount of
memory is not consistent on a per element basis for a given object,
though there is a pattern. It is also dependent upon the size of the new
elements to be added, as I note at the bottom.

This all, of course, presumes that object.size() gives a reasonable
approximation of the amount of memory actually allocated to an object,
and the notes in ?object.size raise at least some doubt about that. It
is a critical assumption for the data below, which were collected on
Fedora Core 2 on a Pentium 4.

For example:

> object.size("a")
[1] 44

> object.size(letters)
[1] 340

In the second case, as Tony has noted, the size of letters (a character
vector of 26 elements) is far less than 26 * 44 = 1144 bytes.
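
A quick way to repeat this comparison on any installation (a sketch only;
the byte counts depend on the R version and platform, so none are shown
here, and the variable names are just for illustration):

```r
# Size of a one-element character vector versus the 26-element `letters`
single <- as.numeric(object.size("a"))
whole  <- as.numeric(object.size(letters))

# If every element cost as much as a standalone vector, `whole` would
# equal 26 * single; the fixed per-vector overhead is only paid once.
cat("one element:       ", single, "bytes\n")
cat("letters:           ", whole, "bytes\n")
cat("26 * one element:  ", 26 * single, "bytes\n")
```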

Now note:

> object.size(c("a", "b"))
[1] 52
> object.size(c("a", "b", "c"))
[1] 68
> object.size(c("a", "b", "c", "d"))
[1] 76
> object.size(c("a", "b", "c", "d", "e"))
[1] 92

The increments alternate between 8 and 16 bytes (8, 16, 8, 16).

Now for a factor:

> object.size(factor("a"))
[1] 236
> object.size(factor(c("a", "b")))
[1] 244
> object.size(factor(c("a", "b", "c")))
[1] 268
> object.size(factor(c("a", "b", "c", "d")))
[1] 276
> object.size(factor(c("a", "b", "c", "d", "e")))
[1] 300

The increments alternate between 8 and 24 bytes (8, 24, 8, 24).

Using elements along the lines of Dan's:

> object.size("1000000000000")
[1] 52
> object.size(c("1000000000000", "1000000000001"))
[1] 68
> object.size(c("1000000000000", "1000000000001", "1000000000002"))
[1] 92
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003"))
[1] 108
> object.size(c("1000000000000", "1000000000001", "1000000000002",
                "1000000000003", "1000000000004"))
[1] 132

The increments alternate between 16 and 24 bytes (16, 24, 16, 24).

For factors:

> object.size(factor("1000000000000"))
[1] 244
> object.size(factor(c("1000000000000", "1000000000001")))
[1] 260
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002")))
[1] 292
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003")))
[1] 308
> object.size(factor(c("1000000000000", "1000000000001",
                       "1000000000002", "1000000000003",
                       "1000000000004")))
[1] 340

The increments alternate between 16 and 32 bytes (16, 32, 16, 32).

So, the incremental size seems to alternate as elements are added. 
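
The increments can be recomputed directly from the sizes reported in the
transcripts above. This is just arithmetic on those reported numbers, not
a new measurement; the list names are made up for the example:

```r
# Sizes in bytes exactly as reported by object.size() above
reported <- list(
  char_short   = c(44, 52, 68, 76, 92),       # "a", "b", ...
  factor_short = c(236, 244, 268, 276, 300),  # factor of "a", "b", ...
  char_long    = c(52, 68, 92, 108, 132),     # 13-digit ID strings
  factor_long  = c(244, 260, 292, 308, 340)   # factor of 13-digit IDs
)

# Bytes added by each additional element
increments <- lapply(reported, diff)
print(increments)
# char_short:   8 16  8 16
# factor_short: 8 24  8 24
# char_long:   16 24 16 24
# factor_long: 16 32 16 32
```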

The behavior above would perhaps suggest that memory is allocated to
objects to enable pairs of elements to be added. When the second element
of the pair is added, only a minimal incremental amount of additional
memory (and presumably time) is required.

However, when I add a "third" element, there is additional memory
required to store that new element because the object needs to be
adjusted in a more fundamental way to handle this new element.
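
One purely hypothetical model that produces this kind of alternation,
assuming 4-byte pointers and integer codes (a 32-bit build) and storage
rounded up to multiples of 8 bytes. None of this is taken from the R
sources, and round8() and increment() are names invented for the sketch:

```r
# Hypothetical model, NOT R's actual allocator.
# Assumption: storage is rounded up to a multiple of 8 bytes.
round8 <- function(bytes) 8 * ceiling(bytes / 8)

# Modelled growth in bytes when going from n to n + 1 elements.
#   n_chars: characters per string (a terminator byte is assumed)
#   slots:   number of 4-byte-per-element vectors that grow with n
#            (assumed 1 for a character vector; 2 for a factor with all
#            values distinct: integer codes plus pointers to the levels)
increment <- function(n, n_chars, slots) {
  string_bytes <- round8(n_chars + 1)
  slot_bytes   <- round8(4 * (n + 1)) - round8(4 * n)
  string_bytes + slots * slot_bytes
}

model <- list(
  char_short   = increment(1:4, n_chars = 1,  slots = 1),
  factor_short = increment(1:4, n_chars = 1,  slots = 2),
  char_long    = increment(1:4, n_chars = 13, slots = 1),
  factor_long  = increment(1:4, n_chars = 13, slots = 2)
)
print(model)
```

Under those assumptions the model gives 8/16 and 16/24 for the character
vectors and 8/24 and 16/32 for the factors, which matches the differences
between the sizes reported above. That is consistent with the "pairs"
idea (two 4-byte slots fit in one 8-byte unit), but it does not confirm
how R really allocates the memory.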

There also appears to be some memory allocation "adjustment" at play
here. Note:

> object.size(factor("1000000000000"))
[1] 244

> object.size(factor("1000000000000", "a"))
[1] 236

In the second case, the amount of memory reported actually declines by 8
bytes. This suggests (to some extent consistent with my thoughts above)
that when the object is initially created, there is space for two new
elements and that space is allocated based upon the size of the first
element. When the second element is added, the space required is
adjusted based upon the actual size of the second element.
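
One caveat on this comparison: the second positional argument of factor()
is `levels`, so factor("1000000000000", "a") does not create a two-element
factor. It creates a one-element factor whose only level is "a" and whose
value is NA, which may be why its reported size equals that of factor("a").
A small check, with f and g as illustrative names only:

```r
# Same call as above: "a" is matched to the `levels` argument
f <- factor("1000000000000", "a")
length(f)   # 1 element
levels(f)   # the single level "a"
is.na(f)    # TRUE, because "1000000000000" is not among the levels

# A two-element factor needs c() around the values
g <- factor(c("1000000000000", "a"))
length(g)   # 2 elements
nlevels(g)  # 2 levels
```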

Again, all of the above presumes that object.size() is reporting correct
information.

Thanks,

Marc