[R] R Memory Usage Concerns

Evan Klitzke evan at eklitzke.org
Tue Sep 15 06:26:08 CEST 2009


On Mon, Sep 14, 2009 at 8:35 PM, jim holtman <jholtman at gmail.com> wrote:
> When you read your file into R, show the structure of the object:
...

Here's the data I get:

> tab <- read.table("~/20090708.tab")
> str(tab)
'data.frame':	1797601 obs. of  3 variables:
 $ V1: Factor w/ 6 levels "biz_details",..: 4 4 4 4 4 5 6 4 1 4 ...
 $ V2: num  1.25e+09 1.25e+09 1.25e+09 1.25e+09 1.25e+09 ...
 $ V3: num  0.0141 0.0468 0.0137 0.0594 0.0171 ...
> object.size(tab)
35953640 bytes
> gc()
          used (Mb) gc trigger  (Mb) max used  (Mb)
Ncells  119580  6.4    1489330  79.6  2380869 127.2
Vcells 6647905 50.8   17367032 132.5 16871956 128.8

Forcing a GC doesn't free up an appreciable amount of memory (the
usage reported by ps is about the same), but it's encouraging that
object.size() reports the object as small. I am, however, a little
skeptical of that number:

1797601 * (4 + 8 + 8) = 35952020, which is awfully close to 35953640.
My assumption is that the first column is stored as a 32-bit integer
code, plus two 8-byte doubles for the other columns, plus a little
overhead for the object headers and for the factor's level table (the
string -> int mapping from servlet name to its 32-bit code). That
estimate seems too tight to me, though, because it implies R allocated
exactly as much memory for each column as there are values in it;
typically an interpreter like this allocates on power-of-two
boundaries, so the backing store here would be sizeof(obj) << 21
bytes, since 2^21 is the next power of two above 1797601 (this is
roughly how Python lists grow internally, by over-allocating).
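
For what it's worth, a quick way to sanity-check that
back-of-the-envelope number is to ask object.size() about each column
separately (just a sketch against the tab data frame from the str()
output above):

n <- nrow(tab)
n * (4 + 8 + 8)        # 4-byte factor codes + two 8-byte doubles per row
object.size(tab$V1)    # integer codes plus the 6 level strings
object.size(tab$V2)
object.size(tab$V3)
object.size(tab)       # should be roughly the sum of the columns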

Is it possible that R is counting its memory usage naively, e.g. just
adding up the sizes of the constituent objects, rather than the amount
of space it actually allocated for them?
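
One rough way to probe that (a sketch; the ps call assumes a Unix-like
system) would be to compare what object.size() adds up to against what
gc() says the allocator is holding, and against the process RSS:

# total Mb accounted for by objects in the workspace, per object.size()
sum(sapply(ls(), function(x) object.size(get(x)))) / 2^20

# Mb of cons cells + vector cells the R allocator reports as used
sum(gc()[, 2])

# resident set size of the R process itself, in Kb (Unix ps)
system(paste("ps -o rss= -p", Sys.getpid()))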

-- 
Evan Klitzke <evan at eklitzke.org> :wq