[Rd] [R] data.frame() size

Prof Brian Ripley ripley at stats.ox.ac.uk
Mon Dec 12 13:06:20 CET 2005


Data frames have unique row names *by definition* (White Book p.57).

Note that R is extensible, so any package writer has (for 14 years since 
the White Book) been entitled to assume that.  A minimum test suite is to 
run R CMD check on all CRAN packages, and to read all the relevant 
documentation.  That would reveal a large number of uses of row names and 
of their uniqueness.
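
As a minimal sketch of the kind of assumption that package code can make, the
following uses a made-up three-row data frame (the names df, res and the
"r1".."r3" labels are illustrative only) to show that duplicated row names are
refused, and that unique row names can then serve as keys for row lookup:

```r
# Illustrative toy data; not taken from any package or from this thread.
df <- data.frame(a = 1:3, b = c("x", "y", "z"))

# Assigning duplicated row names signals an error. Capture it instead of
# stopping, and silence the accompanying warning about non-unique values.
res <- suppressWarnings(tryCatch(
  {
    row.names(df) <- c("r1", "r1", "r2")
    "accepted"
  },
  error = function(e) "rejected"
))
print(res)

# Unique row names are accepted and identify exactly one row each.
row.names(df) <- c("r1", "r2", "r3")
print(df["r2", "a"])
```

Code that indexes rows by name in this way depends on each name matching a
single row.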

On Mon, 12 Dec 2005, Matthew Dowle wrote:

>
> I guess the mailing list strips attachments, which makes sense. I have sent
> the modified source directly to anyone who has asked.
>
> I had a look at the light-weight data.frame class post
> (http://tolstoy.newcastle.edu.au/R/devel/05/05/0837.html) :
>
>> Now the transcript itself:
>> # the motivation: subscription of a data.frame is *much* (almost 20 times)
>> # slower than that of a list
>> # compare
>> n = 1e6
>> i = seq(n)
>> x = data.frame(a=seq(n), b=seq(n))
>> system.time(x[i,], gcFirst=TRUE)
> [1] 1.01 0.14 1.14 0.00 0.00
>>
>> x = list(a=seq(n), b=seq(n))
>> system.time(lapply(x, function(col) col[i]), gcFirst=TRUE)
> [1] 0.06 0.00 0.06 0.00 0.00
>>
>> # the solution: define methods for the light-weight data.frame class
>> lwdf = function(...) structure(list(...), class = "lwdf")
>> ...
>
> But if I have understood correctly, the time difference here is just down
> to the row names. The row names are 1:n stored in character form. They take
> the most time and space in this example, but are never used. In fact, I'm
> not sure why 1:n in character form would ever be useful. Running the example
> above with my modifications appears to fix the problem, i.e. the time
> difference becomes negligible. I needed to make a one-line change to
> [.data.frame, and I've sent that to anyone who requested the code.
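
A rough way to see the cost described above, using only base R: compare the
size of 1:n held as integers with the same values held as character strings.
The variable names and the choice of n are illustrative, and the exact figures
depend on the R version and platform, so treat the ratio as indicative only.

```r
# Illustrative size; the thread's examples use 1e6.
n <- 1e5

ints  <- seq_len(n)          # 1:n as integers
chars <- as.character(ints)  # 1:n in character form

size_int <- as.numeric(object.size(ints))
size_chr <- as.numeric(object.size(chars))

# How many times larger the character form is than the integer form.
print(size_chr / size_int)
```
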
>
> I can see the problem :
>
>> apropos("data.frame")
>  [1] "[.data.frame"                  "as.matrix.data.frame"          "data.frame"                    "dim.data.frame"
>  [5] "format.data.frame"             "print.data.frame"              ".__C__data.frame"              "aggregate.data.frame"
>  [9] "$<-.data.frame"                "Math.data.frame"               "Ops.data.frame"                "Summary.data.frame"
> [13] "[.data.frame"                  "[<-.data.frame"                "[[.data.frame"                 "[[<-.data.frame"
> [17] "as.data.frame"                 "as.data.frame.AsIs"            "as.data.frame.Date"            "as.data.frame.POSIXct"
> [21] "as.data.frame.POSIXlt"         "as.data.frame.array"           "as.data.frame.character"       "as.data.frame.complex"
> [25] "as.data.frame.data.frame"      "as.data.frame.default"         "as.data.frame.factor"          "as.data.frame.integer"
> [29] "as.data.frame.list"            "as.data.frame.logical"         "as.data.frame.matrix"          "as.data.frame.model.matrix"
> [33] "as.data.frame.numeric"         "as.data.frame.ordered"         "as.data.frame.package_version" "as.data.frame.raw"
> [37] "as.data.frame.table"           "as.data.frame.ts"              "as.data.frame.vector"          "as.list.data.frame"
> [41] "as.matrix.data.frame"          "by.data.frame"                 "cbind.data.frame"              "data.frame"
> [45] "dim.data.frame"                "dimnames.data.frame"           "dimnames<-.data.frame"         "duplicated.data.frame"
> [49] "format.data.frame"             "is.data.frame"                 "is.na.data.frame"              "mean.data.frame"
> [53] "merge.data.frame"              "print.data.frame"              "rbind.data.frame"              "row.names.data.frame"
> [57] "row.names<-.data.frame"        "rowsum.data.frame"             "split.data.frame"              "split<-.data.frame"
> [61] "stack.data.frame"              "subset.data.frame"             "summary.data.frame"            "t.data.frame"
> [65] "transform.data.frame"          "unique.data.frame"             "unstack.data.frame"            "xpdrows.data.frame"
>>
>
> But I think the changes would be quick to make. Is anything else affected?
> Do any test suites exist to confirm that R hasn't been broken?
> On the face of it, allowing data frames to have null row names seems a small
> change, and would make them consistent with matrices, with large time and
> space benefits. However, I can see the argument for a new class instead, for
> safety. What's the consensus?
>
>
>
> -----Original Message-----
> From: Hin-Tak Leung [mailto:hin-tak.leung at cimr.cam.ac.uk]
> Sent: 09 December 2005 18:41
> To: Gabor Grothendieck
> Cc: Matthew Dowle; r-devel at r-project.org; Peter Dalgaard
> Subject: Re: [Rd] [R] data.frame() size
>
>
> Gabor Grothendieck wrote:
>> There was nothing attached in the copy that came through
>> to me.
>
> I would like to see that patch also.
>
>> By the way, there was some discussion earlier this year
>> on a light-weight data.frame class but I don't think anyone ever
>> posted any code.
>
> It may have been me. I am working on a bit-packed data.frame which only uses
> 2 bits per unit of data, so it is 4 units per RAWSXP. (Work in progress,
> nothing to show.)
>
> So I am very interested to see the patch.
>
> Yes, I spent a couple of weeks reading and learning where all the memory has
> gone in data.frame. The row name/column name allocation is a bit stupid.
> Each row name and each column name is a full R object, so there is a 32 (or
> 28) byte overhead just from managing that, before the STRSXP for the actual
> string, which is another X bytes. So for a 1 x N data.frame with integers
> for content, the content is 4 bytes * N, but the row names/column names are
> 32 * N -ish (a 9x increase). A word is 32 bits on most people's machines, and
> I am counting the extra one because you have to keep the address of each
> SEXPREC somewhere, so it is 7+1 = 8, if I understand it correctly.
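
The arithmetic in the paragraph above can be written out as a short sketch.
It takes the figures stated there as assumptions (4-byte words, 7 header words
plus 1 pointer word per name, 4-byte integer content) and ignores the extra
"X bytes" for the string itself; n is an arbitrary illustrative length.

```r
# Assumptions taken from the paragraph above, not measured values.
word_bytes     <- 4      # 32-bit word
words_per_name <- 7 + 1  # SEXPREC header words plus one pointer to it
n              <- 1000   # hypothetical number of names

content_bytes <- 4 * n                             # integer content
name_bytes    <- words_per_name * word_bytes * n   # per-name overhead

# Total size relative to the content alone.
total_ratio <- (content_bytes + name_bytes) / content_bytes
print(c(content = content_bytes, names = name_bytes, ratio = total_ratio))
```

Under these assumptions the overhead is 32 bytes per name and the total is 9
times the size of the content, which is where the "9x" figure comes from.
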
>
> Here is the relevant comment, quoted verbatim from around line 225 of
> "src/include/Rinternals.h":
>
> /* The generational collector uses a reduced version of SEXPREC as a
>    header in vector nodes.  The layout MUST be kept consistent with
>    the SEXPREC definition.  The standard SEXPREC takes up 7 words on
>    most hardware; this reduced version should take up only 6 words.
>    In addition to slightly reducing memory use, this can lead to more
>    favorable data alignment on 32-bit architectures like the Intel
>    Pentium III where odd word alignment of doubles is allowed but much
>    less efficient than even word alignment. */
>
> Hin-Tak Leung
>
>> On 12/9/05, Matthew Dowle <mdowle at concordiafunds.com> wrote:
>>
>>> Hi,
>>>
>>> Please see below for post on r-help regarding data.frame() and the
>>> possibility of dropping rownames, for space and time reasons. I've
>>> made some changes, attached, and it seems to be working well. I see
>>> the expected space (90% saved) and time (10 times faster) savings.
>>> There are no doubt some bugs, and it needs more work and testing, but I
>>> thought I would post first at this stage.
>>>
>>> Could some changes along these lines be made to R ? I'm happy to help
>>> with testing and further work if required. In the meantime I can work
>>> with overloaded functions, which fix the problems in my case.
>>>
>>> Functions affected:
>>>
>>>  dim.data.frame
>>>  format.data.frame
>>>  print.data.frame
>>>  data.frame
>>>  [.data.frame
>>>  as.matrix.data.frame
>>>
>>> Modified source code attached.
>>>
>>> Regards,
>>> Matthew
>>>
>>>
>>> -----Original Message-----
>>> From: Matthew Dowle
>>> Sent: 09 December 2005 09:44
>>> To: 'Peter Dalgaard'
>>> Cc: 'r-help at stat.math.ethz.ch'
>>> Subject: RE: [R] data.frame() size
>>>
>>>
>>>
>>> That explains it. Thanks. I don't need rownames though, as I'll only
>>> ever use integer subscripts. Is there any way to drop them, or even
>>> better not create them in the first place? The memory saved (90%) by
>>> not having them and 10 times speed up would be very useful. I think I
>>> need a data.frame rather than a matrix because I have columns of
>>> different types in real life.
>>>
>>>
>>>> rownames(d) = NULL
>>>
>>> Error in "dimnames<-.data.frame"(`*tmp*`, value = list(NULL, c("a", "b" :
>>>       invalid 'dimnames' given for data frame
>>>
>>>
>>> -----Original Message-----
>>> From: pd at pubhealth.ku.dk [mailto:pd at pubhealth.ku.dk] On Behalf Of
>>> Peter Dalgaard
>>> Sent: 08 December 2005 18:57
>>> To: Matthew Dowle
>>> Cc: 'r-help at stat.math.ethz.ch'
>>> Subject: Re: [R] data.frame() size
>>>
>>>
>>> Matthew Dowle <mdowle at concordiafunds.com> writes:
>>>
>>>
>>>> Hi,
>>>>
>>>> In the example below why is d 10 times bigger than m, according to
>>>> object.size ? It also takes around 10 times as long to create, which
>>>> fits with object.size() being truthful.  gcinfo(TRUE) also indicates
>>>> a great deal more garbage collector activity caused by data.frame()
>>>> than matrix().
>>>>
>>>> $ R --vanilla
>>>> ....
>>>>
>>>>> nr = 1000000
>>>>> system.time(m<<-matrix(integer(1), nrow=nr, ncol=2))
>>>>
>>>> [1] 0.22 0.01 0.23 0.00 0.00
>>>>
>>>>> system.time(d<<-data.frame(a=integer(nr), b=integer(nr)))
>>>>
>>>> [1] 2.81 0.20 3.01 0.00 0.00                  # 10 times longer
>>>>
>>>>
>>>>> dim(m)
>>>>
>>>> [1] 1000000       2
>>>>
>>>>> dim(d)
>>>>
>>>> [1] 1000000       2                           # same dimensions
>>>>
>>>>
>>>>> storage.mode(m)
>>>>
>>>> [1] "integer"
>>>>
>>>>> sapply(d, storage.mode)
>>>>
>>>>        a         b
>>>> "integer" "integer"                           # same storage.mode
>>>>
>>>>
>>>>> object.size(m)/1024^2
>>>>
>>>> [1] 7.629616
>>>>
>>>>> object.size(d)/1024^2
>>>>
>>>> [1] 76.29482                                  # but 10 times bigger
>>>>
>>>>
>>>>> sum(sapply(d, object.size))/1024^2
>>>>
>>>> [1] 7.629501                                  # or is it?    If it's not
>>>> really 10 times bigger, why 10 times longer above?
>>>
>>> Row names!!
>>>
>>>
>>>
>>>> r <- as.character(1:1e6)
>>>> object.size(r)
>>>
>>> [1] 72000056
>>>
>>>> object.size(r)/1024^2
>>>
>>> [1] 68.6646
>>>
>>> 'nuff said?
>>>
>>> --
>>>  O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
>>> c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
>>> (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
>>> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907
>>>
>>>
>>>
>>>
>>> ______________________________________________
>>> R-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-devel
>>>
>>>
>>>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595

