[R] numerical accuracy, dumb question
    Prof Brian Ripley 
    ripley at stats.ox.ac.uk
       
    Sat Aug 14 20:19:23 CEST 2004
    
    
  
On Sat, 14 Aug 2004, Marc Schwartz wrote:
> > object.size("a")
> [1] 44
> 
> > object.size(letters)
> [1] 340
> 
> In the second case, as Tony has noted, the size of letters (a character
> vector) is not 26 * 44.
Of course not.  Both are character vectors, so each has the overhead of any
R object, plus an allocation for pointers to the elements, plus an amount
for each element of the vector (see the end).
These calculations differ on 32-bit and 64-bit machines.  On a 32-bit
machine storage is in units of either 28 bytes (Ncells) or 8 bytes
(Vcells), so single-letter strings are wasteful, viz
> object.size("aaaaaaa")
[1] 44
That is 1 Ncell and 2 Vcells (28 + 2*8 = 44 bytes): 1 Vcell for the string
(7 bytes plus terminator) and 1 for the pointer.
Whereas
> object.size(letters)
[1] 340
has 1 Ncell and 39 Vcells (28 + 39*8 = 340 bytes): 26 for the strings and
13 for the pointers (which fit two to a Vcell).
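As a rough sketch of the arithmetic above, assuming the 32-bit unit sizes quoted here, 4-byte pointers and no sharing between elements (est_size_32bit and vcells_for are made-up helper names, not part of R, and this is not how object.size itself is implemented):

```r
# Assumed 32-bit unit sizes, as quoted in the message.
ncell_bytes   <- 28
vcell_bytes   <- 8
pointer_bytes <- 4   # assumption: two pointers fit in one Vcell

# Number of Vcells needed to hold a given number of bytes.
vcells_for <- function(bytes) ceiling(bytes / vcell_bytes)

# Hypothetical estimate for a character vector with no sharing:
# 1 Ncell + Vcells for the pointers + Vcells for each string.
est_size_32bit <- function(x) {
  stopifnot(is.character(x))
  pointer_vcells <- vcells_for(length(x) * pointer_bytes)
  # each string needs its bytes plus one terminator byte
  string_vcells <- sum(vcells_for(nchar(x, type = "bytes") + 1))
  ncell_bytes + vcell_bytes * (pointer_vcells + string_vcells)
}

est_size_32bit("aaaaaaa")   # 1 Ncell + 2 Vcells
est_size_32bit(letters)     # 1 Ncell + 39 Vcells
```

Under those assumptions the two calls give 44 and 340, matching the figures reported above.  A current build of R will not necessarily report the same numbers from object.size.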
Note that repeated character strings may share storage, so for example
> object.size(rep("a", 26))
[1] 340
is wrong (140, I think: 1 Ncell, 13 Vcells for the pointers and 1 Vcell for
the single shared string).  And that makes comparisons with factors depend
on exactly how they were created: for a character vector there probably is
a lot of sharing.
I have a feeling that these calculations are too low for character vectors,
as each element is a CHARSXP and so may have an Ncell not accounted for by
object.size.  ("May" because of potential sharing.)  Would anyone who is
sure like to confirm or deny this?
It ought to be possible to improve the estimates for character vectors a 
bit as we can detect sharing amongst the elements.
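One way such an adjustment might look, again only as a sketch under the 32-bit sizes quoted earlier.  The function name and the charsxp_ncell argument are hypothetical, and treating equal strings as shared is an assumption: unique() compares values, not storage, so this gives the most sharing that could occur rather than what a given session actually has.

```r
# Hypothetical estimate for a character vector on a 32-bit machine,
# counting each distinct string once (assumes equal strings share storage).
est_size_shared_32bit <- function(x, charsxp_ncell = FALSE) {
  stopifnot(is.character(x))
  ncell_bytes <- 28   # assumed 32-bit sizes from the message
  vcell_bytes <- 8
  u <- unique(x)      # distinct strings by value, not by address
  # 4-byte pointers, two to a Vcell
  pointer_vcells <- ceiling(length(x) * 4 / vcell_bytes)
  # bytes plus terminator for each distinct string
  string_vcells <- sum(ceiling((nchar(u, type = "bytes") + 1) / vcell_bytes))
  # optional: one extra Ncell per distinct CHARSXP, which is the
  # unconfirmed possibility raised above
  extra <- if (charsxp_ncell) ncell_bytes * length(u) else 0
  ncell_bytes + vcell_bytes * (pointer_vcells + string_vcells) + extra
}

est_size_shared_32bit(rep("a", 26))                        # full sharing
est_size_shared_32bit(letters)                             # nothing to share
est_size_shared_32bit(rep("a", 26), charsxp_ncell = TRUE)  # with the extra Ncell
```

With full sharing the first call gives 140, the figure guessed at above, while letters stays at 340 because all its elements are distinct.  If each distinct CHARSXP did cost an extra Ncell, the rep("a", 26) case would come to 168.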
-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
    
    