[R] numerical accuracy, dumb question

Tony Plate tplate at blackmesacapital.com
Sat Aug 14 15:42:31 CEST 2004


At Friday 08:41 PM 8/13/2004, Marc Schwartz wrote:
>Part of that decision may depend upon how big the dataset is and what is
>intended to be done with the ID's:
>
> > object.size(1011001001001)
>[1] 36
>
> > object.size("1011001001001")
>[1] 52
>
> > object.size(factor("1011001001001"))
>[1] 244
>
>
>They will by default, as Andy indicates, be read and stored as doubles.
>They are too large for integers, at least on my system:
>
> > .Machine$integer.max
>[1] 2147483647
>
>Converting to a character might make sense, with only a minimal memory
>penalty. However, using a factor results in a notable memory penalty, if
>the attributes of a factor are not needed.

That depends on how long the vectors are.  The memory overhead for a factor
is mostly per vector (the levels attribute), with only 4 bytes used for each
additional element whose level already appears.  The memory overhead for
character data is per element -- there is no amortization for repeated values.

 > object.size(factor("1011001001001"))
[1] 244
 > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"), 1)))
[1] 308
 > # bytes per element in factor, for length 4:
 > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"), 1)))/4
[1] 77
 > # bytes per element in factor, for length 1000:
 > object.size(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"), 250)))/1000
[1] 4.292
 > # bytes per element in character data, for length 1000:
 > object.size(as.character(factor(rep(c("1011001001001","111001001001","001001001001","011001001001"), 250))))/1000
[1] 20.028
 >

So, for long vectors with relatively few distinct values, storage as a
factor is far more memory efficient: the character data is stored only once
per level, and each element is just a 4-byte integer index into the levels.
(The sizes above were measured on Windows 2000.)
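
A related point on the doubles question below: an IEEE 754 double represents
every integer exactly up to 2^53, so 13-digit IDs like these are well within
the range where read.table() can hold them as doubles without any loss.  A
quick check (a hypothetical session; exact print formatting may vary by R
version):

 > 1011001001001 > .Machine$integer.max  # too big for a 32-bit integer
[1] TRUE
 > 2^53  # doubles represent every integer exactly up to here
[1] 9007199254740992
 > 1011001001001 == 1011001001001 + 1  # adjacent integers stay distinct
[1] FALSE
 > sprintf("%.0f", 1011001001001)  # print without scientific notation
[1] "1011001001001"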

-- Tony Plate

>If any mathematical operations are to be performed with the IDs, then
>leaving them as doubles makes the most sense.
>
>Dan, more information on the numerical characteristics of your system
>can be found by using:
>
>.Machine
>
>See ?.Machine and ?object.size for more information.
>
>HTH,
>
>Marc Schwartz
>
>
>On Fri, 2004-08-13 at 21:02, Liaw, Andy wrote:
> > If I'm not mistaken, numerics are read in as doubles, so that shouldn't
> > be a problem.  However, I'd try using factor or character.
> >
> > Andy
> >
> > > From: Dan Bolser
> > >
> > > I store an id as a big number, could this be a problem?
> > >
> > > Should I convert to a string when I use read.table(...
> > >
> > > example id's
> > >
> > > 1001001001001
> > > 1001001001002
> > > ...
> > > 1002001002005
> > >
> > >
> > > Biggest is probably
> > >
> > > 1011001001001
> > >
> > > Ta,
> > > Dan.
> > >
>
>______________________________________________
>R-help at stat.math.ethz.ch mailing list
>https://stat.ethz.ch/mailman/listinfo/r-help
>PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html



