[R] naive question
vograno at evafunds.com
Fri Jul 2 03:49:49 CEST 2004
Thank you for the analysis. I don't think there is an inconsistency
between the factor of 4 you've found in your example and 20 - 50 I found
in my data. I guess the major cause of the difference lies with the
structure of your data set. Specifically, your test data set differs
from mine in two respects:
* you have fewer lines, but each line contains many more fields (12500 *
800 in your case and 3.8M * 10 in mine)
* all of your data fields are doubles, not strings. I have a mixture of
doubles and strings.
I posted a more technical message to r-devel where I discussed possible
reasons for the IO slowness. One of them is that R is slow at making
strings. So if you try to read your data as strings,
colClasses=rep("character", 800), I'd guess you will see a very
different timing. Even a simple reshaping of your matrix, say to
(12500*80) rows by 10 columns, will considerably worsen the timing.
Please let me know the results if you do anything of the above.
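
A minimal sketch of the two experiments suggested above, on a small synthetic file rather than the 100MB data set. The sizes, the file names and the helper read_as() are assumptions made for illustration, and timings at this scale only hint at the trend:

```r
# Sketch only: small synthetic data so that it runs in a moment.
set.seed(1)
nr <- 200; nc <- 80                       # hypothetical sizes
m <- matrix(sample.int(999999999, nr * nc, replace = TRUE), nrow = nr)

wide <- tempfile(); narrow <- tempfile()
write.table(m, wide, row.names = FALSE, col.names = FALSE)

# The same numbers in the same order, reshaped to 10 columns per line.
m10 <- matrix(t(m), ncol = 10, byrow = TRUE)
write.table(m10, narrow, row.names = FALSE, col.names = FALSE)

# Hypothetical helper: an "informed" read.table call for a given column class.
read_as <- function(file, class, ncol, nrow) {
  read.table(file, header = FALSE, quote = "", row.names = NULL,
             colClasses = rep(class, ncol), nrows = nrow, comment.char = "")
}

t_num  <- system.time(x_num  <- read_as(wide,   "numeric",   nc, nr))
t_chr  <- system.time(x_chr  <- read_as(wide,   "character", nc, nr))
t_long <- system.time(x_long <- read_as(narrow, "numeric",   10, nr * nc / 10))

# Elapsed seconds for each variant.
print(c(numeric   = t_num[["elapsed"]],
        character = t_chr[["elapsed"]],
        reshaped  = t_long[["elapsed"]]))
```

To get meaningful numbers, raise nr and nc towards the sizes discussed in this thread.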
In my message to r-devel you may also find some timing that supports my
> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of
> Richard A. O'Keefe
> Sent: Thursday, July 01, 2004 5:22 PM
> To: r-help at stat.math.ethz.ch
> Subject: RE: [R] naive question
> As part of a continuing thread on the cost of loading large
> amounts of data into R,
> "Vadim Ogranovich" <vograno at evafunds.com> wrote:
> R's IO is indeed 20 - 50 times slower than that of
> equivalent C code
> no matter what you do, which has been a pain for some of us.
> I wondered to myself just how bad R is at reading, when it is
> given a fair chance. So I performed an experiment.
> My machine (according to "Workstation Info") is a SunBlade
> 100 with 640MB of physical memory running SunOS 5.9 Generic,
> according to fpversion this is an Ultra2e with the CPU clock
> running at 500MHz and the main memory clock running at 84MHz
> (wow, slow memory). R.version is
> platform sparc-sun-solaris2.9
> arch sparc
> os solaris2.9
> system sparc, solaris2.9
> major 1
> minor 9.0
> year 2004
> month 04
> day 12
> language R
> and although this is a 64-bit machine, it's a 32-bit
> installation of R.
> The experiment was this:
> (1) I wrote a C program that generated 12500 rows of 800 columns, the
> numbers were integers 0..999,999,999 generated using drand48().
> These numbers were written using printf(). It is possible to do
> quite a bit better by avoiding printf(), but that would ruin the
> spirit of the comparison, which is to see what can be done with
> *straightforward* code using *existing* library functions.
> 21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.
> The sizes were chosen to get 100MB; the actual size was
> 12500 (lines) 10000000 (words) 100012500 (bytes)
> (2) I wrote a C program that read these numbers using
> scanf("%d"); it
> "knew" there were 800 numbers per row and 12500 rows in all.
> Again, it is possible to do better by avoiding scanf(), but the
> point is to look at *straightforward* code.
> 18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.
> (3) I started R, played around a bit doing other things, then
> issued this
> > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
> + row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
> + comment.char=""))
> So how long _did_ it take to read 100MB on this machine?
> 71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
> The result: the R/C ratio was less than 4, whether you
> measure cpu time or real time. It certainly wasn't anywhere
> near 20-50 times slower.
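
For reference, the ratios follow directly from the timings quoted in steps (2) and (3). The variable names below are only illustrative:

```r
# Timings copied from the quoted message, in seconds.
c_cpu <- 19.0; c_real <- 100   # step (2): C reader using scanf()
r_cpu <- 73.5; r_real <- 353   # step (3): read.table with colClasses and nrows

ratios <- round(c(cpu = r_cpu / c_cpu, real = r_real / c_real), 2)
print(ratios)   # cpu is about 3.87, real is 3.53
```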
> Of course, *binary* I/O in C *would* be quite a bit faster:
> (1') generate same integers but write a row at a time using fwrite():
> 5 seconds cpu, 25 seconds real; 40 MB.
> (2') read same integers a row at a time using fread()
> 0.26 seconds cpu, 1 second real.
> This would appear to more than justify "20-50 times slower",
> but reading binary data and reading data in a textual
> representation are different things, so "less than 4 times
> slower" is the fairer measure. However, it does emphasise
> the usefulness of problem-specific bulk reading techniques.
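
A rough R-side counterpart of the fwrite()/fread() experiment, using writeBin() and readBin(). This is a sketch under assumed sizes and a temporary file, not the program used in the quoted test:

```r
# Sketch only: small hypothetical sizes; the thread used 12500 x 800.
set.seed(1)
nr <- 100; nc <- 80
x <- sample.int(999999999, nr * nc, replace = TRUE)   # integers, row by row

f <- tempfile()
con <- file(f, "wb")
writeBin(x, con, size = 4)        # 4 bytes per integer, native byte order
close(con)

con <- file(f, "rb")
y <- matrix(readBin(con, what = "integer", n = nr * nc, size = 4),
            nrow = nr, byrow = TRUE)
close(con)

bytes <- file.size(f)             # 4 * nr * nc bytes on disk
```

At 4 bytes per value, 10,000,000 integers occupy 40 MB, which matches the file size reported for step (1').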
> I thought I'd give you another R measurement:
> > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
> But I got sick of waiting for it, and killed it after 843 cpu seconds,
> 3075 real seconds. Without knowing how far it had got, one
> can say no more than that this is at least 10 times slower
> than the more informed call to read.table.
> What this tells me is that if you know something about the
> data that you _could_ tell read.table about, you do yourself
> no favour by keeping read.table in the dark. All those
> options are there for a reason, and it *will* pay to use them.
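
One common way to keep read.table informed when the column types are not known in advance is to probe a few rows first. The sketch below assumes a small synthetic file with made-up columns, and assumes the first rows are representative of the whole file, which real data may not satisfy:

```r
# Sketch only: a hypothetical mixed file of integers, doubles and strings.
set.seed(1)
f <- tempfile()
write.table(data.frame(id = 1:1000,
                       x  = rnorm(1000),
                       s  = sample(c("buy", "sell"), 1000, replace = TRUE)),
            f, row.names = FALSE, col.names = FALSE, quote = FALSE)

# Probe a handful of rows to guess the column classes.
probe   <- read.table(f, header = FALSE, nrows = 5, stringsAsFactors = FALSE)
classes <- vapply(probe, function(col) class(col)[1], character(1))

# Count the lines so that nrows can be supplied as well.
n <- length(count.fields(f, quote = "", comment.char = ""))

# The informed call: classes, row count, no quote or comment processing.
full <- read.table(f, header = FALSE, quote = "", row.names = NULL,
                   colClasses = classes, nrows = n, comment.char = "")
```

If a column only looks like an integer in the first few rows, the full read will fail or need "numeric" instead, so the probed classes are worth checking by eye.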