[R] naive question

Fri Jul 2 03:49:49 CEST 2004

Richard,

 Thank you for the analysis. I don't think there is an inconsistency
between the factor of 4 you found in your example and the factor of
20 - 50 I found in my data. I guess the major cause of the difference
lies in the structure of your data set. Specifically, your test data set
differs from mine in two respects:
* you have fewer lines, but each line contains many more fields (12500 *
800 in your case and 3.8M * 10 in mine)
* all of your data fields are doubles, not strings. I have a mixture of
doubles and strings.

I posted a more technical message to r-devel where I discussed possible
reasons for the IO slowness. One of them is that R is slow at making
strings. So if you try to read your data as strings, with
colClasses=rep("character", 800), I'd guess you will see very
different timings. Even a simple reshaping of your matrix, say into
(12500*80) rows by 10 columns, will make the timing considerably worse.
Please let me know the results if you try either of the above.
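
A scaled-down sketch of how the two variations above could be set up
next to the all-numeric read. It is only an illustration: the 200 x 80
size, the temporary files, and the helper name time_read are
assumptions chosen so that it runs in a moment, and the sizes need to
be scaled up before the timings mean anything.

```r
# Illustration only: sizes, file names and time_read() are assumptions.
set.seed(1)
nr <- 200; nc <- 80
m <- matrix(floor(runif(nr * nc) * 1e9), nrow = nr, ncol = nc)

wide <- tempfile(fileext = ".dat")   # nr rows by nc columns
long <- tempfile(fileext = ".dat")   # same values, 10 columns per row
write.table(m, wide, row.names = FALSE, col.names = FALSE)
write.table(matrix(t(m), ncol = 10, byrow = TRUE), long,
            row.names = FALSE, col.names = FALSE)

# Hypothetical helper: read a file using one class for every column and
# return both the data and the system.time() result.
time_read <- function(file, class, ncols, nrows) {
  elapsed <- system.time(
    x <- read.table(file, header = FALSE, quote = "", row.names = NULL,
                    colClasses = rep(class, ncols), nrows = nrows,
                    comment.char = "")
  )
  list(data = x, time = elapsed)
}

as_num  <- time_read(wide, "numeric",   nc, nr)
as_chr  <- time_read(wide, "character", nc, nr)
as_long <- time_read(long, "numeric",   10, nr * nc / 10)

# Elapsed seconds for each variant.
print(sapply(list(numeric = as_num, character = as_chr, reshaped = as_long),
             function(r) r$time[["elapsed"]]))
```

The reshaped file holds the same values in the same order, so any
difference between the first and third timings comes from the row and
column layout alone.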

In my message to r-devel you may also find some timings that support my
estimates.

Thanks,
Vadim

> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Richard A. O'Keefe
> Sent: Thursday, July 01, 2004 5:22 PM
> To: r-help at stat.math.ethz.ch
> Subject: RE: [R] naive question
> 
> As part of a continuing thread on the cost of loading large 
> amounts of data into R,
> 
> "Vadim Ogranovich" <vograno at evafunds.com> wrote:
> 	R's IO is indeed 20 - 50 times slower than that of equivalent C
> 	code no matter what you do, which has been a pain for some of us.
> 
> I wondered to myself just how bad R is at reading, when it is 
> given a fair chance.  So I performed an experiment.
> My machine (according to "Workstation Info") is a SunBlade 
> 100 with 640MB of physical memory running SunOS 5.9 Generic; 
> according to fpversion this is an Ultra2e with the CPU clock 
> running at 500MHz and the main memory clock running at 84MHz 
> (wow, slow memory).  R.version is
> platform sparc-sun-solaris2.9
> arch     sparc               
> os       solaris2.9          
> system   sparc, solaris2.9   
> status                       
> major    1                   
> minor    9.0                 
> year     2004                
> month    04                  
> day      12                  
> language R                   
> and although this is a 64-bit machine, it's a 32-bit 
> installation of R.
> 
> The experiment was this:
> (1) I wrote a C program that generated 12500 rows of 800 columns, the
>     numbers were integers 0..999,999,999 generated using drand48().
>     These numbers were written using printf().  It is possible to do
>     quite a bit better by avoiding printf(), but that would ruin the
>     spirit of the comparison, which is to see what can be done with
>     *straightforward* code using *existing* library functions.
> 
>     21.7 user + 0.9 system = 22.6 cpu seconds; 109 real seconds.
> 
>     The sizes were chosen to get 100MB; the actual size was
>     12500 (lines) 10000000 (words) 100012500 (bytes)
> 
> (2) I wrote a C program that read these numbers using scanf("%d"); it
>     "knew" there were 800 numbers per row and 12500 rows in all.
>     Again, it is possible to do better by avoiding scanf(), but the
>     point is to look at *straightforward* code.
> 
>     18.4 user + 0.6 system = 19.0 cpu seconds; 100 real seconds.
> 
> (3) I started R, played around a bit doing other things, then issued
>     this command:
> 
>     > system.time(xx <- read.table("/tmp/big.dat", header=FALSE, quote="",
>     + row.names=NULL, colClasses=rep("numeric",800), nrows=12500,
>     + comment.char=""))
> 
>     So how long _did_ it take to read 100MB on this machine?
> 
>     71.4 user + 2.2 system = 73.5 cpu seconds; 353 real seconds.
> 
> The result:  the R/C ratio was less than 4, whether you 
> measure cpu time or real time.  It certainly wasn't anywhere 
> near 20-50 times slower.
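
For reference, the ratios follow directly from the figures reported in
steps (2) and (3). The variable names below are only labels for those
quoted numbers.

```r
# Timings copied from steps (2) and (3) of the quoted message, in seconds.
c_scanf    <- c(cpu = 19.0, real = 100)
r_informed <- c(cpu = 73.5, real = 353)

ratio <- r_informed / c_scanf
print(round(ratio, 2))   # cpu 3.87, real 3.53
```
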
> 
> Of course, *binary* I/O in C *would* be quite a bit faster:
> (1') generate same integers but write a row at a time using fwrite():
>      5 seconds cpu, 25 seconds real; 40 MB.
> 
> (2') read same integers a row at a time using fread()
>      0.26 seconds cpu, 1 second real.
> 
> This would appear to more than justify "20-50 times slower", 
> but reading binary data and reading data in a textual 
> representation are different things; "less than 4 times 
> slower" is the fairer measure.  However, it does emphasise 
> the usefulness of problem-specific bulk reading techniques.
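
For anyone who wants to try the binary route from within R, here is a
rough analogue of (1') and (2') that uses writeBin() and readBin() in
place of the C fwrite() and fread(). It is a sketch at a reduced size,
not the program timed above; the 1000 x 80 dimensions and the temporary
file are assumptions.

```r
# R analogue of the row-at-a-time binary write and read; sizes are assumptions.
set.seed(1)
nr <- 1000; nc <- 80
x <- matrix(sample.int(999999999L, nr * nc, replace = TRUE),
            nrow = nr, ncol = nc)

bin <- tempfile(fileext = ".bin")

# Write one row at a time as 4-byte integers, as in (1').
con <- file(bin, open = "wb")
for (i in seq_len(nr)) writeBin(x[i, ], con, size = 4L)
close(con)

# Read the rows back one at a time, as in (2').
y <- matrix(0L, nrow = nr, ncol = nc)
con <- file(bin, open = "rb")
for (i in seq_len(nr)) {
  y[i, ] <- readBin(con, what = "integer", n = nc, size = 4L)
}
close(con)

identical(x, y)   # the round trip preserves every value
file.size(bin)    # nr * nc * 4 bytes
```

At 4 bytes per integer, the full 12500 x 800 data set comes to the
40 MB reported in (1').
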
> 
> I thought I'd give you another R measurement:
> > system.time(xx <- read.table("/tmp/big.dat", header=FALSE))
> But I got sick of waiting for it, and killed it after 843 cpu seconds,
> 3075 real seconds.  Without knowing how far it had got, one 
> can say no more than that this is at least 10 times slower 
> than the more informed call to read.table.
> 
> What this tells me is that if you know something about the 
> data that you _could_ tell read.table about, you do yourself 
> no favour by keeping read.table in the dark.  All those 
> options are there for a reason, and it *will* pay to use them.
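
A small sketch of that comparison, for anyone who wants to repeat it on
their own machine. The 300 x 50 size and the temporary file are
assumptions made so that it finishes quickly; at this scale the timing
difference may be negligible, and the point is only that both calls
return the same values.

```r
# Scaled-down comparison of an uninformed and an informed read.table call.
set.seed(1)
nr <- 300; nc <- 50
f <- tempfile(fileext = ".dat")
write.table(matrix(sample.int(999999999L, nr * nc, replace = TRUE), nr, nc),
            f, row.names = FALSE, col.names = FALSE)

# Uninformed: read.table has to work out the column types and row count.
t_default <- system.time(a <- read.table(f, header = FALSE))

# Informed: column classes, row count, quoting and comments are all given.
t_informed <- system.time(
  b <- read.table(f, header = FALSE, quote = "", row.names = NULL,
                  colClasses = rep("numeric", nc), nrows = nr,
                  comment.char = "")
)

# Same values either way; only the storage type of the columns may differ.
same <- isTRUE(all.equal(as.matrix(a), as.matrix(b), check.attributes = FALSE))
print(same)
print(c(default = t_default[["elapsed"]], informed = t_informed[["elapsed"]]))
```
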
> 