[R] Why are big data.frames slow? What can I do to get it faster?

Marcus Jellinghaus Marcus_Jellinghaus at gmx.de
Mon Oct 7 13:08:54 CEST 2002


First I want to say "thank you" to everybody who replied.
I understand that vectorized operations instead of the loop are faster.
I also made sure not to use factors.

Since the loop runs 100 times in my example, the loop should only take the
time of the vectorized operation multiplied by 100.
But the loop takes about 10 minutes, while the vectorized operation takes about 3
seconds (see below).
Why is that? Shouldn't the loop take at most 100 * 3 seconds = 5 minutes?

I'm interested in this because I expect to have some computations that
are easily vectorizable (like this example) and others that are not
vectorizable, or only with great difficulty.

Marcus Jellinghaus


> print(dim(test)[1])
[1] 500000
> Sys.time()
[1] "2002-10-07 06:17:33 Eastern Sommerzeit"
> test[1:100,6] = paste(test[1:100,2],"-",test[1:100,3], sep = "")
> Sys.time()
[1] "2002-10-07 06:17:35 Eastern Sommerzeit"

[..]

> print(dim(test)[1])
[1] 500000
> Sys.time()
[1] "2002-10-07 06:05:29 Eastern Sommerzeit"
> for(i in 1:100) {
+   test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
+ }
> Sys.time()
[1] "2002-10-07 06:15:17 Eastern Sommerzeit"
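
For anyone who wants to reproduce the comparison, here is a minimal sketch that uses system.time() instead of paired Sys.time() calls. The data frame is synthetic: the column names (from, to, pair) and the row count are assumptions for illustration, not the original data, and absolute timings will depend on the R version and the machine.

```r
# Synthetic stand-in for the original data; names and size are assumptions.
n <- 50000
test <- data.frame(
  from = rep(c("EUR", "USD", "GBP", "JPY"), length.out = n),
  to   = rep(c("USD", "JPY", "CHF", "EUR"), length.out = n),
  pair = character(n),
  stringsAsFactors = FALSE
)

# Loop: one data.frame element assignment per row.
loop_version <- test
t_loop <- system.time(
  for (i in 1:100) {
    loop_version[i, "pair"] <- paste(loop_version[i, "from"], "-",
                                     loop_version[i, "to"], sep = "")
  }
)

# Vectorized: a single assignment covering the same 100 rows.
vec_version <- test
t_vec <- system.time(
  vec_version[1:100, "pair"] <- paste(vec_version[1:100, "from"], "-",
                                      vec_version[1:100, "to"], sep = "")
)

# Both approaches should produce the same data frame.
stopifnot(identical(loop_version, vec_version))

# Elapsed wall-clock seconds for each variant.
print(c(loop = t_loop[["elapsed"]], vectorized = t_vec[["elapsed"]]))
```

Changing n while keeping the 100 edited rows fixed should show how the cost of each variant scales with the size of the frame.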


-----Original Message-----
From: Uwe Ligges [mailto:ligges at statistik.uni-dortmund.de]
Sent: Sunday, October 06, 2002 1:58 PM
To: Marcus Jellinghaus
Cc: r-help at stat.math.ethz.ch
Subject: Re: [R] Why are big data.frames slow? What can I do to get it
faster?


Marcus Jellinghaus wrote:
>
> Hello,
>
> I'm quite new to this list.
> I have a high-frequency dataset with more than 500,000 records.
> I want to edit a data.frame "test". My small program runs fine with a small
> part of the dataset (just 100 records), but it is very slow with a huge
> dataset. Of course it gets slower with more records, but when I change just
> the size of the frame and keep the number of edited records fixed, I see
> that it is also getting slower.
>
> Here is my program:
>
> print(dim(test)[1])
> Sys.time()
> for(i in 1:100) {
>   test[i,6] = paste(test[i,2],"-",test[i,3], sep = "")
> }
> Sys.time()
>
> I join two currency symbols into a currency pair.
> I always calculate only for the first 100 lines.
> When I load just 100 lines into the data.frame "test", it takes 1 second.
> When I load 1,000 lines, editing 100 lines takes 2 seconds,
> 10,000 lines loaded and 100 lines edited takes 5 seconds,
> 100,000 lines loaded and 100 lines edited takes 31 seconds,
> 500,000 lines loaded and 100 lines edited takes 11 minutes(!!!).
>
> My computer has 1 GB RAM, so that shouldn't be the reason.
>
> Of course, I could work with many small data.frames instead of one big one,
> but the program above is just the very first step, so I don't want to split.
>
> Is there a way to edit big data.frames without waiting for a long time?

Well, the point is, I guess, that addressing elements in a large data.frame
reasonably takes much more time than in a small one.

It may be a good idea to use vectorized operations instead of the loop,
which is preferable if your computation is easily vectorizable without a
big penalty in memory consumption:

 test[1:100, 6] <- paste(test[1:100, 2], "-", test[1:100, 3], sep = "")
or
 test[ , 6] <- paste(test[ , 2], "-", test[ , 3], sep = "")
for the whole data.frame.
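
For computations that cannot easily be vectorized, one possible workaround (a sketch under assumptions, not tested on the original data) is to pull the needed columns out into plain vectors, loop over those, and write the result back to the data.frame in a single assignment. The helper make_pair() and the column names below are hypothetical and only stand in for a real per-row computation:

```r
# Hypothetical data; column names are assumptions, not from the original post.
test <- data.frame(
  from = c("EUR", "USD", "GBP"),
  to   = c("USD", "JPY", "CHF"),
  stringsAsFactors = FALSE
)

# Hypothetical stand-in for a per-row computation that is hard to vectorize.
make_pair <- function(a, b) paste(a, "-", b, sep = "")

# Extract the columns once so the loop indexes plain vectors,
# not the data.frame.
from <- test$from
to   <- test$to
pair <- character(nrow(test))

for (i in seq_along(pair)) {
  pair[i] <- make_pair(from[i], to[i])
}

# Write the result back in a single assignment.
test$pair <- pair
```

The loop still runs once per row, but each iteration touches only ordinary character vectors, so the data.frame assignment method is called once instead of once per row.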

Uwe Ligges
