[R] rbind and data.frame [simplified]

Göran Broström gb at stat.umu.se
Mon Dec 10 09:01:17 CET 2001


Thanks for the interest in my timing problem. I have stripped out all
calculations in order to isolate it, and it is obvious that size
matters a lot, and also that matrices are faster than data frames.

I give you the full listing here, but it is 
really the last few lines that are interesting (= slow):

The test function koll ('koll' ~ 'check', Swedish):
--------------------------------------------------------------------
koll <- function(dat, com.dat, com.ins, no.of.outrows = 1000){
  ## 'dat' is a data frame with variables:
  ## bdate = birth date
  ## enter = left truncation time
  ## exit  = right censoring/event time
  ## event = event indicator (0 if no event).
  ## other covariates.

  ## com.dat is a data frame with columns communal covariates
  ## com.ins is a description of com.dat: (Is a vector for now!)
  ## start year, period (length == 2)

  ## NOTE: any names(com.dat) must be != any names(dat) !!!

  nn <- nrow(dat)
  n.years <- nrow(com.dat)
  n.com <- ncol(com.dat) ## No. of communal covariates.
##  if (nrow(com.ins) != n.com) stop("Error in com.ins: wrong no of rows")

  iv.length <- com.ins[2]
  cuts <- com.ins[1] + c(0, (1:n.years) * iv.length)
  beg.per <- cuts[1]
  n.yearsp1 <- n.years + 1
  end.per <- cuts[n.years + 1]

  ## Map the two date columns of 'dates' onto interval indices in
  ## 1..n.years, clamping dates that fall outside the covered period:
  get.iv <- function(dates)
    cbind(pmin(pmax(1, ceiling((dates[, 1] - beg.per) / iv.length)),
               n.years),
          pmin(pmax(1, ceiling((dates[, 2] - beg.per) / iv.length)),
               n.years))

  
  ## First, find the size of the new data frame (nn.out):
  nn.out <- 0

  ind.date <- cbind(dat$bdate + dat$enter, dat$bdate + dat$exit)
  ## Elementwise '&' (not '&&'), so that 'cases' is a logical vector:
  cases <- ( (ind.date[, 1] < end.per) & (ind.date[, 2] > beg.per) )
  ind.iv <- get.iv(ind.date)
  ##return(ind.iv)
  nn.out <- sum(ind.iv[cases, 2] - ind.iv[cases, 1] + 1)
  ##return(nn.out)


  ## We now have 'nn.out'. We next create an empty data frame 'dat.out':
  xx <- cbind(dat[1, , drop = FALSE], com.dat[1, , drop = FALSE])
  dat.out <- matrix(NA, ncol = ncol(xx), nrow = nn.out)
  dat.out <- data.frame(dat.out)
  names(dat.out) <- names(xx)
  dat.out <- rbind(xx, dat.out)[-1, ]
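  ## (Rbinding the real first row and then dropping it is a trick to give
  ##  the NA-filled 'dat.out' the same column classes -- factors etc. --
  ##  as 'xx'.)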
  ##return(dat.out)

  ## And so we fill it!

  cat("Loop starting:\n")

  fixed.rec <- cbind(dat[1, , drop = FALSE], com.dat[1, , drop = FALSE])

  ## This part is the slow one (and simplified here) :

  for (cur.row in (1:no.of.outrows)){
    dat.out[cur.row, ] <- fixed.rec
      ## cbind(fixed.rec, com.dat[1, , drop = FALSE])
    ## cat("row = ", cur.row, "\n")
  }
  ## return(dat.out)
}
------------------------------------------------------------------------
> str(com.dat)
`data.frame':	215 obs. of  7 variables:
 $ V1: num  0.0000 0.0000 0.0807 0.0987 0.1801 ...
 $ V2: num  0.0277 0.0467 0.0654 0.0831 0.0992 ...
 $ V3: num  -0.0277 -0.0467  0.0153  0.0156  0.0809 ...
 $ V4: num  0.0000 0.0000 0.0000 0.0000 0.0162 ...
 $ V5: num  0.00083 0.00132 0.00180 0.00224 0.00262 ...
 $ V6: num  -0.00083 -0.00132 -0.00180 -0.00224  0.01360 ...
 $ V7: num   0.1905  0.0447 -0.4172 -0.1982  0.7761 ...

> str(dat)
`data.frame':	19848 obs. of  15 variables:
 $ enter    : num  57 58 59 60 63 ...
 $ exit     : num  58 59 60 63 64 ...
 $ stdod2   : num  0 0 0 0 0 0 0 0 1 0 ...
 $ stdod    : num  0 0 0 0 0 0 0 0 29 0 ...
 $ bdate    : num  1754 1754 1754 1754 1754 ...
 $ birthdate: num  1754 1754 1754 1754 1754 ...
 $ sex      : num  1 1 1 1 1 1 1 1 1 0 ...
 $ stparity : num  0 0 0 0 0 0 0 0 0 0 ...
 $ bthq     : num  4 4 4 4 4 4 4 4 4 3 ...
 $ bthpar   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ socc     : Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 4 4 1 ...
 $ parish   : Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
 $ indiv    : num  1e+08 1e+08 1e+08 1e+08 1e+08 ...
 $ famil    : num  1e+05 1e+05 1e+05 1e+05 1e+05 ...
 $ familnu  : num  1e+05 1e+05 1e+05 1e+05 1e+05 ...

Now some timings. In the first two (identical) examples the output data
frame is roughly 55000 rows by 22 variables, but we fill only 100 of
those rows:

> unix.time(koll(dat, com.dat, com.info[1, 1:2], 100))
[1] 48.70 23.86 74.00  0.00  0.00

Note that  R  seems to be 'learning':
> unix.time(koll(dat, com.dat, com.info[1, 1:2], 100))
[1] 33.00 23.28 57.69  0.00  0.0

In this example the output data frame is only around 300 x 22, although
exactly the same amount of information is written to it as above:
> unix.time(koll(dat[1:100, ], com.dat, com.info[1, 1:2], 100))
[1] 0.44 0.13 0.74 0.00 0.00
  
According to 'top' (I'm on Linux), no swapping is involved (I have
1.2 GB of memory).
> gc()
           used (Mb) gc trigger  (Mb)
Ncells  1346357 36.0    2251281  60.2
Vcells 12622828 96.4   23650735 180.5

So size matters! Note that the full-scale function will take a couple
of hours even without any calculations at all.
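
The effect should be easy to reproduce without my data. A minimal sketch
along these lines (the sizes and object names are made up, and I use
system.time) shows the same pattern:

## Fill 100 rows of a large and of a small data frame with an
## identical one-row record; only the size of the target differs.
big   <- data.frame(matrix(0, nrow = 55000, ncol = 22))
small <- data.frame(matrix(0, nrow = 300,   ncol = 22))
rec   <- big[1, , drop = FALSE]
system.time(for (i in 1:100) big[i, ]   <- rec)
system.time(for (i in 1:100) small[i, ] <- rec)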

Now the good part. If I rewrite 'koll' so that data are matrices instead 
of data frames:

> unix.time(hej <- koll(haag, com.dat, com.info[1, 1:2], 50000))
[1] 1.67 0.22 1.89 0.00 0.00                             ^^^^^
                                                          NOTE!
This is only about 3 times slower than the compiled code. That's great!
(Of course, some time will be added by the real calculations.)
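
I won't post the rewritten function here, but the idea is roughly the
following (a sketch only, not my actual code; the function name and its
arguments are made up):

## Sketch: preallocate a matrix, fill it row by row, and turn it
## into a data frame once, at the very end.
## 'fixed.rec' is assumed to be a plain numeric vector of length 'ncols'.
fill.rows <- function(nn.out, ncols, fixed.rec) {
  out <- matrix(NA, nrow = nn.out, ncol = ncols)
  for (cur.row in 1:nn.out)
    out[cur.row, ] <- fixed.rec  # plain numeric assignment, no data frame methods
  as.data.frame(out)             # convert once, at the end
}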

Moral of the story: avoid data frames for manipulations of this kind.
(Am I right?)
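
One thing to watch with the matrix route is factor columns such as socc
and parish. A small illustration (not taken from 'koll') of carrying a
factor through as integer codes and rebuilding it afterwards:

## Store a factor as its integer codes (which fit in a numeric matrix)
## and rebuild the factor from the codes and the saved levels.
f      <- factor(c("1", "4", "2", "4"))
codes  <- as.integer(f)                  # goes into the matrix
f.back <- factor(levels(f)[codes], levels = levels(f))
identical(f, f.back)                     # TRUE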

Göran

On Fri, 7 Dec 2001 james.holtman at convergys.com wrote:

> 
> Here are some timings from a 700 MHz laptop running Windows 2000:
> 
> > x.1 <- data.frame(a=integer(85000), b=double(85000), c=character(85000))
> > str(x.1)
> `data.frame':   85000 obs. of  3 variables:
>  $ a: int  0 0 0 0 0 0 0 0 0 0 ...
>  $ b: num  0 0 0 0 0 0 0 0 0 0 ...
>  $ c: Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
> #
> # loading up a variable with a vector takes very little time
> #
> > system.time(x.1$a <- 1:85000)
> [1] 0.03 0.00 0.03   NA   NA
> > str(x.1)
> `data.frame':   85000 obs. of  3 variables:
>  $ a: int  1 2 3 4 5 6 7 8 9 10 ...
>  $ b: num  0 0 0 0 0 0 0 0 0 0 ...
>  $ c: Factor w/ 1 level "": 1 1 1 1 1 1 1 1 1 1 ...
> #
> # a 'for' loop by itself is only 0.3 seconds
> #
> > system.time(for (i in 1:85000)invisible(1))
> [1] 0.30 0.00 0.31   NA   NA
> #
> # It takes me 5 seconds to fill 85,000 elements of one variable, so the
> # total would depend on how many variables there are and of what type.
> # If 'factors', I would assume you would declare those as 'character'
> # and then convert to 'factor' at the end.
> # So it seems fast; is there something I am missing?
> #
> > system.time(for (i in 1:85000) x.1$a[i] <- i)
> [1] 5.12 0.04 5.22   NA   NA
> >
> 
> 
> 
> 
> "Liaw, Andy" <andy_liaw at merck.com>@stat.math.ethz.ch on 12/07/2001 10:32:31
> 
> Sent by:  owner-r-help at stat.math.ethz.ch
> 
> 
> To:   r-help at stat.math.ethz.ch
> cc:
> Subject:  RE: [R] rbind and data.frame
> 
> 
> Are you sure that the time difference is *only* in creating the data frame,
> rather than other computations in the loop?
> 
> Andy
> 
> > -----Original Message-----
> > From: Göran Broström [mailto:gb at stat.umu.se]
> > Sent: Friday, December 07, 2001 7:25 AM
> > To: Prof Brian Ripley
> > Cc: r-help at stat.math.ethz.ch
> > Subject: Re: [R] rbind and data.frame
> >
> >
> > On Fri, 7 Dec 2001, Prof Brian Ripley wrote:
> >
> > > On Fri, 7 Dec 2001, Göran Broström wrote:
> > >
> > > > On Wed, 5 Dec 2001, Göran Broström wrote:
> > > >
> > > > [...]
> > > >
> > > > > My real problem is how to create a data frame in a sequentially
> > > > > growing manner, when I know the final size (number of cases). I
> > > > > want to avoid calling 'rbind' many times, and instead create an
> > > > > 'empty' data frame in one call, and then fill it. Are there
> > > > > better ways of doing this?
> > > >
> > > > Got no answer to this one, so I provide one myself:
> > >
> > > The usual answer is to create a data frame of the desired size and
> > > populate it via indexing.  That's in some books I know!
> >
> > I know that book too (thanks!). I did what you suggest, and that took
> > 7 hours to run. Definitely.
> >
> > Göran
> >
> > > >
> > > > The answer is: Yes, definitely. I did this with pure R code, and
> > > > created a new data frame with around 58000 records. It took 7 hours
> > > > to run. I then did it with compiled code (Fortran), and that made a
> > > > slight difference: it took 4.8 seconds(!).
> > > >
> > > > Göran
> > > >
> > > >

-- 
 Göran Broström                      tel: +46 90 786 5223
 professor                           fax: +46 90 786 6614
 Department of Statistics            http://www.stat.umu.se/egna/gb/
 Umeå University
 SE-90187 Umeå, Sweden             e-mail: gb at stat.umu.se
