[R] spending most of my time in assignments?

Ross Boylan ross at biostat.ucsf.edu
Thu Dec 26 21:45:07 CET 2013


On Thu, 2013-12-19 at 20:37 -0500, Duncan Murdoch wrote:
> On 13-12-19 6:37 PM, Ross Boylan wrote:
> > My code seems to be spending most of its time in assignment statements,
> > in some cases simple assignment of a model frame or model matrix.
> >
> > Can anyone provide any insights into what's going on, or how to speed
> > things up?
> 
> You are seeing a lot of time being spent on complex assignments.  For 
> example, line 158 is
> 
> data(sims.c1[[k]]) <- sp
> 
> That makes a function call to `data<-` to do the assignment, and that 
> could be slow.  Since it's an S4 method there's a bunch of machinery 
> involved in dispatching it; most of that would not have line number 
> information, so it'll be charged to that line.
> 
> I can't really suggest how to speed it up.
> 
> Duncan Murdoch
Simply reexpressing the same computations without assigning to S4 slots
or dispatching using S4 cut the execution time to under 50% of what it
had been.

I attempted to recreate this with a small bit of code, but got nothing
like the effect in the original. 

Here's the example; v1 is like my original code and v2 is like the
restructured code:
slow <- setClass("slow",
                 representation(form="formula",  data="data.frame", mm="matrix"))
slowb <- setClass("slowb", representation(form="formula", data="data.frame", mm="matrix"))
slowc <- setClass("slowc", representation(form="formula", data="data.frame", mm="matrix"))
# original had only 3 classes
# fake data.  Using my real data didn't seem to affect relative times much.
# real data had ~ 100 columns in model matrix                                                                                                                                                                                                                                                             
mydata <- data.frame(a=rnorm(1500), b=rnorm(1500), c=rnorm(1500),d=as.factor(rep(c("x", "y", "z"), 500)))
myformula <- ~a*d + b*d + c*d
mymm <- model.matrix(myformula, mydata)

if (!isGeneric("putData<-"))
  setGeneric("putData<-", function(obj, value) standardGeneric("putData<-"))

setMethod("putData<-", c("slow", "data.frame"),
          function(obj, value){
            myf <- model.frame(obj at form, value)
            obj at data <- myf
            obj at mm <- model.matrix(obj at form, myf)
            obj
          }
)
setMethod("putData<-", c("slowb", "data.frame"),
          function(obj, value){
            myf <- model.frame(obj at form, value)
            obj at data <- myf
            obj at mm <- model.matrix(obj at form, myf)
            obj
          }    
)
setMethod("putData<-", c("slowc", "data.frame"),
          function(obj, value){
            myf <- model.frame(obj at form, value)
            obj at data <- myf
            obj at mm <- model.matrix(obj at form, myf)
            obj
          }    
)

v1 <- function(n) {
  s <- list(slow(form=myformula, mm=mymm), 1:5)
  for (i in 1:n) {
    mydata$b <- rnorm(nrow(mydata))
    putData(s[[1]]) <- mydata
  }
  mm <- s[[1]]@mm
}

# v2 eliminates the dispatch on putData and the assignment
# to the S4 slot
v2 <- function(n) {
  s <- slow(form=myformula, mm=mymm)
  for (i in 1:n) {
    mydata$b <- rnorm(nrow(mydata))
    myf <- model.frame(s at form, mydata)
    mm <- model.matrix(s at form, myf)
  }
  mm
}

> system.time(r <- v1(100))                                                                                                                                                                                                                                                   
    user  system elapsed
   0.304   0.000   0.307                                                                                                                                                                                                                                                       
> system.time(r <- v2(100))                                                                                                                                                                                                                                                   
    user  system elapsed
    0.26    0.00    0.26

Ross Boylan

> 
> >
> > For starters, is it possible that the reports are not accurate, or that
> > I am misreading them.  In R 3.0.1 (running under ESS):
> >   > Rprof(line.profiling=TRUE)
> >   > system.time(r <- totalEffect(dodata[[1]], dodata[[2]], 1:3, 4))
> >      user  system elapsed
> >    21.629   0.756  22.469
> > !> Rprof(NULL)
> >   > summaryRprof(lines="both")
> >   $by.self
> >                              self.time self.pct total.time total.pct
> >   box.R#158                       6.74    29.56      13.06     57.28
> >   simulator.multinomial.R#64      2.92    12.81       2.96     12.98
> >   simulator.multinomial.R#63      2.76    12.11       2.76     12.11
> >   box.R#171                       2.54    11.14       5.08     22.28
> >   simulator.d1.R#70               0.98     4.30       0.98      4.30
> >   simulator.d1.R#71               0.98     4.30       0.98      4.30
> >   densMap.R#42                    0.72     3.16       0.86      3.77
> >   "standardGeneric"               0.52     2.28      11.30     49.56
> > ......
> >
> > Here's some of the code, with comments at the line numbers
> > box.R:
> >                  sp <- merge(sexpartner, data, by="studyidx")
> >                  sp$y <- numFactor(sp$pEthnic)  #I think y is not used but must be present
> >                  data(sims.c1[[k]]) <- sp    ###<<<<< line 158
> >                  sp0 <- sp
> >                  sp <- sim(sims.c1[[k]], i)
> >                  ctable[[k]] <- update.c1(ctable[[k]], sp)
> >                  if (is.null(i.c1.in)) {
> >                      i.c1.in <- match("pEthnic", colnames(sp0))
> >                      i.c1.out <- match(c("studyidx", "n", "pEthnic"), colnames(sp))
> >                  }
> >                  sp0 <- merge(sp0[,-i.c1.in], sp[,i.c1.out], by=c("studyidx", "n"))
> >                  # d1
> >                  sp0 <- sp0[sp0$pIsMale == 1,]
> >                  # avoid lots of conversion warnings
> >                  sp0$pEthnic <- factor(sp0$pEthnic, levels=partRaceLevels)
> >                  data(sims.d1[[k]]) <- sp0    ###<<<<< line 171
> >                  sp <- sim(sims.d1[[k]], i)
> >                  dtable[[k]] <- update.d1(dtable[[k]], sp)
> >                  rngstate[[k]] <- .Random.seed
> > The timing seems odd since it doesn't appear there's anything to do at
> > the 2 lines except invoke data<-, but if that's slow I would expect the
> > time to go to the data<- function (in a different file) and not to the
> > call.
> >
> > In fact the other big time items are inside the data<- functions.
> > simulator.multinomial.R:
> >
> >     setMethod("data<-", c("simulator.multinomial", "data.frame"),
> >            function(obj, value) {
> >      mf <- model.frame(obj at dataFormula, data=value)
> >      mf$iCluster <- fromOrig(obj at idmap, as.character(mf$studyidx))
> >      if (any(is.na(mf$iCluster)))
> >          stop("New studyidx--need to draw from meta distn")
> >      mm <- model.matrix(obj at modelFormula, data=mf)
> >      obj at data <- mf  ##<<< line 63
> >      obj at mm <- mm    ##<<< line 64
> >      return(obj)
> > })
> >
> > The mm and data slots have type restrictions, but no other validation
> > tests.
> > setClass("simulator.multinomial",
> >           representation(fit="stanfit", idmap="sIDMap",
> >                          modelFormula="formula",
> >                          categories="ANY",  # could be factor or character
> >                                          # categories should be in the order of their numeric codes in y
> >                          # cached results
> >                          coef="list",
> >                          data="data.frame",
> >                          dataFormula="formula",
> >                          mm="matrix"))
> > Does it matter that, e.g., a model frame is more than a vanilla data frame?
> >
> > I thought assignment, given R's lazy copying behavior, was essentially
> > resetting a pointer, and so should be fast.
> >
> > Or maybe the time is going to garbage collecting the previous contents
> > of the slots?
> >
> > Ross Boylan
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>



More information about the R-help mailing list