[R] do.call vs. lapply for lists

Marc Schwartz marc_schwartz at comcast.net
Mon Apr 9 19:05:52 CEST 2007


On Mon, 2007-04-09 at 12:45 -0400, Muenchen, Robert A (Bob) wrote:
> Hi All,
> 
> I'm trying to understand the difference between do.call and lapply for
> applying a function to a list. Below is one of the variations of
> programs (by Marc Schwartz) discussed here recently to select the first
> and last n observations per group.
> 
> I've looked in several books, the R FAQ and searched the archives, but I
> can't find enough to figure out why lapply doesn't do what do.call does
> in this case. The help files & newsletter descriptions of do.call sound
> like it would do the same thing, but I'm sure that's due to my lack of
> understanding about their specific terminology. I would appreciate it if
> you could take a moment to enlighten me. 
> 
> Thanks,
> Bob
> 
> mydata <- data.frame(
>   id      = c('001','001','001','002','003','003'),
>   math    = c(80,75,70,65,65,70),
>   reading = c(65,70,88,NA,90,NA)
> )
> mydata
> 
> mylast <- lapply( split(mydata,mydata$id), tail, n=1)
> mylast
> class(mylast) #It's a list, so lapply will do *something* with it.
> 
> #This gets the desired result:
> do.call("rbind", mylast)
> 
> #This doesn't do the same thing, which confuses me:
> lapply(mylast,rbind)
> 
> #...and data.frame won't fix it as I've seen it do in other
> # circumstances:
> data.frame( lapply(mylast,rbind) )

Bob,

A key difference is that do.call() operates (in the above example) as if
the actual call was:

> rbind(mylast[[1]], mylast[[2]], mylast[[3]])
   id math reading
3 001   70      88
4 002   65      NA
6 003   70      NA

In other words, do.call() takes the function (named here by the quoted
string "rbind") and passes each component of the list as a separate
argument. So rbind() is called only once, with three arguments.
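
To check that equivalence yourself, here is a minimal sketch. It assumes
stringsAsFactors = TRUE when rebuilding mydata (newer versions of R no
longer convert strings to factors by default), and the names via_docall,
via_direct and same_columns are only illustrative:

```r
# Rebuild the example data from the original post.
# Assumption: stringsAsFactors = TRUE, so that 'id' is a factor as shown here.
mydata <- data.frame(
  id      = c('001','001','001','002','003','003'),
  math    = c(80,75,70,65,65,70),
  reading = c(65,70,88,NA,90,NA),
  stringsAsFactors = TRUE
)
mylast <- lapply(split(mydata, mydata$id), tail, n = 1)

# One call to rbind(), with each list component passed as its own argument.
via_docall <- do.call("rbind", mylast)

# The same call written out by hand.
via_direct <- rbind(mylast[[1]], mylast[[2]], mylast[[3]])

# do.call() also passes the list names as argument names, so the row labels
# may differ between the two results; the column contents should match.
same_columns <- identical(via_docall$id, via_direct$id) &&
  identical(via_docall$math, via_direct$math) &&
  identical(via_docall$reading, via_direct$reading)
```

If same_columns is TRUE, both forms produced the same three-row data
frame apart from any difference in row labels.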

In this case, rbind() internally handles all of the factor level issues,
etc. to enable a single common data frame to be created from the three
independent data frames contained in 'mylast':

> str(mylast)
List of 3
 $ 001:'data.frame':    1 obs. of  3 variables:
  ..$ id     : Factor w/ 3 levels "001","002","003": 1
  ..$ math   : num 70
  ..$ reading: num 88
 $ 002:'data.frame':    1 obs. of  3 variables:
  ..$ id     : Factor w/ 3 levels "001","002","003": 2
  ..$ math   : num 65
  ..$ reading: num NA
 $ 003:'data.frame':    1 obs. of  3 variables:
  ..$ id     : Factor w/ 3 levels "001","002","003": 3
  ..$ math   : num 70
  ..$ reading: num NA
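
In the str() output above every piece already carries all three levels
of 'id', so the level handling is easy to miss. The sketch below makes it
visible by first dropping the unused levels from each piece. That
droplevels() step and the names pieces, levels_per_piece and combined are
my additions for illustration, not part of the original program, and
stringsAsFactors = TRUE is again an assumption:

```r
# Rebuild the example data; stringsAsFactors = TRUE is an assumption that
# makes 'id' a factor, as in the str() output.
mydata <- data.frame(
  id      = c('001','001','001','002','003','003'),
  math    = c(80,75,70,65,65,70),
  reading = c(65,70,88,NA,90,NA),
  stringsAsFactors = TRUE
)
mylast <- lapply(split(mydata, mydata$id), tail, n = 1)

# Drop unused levels so each piece carries only its own level of 'id'.
pieces <- lapply(mylast, droplevels)
levels_per_piece <- lapply(pieces, function(d) levels(d$id))

# A single rbind() call merges the differing level sets into one factor.
combined <- do.call("rbind", pieces)
combined_levels <- levels(combined$id)
```

Here each element of levels_per_piece holds a single level, while
combined_levels holds all three, which is the bookkeeping that rbind()
takes care of when it sees all the data frames in one call.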


On the other hand, lapply() (as above) calls rbind() _separately_ for
each component of mylast.  It therefore acts as if the following series
of three separate calls were made:


> rbind(mylast[[1]])
   id math reading
3 001   70      88

> rbind(mylast[[2]])
   id math reading
4 002   65      NA

> rbind(mylast[[3]])
   id math reading
6 003   70      NA


lapply() then collects those three separate results into a single R
list object and returns it:

> lapply(mylast, rbind)
$`001`
   id math reading
3 001   70      88

$`002`
   id math reading
4 002   65      NA

$`003`
   id math reading
6 003   70      NA
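
A short sketch to confirm the shape of that result, under the same
assumption about stringsAsFactors; per_piece, piece_rows and combined are
illustrative names only:

```r
# Rebuild the example data (stringsAsFactors = TRUE is an assumption).
mydata <- data.frame(
  id      = c('001','001','001','002','003','003'),
  math    = c(80,75,70,65,65,70),
  reading = c(65,70,88,NA,90,NA),
  stringsAsFactors = TRUE
)
mylast <- lapply(split(mydata, mydata$id), tail, n = 1)

# lapply() calls rbind() once per component, each time with one argument.
per_piece <- lapply(mylast, rbind)

# The result is still a list of three one-row data frames.
piece_rows <- vapply(per_piece, nrow, integer(1))

# Binding them into one data frame still needs a single rbind() call.
combined <- do.call("rbind", per_piece)
```

So lapply(mylast, rbind) hands back essentially what went in, and the
do.call() step is what actually stacks the rows.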


It is a subtle, but of course critical, difference in how the function
being applied is called and how the arguments are passed to it.
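
The same contrast shows up with a much simpler function. This is only a
toy sketch using sum() in place of rbind(); the names nums, one_call and
three_calls are made up for the example:

```r
nums <- list(1, 2, 3)

# do.call(): one call, equivalent to sum(1, 2, 3).
one_call <- do.call(sum, nums)      # 6

# lapply(): three calls, sum(1), sum(2) and sum(3), collected in a list.
three_calls <- lapply(nums, sum)    # list(1, 2, 3)
```

With do.call() the list supplies the arguments of a single call; with
lapply() each list component becomes the argument of its own call.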

Does that help?

Regards,

Marc Schwartz


