[R] Improving performance of split-apply problem

R. Michael Weylandt michael.weylandt at gmail.com
Thu Feb 23 14:12:16 CET 2012


What you are doing looks reasonably efficient. The object returned by
lm() has a residuals component, so you could use that directly instead
of subtracting the fitted values yourself (which will be just a little
more efficient).
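
To illustrate, here is a minimal sketch on made-up data (the data frame
below is a stand-in for one of your padded 40-row groups, not your
actual data). It computes the MSE once from the stored residuals and
once from the fitted values:

```r
# Toy data standing in for one padded 40-row group (invented for illustration).
set.seed(1)
x <- data.frame(A = rnorm(40), B = rnorm(40), C = rnorm(40),
                D = rnorm(40), E = rnorm(40))

fit <- lm(A ~ B + C + D + E, data = x)

# MSE straight from the residuals stored on the fit;
# residuals(fit) is the equivalent accessor form.
mse_resid <- mean(fit$residuals^2)

# The same quantity computed from the fitted values, for comparison.
mse_fitted <- mean((fit$fitted.values - x$A)^2)
```

The two values should agree up to floating-point error.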

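The bulk of the time is probably being taken up in the lm() call,
which has a lot of overhead (handling the formula and building the
model frame on every call): you could use fastLm from the
RcppArmadillo package or lm.fit() directly to cut a lot of this out.
Note that lm.fit() takes a design matrix and a response vector rather
than a formula.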
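
Something along these lines might work with lm.fit(). This is only a
sketch: get_MSE_fast is a name I made up, the toy group g is invented,
and I am assuming your columns really are called ID and A through E as
in your function. It pads to 40 rows with zeros as you describe and
adds the intercept column by hand, since lm.fit() does not add one.

```r
# Hypothetical replacement for get_MSE() that skips the formula interface.
get_MSE_fast <- function(x) {
  # Reorder/pad to IDs 1..40; IDs absent from the group give all-NA rows.
  m <- as.matrix(x[match(1:40, x$ID), c("A", "B", "C", "D", "E")])
  m[is.na(m)] <- 0                      # padded rows become all zeros
  # Design matrix: explicit intercept column plus the four predictors.
  X <- cbind(1, m[, c("B", "C", "D", "E")])
  fit <- lm.fit(X, m[, "A"])
  mean(fit$residuals^2)
}

# Toy group with 6 of the 40 possible IDs present (invented data).
set.seed(42)
g <- data.frame(ID = c(1, 3, 7, 12, 20, 33),
                A = rnorm(6), B = rnorm(6), C = rnorm(6),
                D = rnorm(6), E = rnorm(6))
mse_fast <- get_MSE_fast(g)
```

With plyr you would then call it per group in the same way as before,
for example by grouping on c("group1", "group2") and wrapping the
result in a one-column data frame. I have not timed this on data of
your size, so treat any speedup as something to measure.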

Michael

On Wed, Feb 22, 2012 at 9:10 PM, Martin <misenial at gmail.com> wrote:
> Hello,
> I'm very new to R so my apologies if I'm making an obvious mistake.
>
> I have a data frame with ~170k rows and 14 numeric variables. The first 2
> of those variables (let's call them group1 and group2) are used to define
> groups: each unique pair of (group1,group2) is a group. There are roughly
> 50k such unique groups, with sizes varying from 1 through 40 rows each.
>
> My objective is to fit a linear regression within each group and get its
> mean square error (MSE). So the final output needs to be a collection of
> 50k MSE's.  Now, regardless of the size of the group, the regression needs
> to be run on exactly 40 observations. If the group has less than 40
> observations, then I need to add rows to get to 40, populating all
> variables with 0's for those extra rows. Here's the function I wrote to do
> this:
>
> get_MSE = function(x) {
>  rownames(x) = x$ID  #'ID' can take on any value from 1 to 40.
>  x = x[as.character(1:40), ]
>  x[is.na(x)] = 0
>  regressionResult = lm(A ~ B + C + D + E, data=x)  #A-E are some variables in the data frame.
>  MSE = mean((regressionResult$fitted.values - x$A)^2)
>  return(MSE)
> }
>
> library(plyr)
> output = ddply(dataset, list(dataset$group1, dataset$group2), get_MSE)
>
> The above code takes about 10 minutes to run, but I'd really need it to be
> much faster, if at all possible. Is there anything I can do to speed up the
> code?
>
> Thank you very much in advance.
>
> Jose


