[R] speeding up regressions using ddply

Ista Zahn izahn at psych.rochester.edu
Wed Sep 22 16:41:06 CEST 2010


Hi Alison,

On Wed, Sep 22, 2010 at 11:05 AM, Alison Macalady <ali at kmhome.org> wrote:
>
>
> Hi,
>
> I have a data set that I'd like to run logistic regressions on, using ddply
> to speed up the computation of many models with different combinations of
> variables.

In my experience ddply is not particularly fast. I use it a lot
because it is flexible and has easy-to-understand syntax, not for its
speed.

> I would like to run regressions on every unique two-variable
> combination in a portion of my data set,  but I can't quite figure out how
> to do using ddply.

I'm not sure ddply is the tool for this job.

> The data set looks like this, with "status" as the
> binary dependent variable and V1:V8 as potential independent variables in
> the logistic regression:
>
> m <- matrix(rnorm(288), nrow = 36)
> colnames(m) <- paste('V', 1:8, sep = '')
> x <- data.frame( status = factor(rep(rep(c('D','L'), each = 6), 3)),
>               as.data.frame(m))
>

You can use combn to determine the combinations you want:

Varcombos <- combn(names(x)[-1], 2)
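
As a quick check, here is a minimal sketch of what combn returns for the
example data above. The set.seed() call is my own addition, only there to
make the example reproducible. The result is a character matrix with one
column per pair, so 8 variables give choose(8, 2) = 28 columns.

```r
set.seed(1)  # assumption: seed added only for reproducibility
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# every unique pair of predictor names (the first column, status, is dropped)
Varcombos <- combn(names(x)[-1], 2)

dim(Varcombos)    # 2 rows (the two variables), 28 columns (the pairs)
Varcombos[, 1:3]  # first three pairs: V1/V2, V1/V3, V1/V4
```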

From there you can do a loop, something like:

results <- list()
for (i in seq_len(ncol(Varcombos)))
{
  # build the formula status ~ Va + Vb for the i-th pair of variables
  log.glm <- glm(as.formula(paste("status ~ ", Varcombos[1,i], " + ",
                                  Varcombos[2,i], sep="")),
                 family=binomial(link=logit), na.action=na.omit, data=x)
  glm.summary <- summary(log.glm)
  aic <- extractAIC(log.glm)
  # rows: intercept, first variable, second variable; column 1 = estimate
  coefs <- coef(glm.summary)
  # estimates for the two predictors, or whatever other output here
  results[[i]] <- list(Est1=coefs[2,1], Est2=coefs[3,1], AIC=aic[2])
  names(results)[i] <- paste(Varcombos[1,i], Varcombos[2,i], sep="_")
}

I'm sure you could replace the loop with something more elegant, but
I'm not really sure how to go about it.
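
For what it's worth, here is one possible loop-free sketch. It is untested
against the real data, and it assumes that passing a function to combn
through its FUN argument, and building the formula with reformulate(), is
acceptable. The helper name fit.pair and the seed are made up for
illustration.

```r
set.seed(1)  # assumption: seed added only for reproducibility
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# fit.pair is a hypothetical helper: fit status ~ vars[1] + vars[2]
# and return one row of output for that pair
fit.pair <- function(vars) {
  fit <- glm(reformulate(vars, response = "status"),
             family = binomial(link = logit), na.action = na.omit, data = x)
  est <- coef(fit)  # intercept, first variable, second variable
  data.frame(Var1 = vars[1], Var2 = vars[2],
             Est1 = est[[2]], Est2 = est[[3]],
             AIC = extractAIC(fit)[2])
}

# combn calls fit.pair once per pair; simplify = FALSE keeps a list
results.list <- combn(names(x)[-1], 2, FUN = fit.pair, simplify = FALSE)
results.df <- do.call(rbind, results.list)
head(results.df)
```

This gives one data frame with a row per pair rather than a nested list,
which is easier to sort by AIC, but it is not likely to be any faster than
the loop: the time goes into the glm fits themselves.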

> I used melt to put my data frame into a more workable format
> require(reshape)
> xm <- melt(x, id = 'status')
>
> Here is the basic shape of the function I'd like to apply to every
> combination of variables in the dataset:
>
> h<- function(df)
> {
>
> attach(df)
> log.glm <- (glm(status ~ value1+ value2 , family=binomial(link=logit),
> na.action=na.omit)) #What I can't figure out is how to specify 2 different
> variables (I've put value1 and value2 as placeholders) from the xm to
> include in the model
>
> glm.summary<-summary(log.glm)
> aic <- extractAIC(log.glm)
> coef <- coef(glm.summary)
> list(Est1=coef[1,2], Est2=coef[3,2],  AIC=aic[2]) #or whatever other output
> here
> }
>
> And then I'd like to use ddply to speed up the computations.
>
> require(plyr)
> output <- ddply(xm, .(variable), as.data.frame.function(h))
> output
>
>
> I can easily do this using ddply when I only want to use 1 variable in the
> model, but can't figure out how to do it with two variables.

I don't think this approach can work. You are saying "split up xm by
variable" and then expecting to be able to reference different levels
of variable within each split, which is impossible: each split
contains only one level of variable.
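
To see the problem concretely, here is a small sketch that splits the long
data by variable and counts the levels present in each piece. It uses base
R's stack() and split() as stand-ins for melt() and ddply's splitting step,
which is an assumption on my part to keep the example package-free. The
seed is also my own addition.

```r
set.seed(1)  # assumption: seed added only for reproducibility
m <- matrix(rnorm(288), nrow = 36)
colnames(m) <- paste('V', 1:8, sep = '')
x <- data.frame(status = factor(rep(rep(c('D','L'), each = 6), 3)),
                as.data.frame(m))

# assumption: stack() used as a base-R stand-in for melt(x, id = 'status')
xm <- data.frame(status = rep(x$status, times = 8), stack(x[-1]))
names(xm) <- c("status", "value", "variable")

# the same kind of split that ddply(xm, .(variable), ...) hands to the function
pieces <- split(xm, xm$variable)

# number of distinct variables present in each piece: 1 every time
sapply(pieces, function(d) length(unique(d$variable)))
```

Each piece holds the values of a single variable, so there is never a
second variable available to put in the model.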

Hope this helps,
Ista

>
> Many thanks for any hints!
>
> Ali
>
>
>
> --------------------
> Alison Macalady
> Ph.D. Candidate
> University of Arizona
> School of Geography and Development
> & Laboratory of Tree Ring Research
>



-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org


