[R] Splitting or Subsetting Using foreach

Doran, Harold HDoran at air.org
Thu Dec 1 18:27:43 CET 2016


I am having tremendous success using the foreach function in the foreach package to send work out to multiple cores in order to reduce computational time.

I am experimenting with which types of tasks benefit from running in parallel and which do not, so this is a bit of a learning experience by trial and error.

One particular task I cannot seem to realize a benefit from (in terms of reduced time) is splitting or subsetting a large data frame. I realize there are other "fast" options like using data.table, but my current goal is to see whether this task can benefit from multiple cores or not.

So, a very small toy example of how I am approaching the "traditional" and "parallel" ways is as follows. My actual data is much, much larger, and it turns out the parallel version of doing it this way vis-à-vis the traditional way is unbelievably slow. Hence I'm not sure whether there is a good theoretical reason why such a task cannot run faster when sent out to multiple cores, or whether there is a user error that I need to better understand and correct.

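library(foreach)
library(doParallel)
registerDoParallel(cores=4)  # register a 4-worker parallel backend

# Toy data: 200 ids with 10 rows each
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))

# "Traditional" way: a single call to split()
ff1 <- split(tmp, tmp$id)

# "Parallel" way: one foreach task per unique id
myList <- unique(tmp$id)
N <- length(myList)
ff2 <- foreach(i = 1:N) %dopar% { tmp[which(tmp$id == myList[i]),]}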
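
For reference, a minimal sketch of how the two versions above could be timed side by side with system.time(). It also includes a chunked variant that hands each worker a block of ids rather than a single id, since every %dopar% iteration carries some scheduling and data-transfer overhead. The names t_split, t_par, t_chunk, chunks and ff3, and the choice of 4 chunks, are assumptions made for illustration and are not part of the example above. Whether the chunked variant helps will depend on the data and the machine.

```r
library(foreach)
library(doParallel)
registerDoParallel(cores = 4)  # same backend as in the example above

# Same toy data as above
tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
myList <- unique(tmp$id)
N <- length(myList)

# Time the single call to split()
t_split <- system.time(ff1 <- split(tmp, tmp$id))

# Time the one-task-per-id foreach version
t_par <- system.time(
  ff2 <- foreach(i = 1:N) %dopar% { tmp[which(tmp$id == myList[i]), ] }
)

# Hypothetical chunked variant: split the ids into 4 contiguous blocks
# (an assumed number, matching the 4 cores) and let each task call
# split() on its own block of rows
chunks <- split(myList, cut(seq_along(myList), 4, labels = FALSE))
t_chunk <- system.time(
  ff3 <- foreach(ids = chunks, .combine = c) %dopar% {
    sub <- tmp[tmp$id %in% ids, ]
    split(sub, sub$id)
  }
)

# Elapsed wall-clock time for each approach
print(t_split["elapsed"])
print(t_par["elapsed"])
print(t_chunk["elapsed"])
```

All three objects should hold the same 200 pieces; ff2 comes back as an unnamed list, so its names need to be set (or those of ff1 dropped) before comparing it with ff1.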

Thanks,
Harold


