[R] Parallel Scan of Large File

Ryan Garner ryan.steven.garner at gmail.com
Wed Dec 8 02:22:57 CET 2010


Is it possible to scan a large file into a character vector in parallel, in 1M-record
chunks, using scan() with the "doMC" package? Furthermore, can I specify the
task for each child?

i.e. I'm working on a Linux box with 8 cores and would like to scan in 8M
records at a time (all 8 cores scanning 1M records each) from a file with 40M
records total.

library(doMC)      # doMC also attaches foreach and iterators
registerDoMC(8)    # use all 8 cores

child <- foreach(i = icount(40)) %dopar% {
    # each iteration reads its own 1M-record chunk of the file
    scan("data.txt", what = "character", sep = "\n",
         skip = (i - 1) * 1e6, nlines = 1e6, quiet = TRUE)
}

Thus, each child gets a different skip argument: child[[1]]: skip = 0,
child[[2]]: skip = 1e6, child[[3]]: skip = 2e6, ... , child[[40]]:
skip = 39e6. I would then end up with a list of 40 vectors, with
child[[1]] containing records 1 to 1000000, child[[2]] containing records
1000001 to 2000000, ... , and child[[40]] containing records 39000001 to
40000000.
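If one long vector is wanted instead of a list of 40, foreach's .combine
argument should concatenate the chunks in order (a sketch under the same
assumptions as the code above):

all_records <- foreach(i = icount(40), .combine = c) %dopar% {
    # .combine = c stitches the 40 character vectors back together in order
    scan("data.txt", what = "character", sep = "\n",
         skip = (i - 1) * 1e6, nlines = 1e6, quiet = TRUE)
}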

Also, would one file connection suffice, or does there need to be a file
connection that is opened and closed by each child?
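For reference, a minimal sketch of the per-child-connection variant, assuming
each worker opens and closes its own connection to data.txt. Note that skip
still reads and discards the leading lines, so every child pays the cost of
seeking through the file to reach its chunk:

child <- foreach(i = icount(40)) %dopar% {
    con <- file("data.txt", "r")   # each child opens a private connection
    chunk <- scan(con, what = "character", sep = "\n",
                  skip = (i - 1) * 1e6, nlines = 1e6, quiet = TRUE)
    close(con)                     # and closes it when done
    chunk
}

(Passing the filename instead of a connection, as in the earlier code, has
much the same effect: scan() then opens and closes its own connection on each
call.)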



