[Rd] allocation error and high CPU usage from kworker and migration: memory fragmentation?

Sat Mar 15 18:53:39 CET 2014

Hi,

I'm new to this list (and R), but my impression is that this question is 
more appropriate here than R-help. I hope that is right.

I'm having several issues with the performance of an R script. 
Occasionally it crashes with the well-known 'Error: cannot allocate 
vector of size X' (this past time it was 4.8 Gb). When it doesn't crash, 
CPU usage frequently drops quite low (often to 0) with high migration/X 
usage. Adding the 'last CPU used' field to top indicates that the R 
process is hopping from core to core quite frequently. Using taskset to 
set an affinity to one core results in CPU usage more typically in the 
40-60% range with no migration/X usage. But the core starts sharing time 
with a kworker task. renice'ing doesn't seem to change anything. If I 
had to guess, I would think that the kworker task is from R trying to 
re-arrange things in memory to make space for my large objects.
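
One way to test the memory guess is to measure how far R's heap grows during a single step. The sketch below uses only base R; `peak_mib` is a hypothetical helper name, not something from the original script, and the figure it returns includes whatever was already allocated in the session.

```r
# Hypothetical helper (not part of the original script): report the peak
# heap size R reached, in Mb, while evaluating an expression.
peak_mib <- function(expr) {
  gc(reset = TRUE)      # reset the "max used" counters
  force(expr)           # evaluate the (lazily passed) expression
  g <- gc()             # matrix with rows Ncells and Vcells
  sum(g[, ncol(g)])     # last column is "max used", expressed in Mb
}

# Example: peak heap while allocating a numeric vector of 1e7 doubles.
peak <- peak_mib(numeric(1e7))
```

If the peak for one file approaches the machine's RAM, the allocation error is more likely plain memory exhaustion than fragmentation.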

2 machines:
   - 128 and 256 GiB RAM,
   - dual processor Xeons (16 cores + hyperthreading, 32 total 'cores'),
   - Ubuntu 13.10 and 13.04 (both 64 bit),
   - R 3.0.2,
   - data.table 1.8.11 (svn r1129).

Data: We have main fact tables stored in about 1000 R data files that
range up to 3 GiB in size on disk, so up to roughly 50 GiB in RAM.

Questions:
   - Why is R skipping around cores so much? I've never seen that happen 
before with other R scripts or with other statistical software. Is it 
something I'm doing?
   - When I set the affinity of R to one core, why is there so much 
kworker activity? It seems obvious that it is the R script generating 
this kworker activity on the same core. I'm guessing this is R trying to 
recover from memory fragmentation?
   - I suspect a lot of my problem is from the merges. If I did that in 
one line, would this help at all?
     move <- merge(merge(move, upc, by=c('upc')), parent,
                   by=c('store', 'year'))
     * other strategies to improve merge performance?
   - If this is a memory fragmentation issue, is there a way to get
lapply to allocate not just pointers to the data.tables that will be
allocated, but to (over)allocate the data.tables themselves? The final
list should be about 1000 data.tables long, with each data.table no
larger than 6000x4.
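
On the merge question: nesting the two merges in one expression still has to build the intermediate table, so it is unlikely to lower peak memory by itself. A possible alternative, sketched below on invented toy data, is to add the lookup columns by reference with update joins. This assumes a data.table release that supports the `on=` argument, which is newer than the 1.8.11 mentioned above. Note also that `merge` drops unmatched rows, whereas an update join keeps them with `NA` in the new column.

```r
library(data.table)

# Toy stand-ins for the tables in the post; all values are made up.
move   <- data.table(upc = c(1L, 1L, 2L, 3L), store = c(10L, 11L, 10L, 11L),
                     year = 2013L, price = c(2, 2, 5, 1),
                     units = c(3L, 1L, 2L, 4L))
upc    <- data.table(upc = 1:3, is_PL = c(TRUE, FALSE, TRUE))
parent <- data.table(store = c(10L, 11L), year = 2013L,
                     parent = c(100L, 200L))

# Nested merge, as asked about in the question; each merge returns a new table.
nested <- merge(merge(move, upc, by = "upc"), parent,
                by = c("store", "year"))

# Update joins: add is_PL and parent to move in place, without creating a
# merged copy of move.
move[upc, is_PL := i.is_PL, on = "upc"]
move[parent, parent := i.parent, on = c("store", "year")]
```

In this toy case every row of `move` has a match, so both approaches carry the same rows and columns; only the column order and sort order differ.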

I've used data.table in a similar strategy to build lists like this 
before without issue from the same data. I'm not sure what is different 
about this code compared to my other code. Perhaps the merging?

The gist of the R code is pretty basic (modified for simplicity). The 
action is all happening in the reduction_function and lapply. I keep 
reassigning to move to try to indicate to R that it can gc the previous 
object referenced by move.
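
A minimal base-R sketch of that idea, assuming each file contains an object named `move`: load it into a private environment, keep only the small summary, then drop the large object explicitly. `load_move` and the throwaway file are hypothetical, introduced only for illustration.

```r
# Hypothetical helper: load one .RData file into a private environment and
# return the object named `move`, so nothing else lands in the workspace.
load_move <- function(filename) {
  env <- new.env()
  load(filename, envir = env)
  get("move", envir = env)
}

# Round trip with a throwaway file standing in for one movement file.
tmp <- tempfile(fileext = ".RData")
move <- data.frame(upc = 1:3, units = c(2L, 0L, 5L))
save(move, file = tmp)
rm(move)

reloaded <- load_move(tmp)
reduced  <- sum(reloaded$units)  # keep only the small summary
rm(reloaded)                     # the large object is now unreferenced
invisible(gc())                  # optional: collect now rather than later
unlink(tmp)
```

Calling `gc()` by hand is not required for correctness, since R collects unreferenced objects when it needs space; it only makes the timing explicit.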

library(data.table)
library(lubridate)
# imports several data.tables, total 730 MiB
load(UPC) # provides PL_flag data.table
load(STORES) # and parent data.table
timevar = 'month'
by=c('retailer', 'month')
save.dir='/tmp/R_cache'
each.parent <- rbindlist(lapply(sort(list.files(MOVEMENT, full.names=TRUE)),
                                reduction_function, upc=PL_flag,
                                parent=parent, timevar=timevar, by=by))

reduction_function <- function(filename, upc, parent, timevar, by,
                               save.dir=NA) {
     # imports move, a potentially large data.table
     # (memory size 10 MiB-50 GiB)
     load(filename)
     move[, c(timevar, 'year') := list(floor_date(week_end, unit=timevar),
                                       year(week_end))]
     move <- merge(move, upc, by=c('upc')) # adds is_PL column, a boolean
     # adds parent column, an integer
     move <- merge(move, parent, by=c('store', 'year'))
     setkeyv(move, by)
     # this reduces move to a data.table with at most 6000 rows, but
     # always 4 columns
     move <- move[, list(revenue=sum(price*units),
                         revenue_PL=sum(price*units*is_PL)),
                  keyby=by]
     move[, category := gsub(search, replace, filename)]
     return(move)
}
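
For checking the aggregation logic without the large files, the reduce-then-bind pattern can be tried on small in-memory tables. This is only a sketch: `reduce_one`, `toy`, and all values are invented, and the date and merge steps are left out.

```r
library(data.table)

# Hypothetical stand-in for reduction_function that takes an in-memory table
# instead of a filename.
reduce_one <- function(move, label) {
  out <- move[, list(revenue = sum(price * units),
                     revenue_PL = sum(price * units * is_PL)),
              keyby = list(retailer, month)]
  out[, category := label]   # tag each reduced table with its source
  out[]
}

# Two toy "files"; the values are made up.
toy <- list(
  soda  = data.table(retailer = c(1L, 1L, 2L), month = as.Date("2013-01-01"),
                     price = c(2, 3, 4), units = c(1L, 2L, 3L),
                     is_PL = c(TRUE, FALSE, TRUE)),
  chips = data.table(retailer = 1L, month = as.Date("2013-01-01"),
                     price = 5, units = 2L, is_PL = FALSE)
)

# Reduce each table, then stack the small results.
each <- rbindlist(lapply(names(toy), function(nm) reduce_one(toy[[nm]], nm)))
```

Because only the reduced tables are kept in the list, the memory held across iterations stays small; the large allocations happen inside each call.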

-- 
James Sams
sams.james at gmail.com