[Rd] allocation error and high CPU usage from kworker and migration: memory fragmentation?
sams.james at gmail.com
Sat Mar 15 18:53:39 CET 2014
I'm new to this list (and R), but my impression is that this question is
more appropriate here than R-help. I hope that is right.
I'm having several issues with the performance of an R script.
Occasionally it crashes with the well-known 'Error: cannot allocate
vector of size X' (most recently it was 4.8 Gb). When it doesn't crash,
CPU usage frequently drops quite low (often to 0) with high migration/X
usage. Adding the 'last CPU used' field to top indicates that the R
process is hopping from core to core quite frequently. Using taskset to
set an affinity to one core results in CPU usage more typically in the
40-60% range with no migration/X usage, but the core then starts sharing
time with a kworker task. Renicing doesn't seem to change anything. If I
had to guess, I would think that the kworker task is from R trying to
re-arrange things in memory to make space for my large objects.
Setup:
- 128 and 256 GiB RAM,
- dual processor Xeons (16 cores + hyperthreading, 32 total 'cores'),
- Ubuntu 13.10 and 13.04 (both 64 bit),
- R 3.0.2,
- data.table 1.8.11 (svn r1129).*
Data: We have main fact tables stored in about 1000 R data files that
range up to 3 GiB in size on disk, so up to about 50 GiB in RAM.
Questions:
- Why is R skipping around cores so much? I've never seen that happen
before with other R scripts or with other statistical software. Is it
something I'm doing?
- When I set the affinity of R to one core, why is there so much
kworker activity? It seems obvious that it is the R script generating
this kworker activity on the same core. I'm guessing this is R trying to
recover from memory fragmentation?
- I suspect a lot of my problem is from the merges. Would doing both
merges in one line help at all?
move <- merge(merge(move, upc, by=c('upc')), parent, by=c('store',
- Are there other strategies to improve merge performance?
- If this is a memory fragmentation issue, is there a way to get
lapply to allocate not just pointers to the data.tables that will be
allocated, but to (over)allocate the data.tables themselves? The final
list should be about 1000 data.tables long, with each data.table no
larger than 6000x4.
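
To make the last question concrete, here is a minimal sketch of what over-allocating the result could look like: one data.table sized for the worst case, filled block by block with data.table's set(), then trimmed. The column names, the toy sizes and the helper fake_reduce are all made up for illustration, and whether this has any effect on fragmentation is only a guess.

```r
library(data.table)

n_files  <- 3L   # about 1000 in the real job
max_rows <- 4L   # 6000 in the real job

# Hypothetical stand-in for the per-file reduction: returns k rows, 4 columns.
fake_reduce <- function(k) {
  data.table(parent   = seq_len(k),
             is_PL    = rep(c(TRUE, FALSE), length.out = k),
             revenue  = k * seq_len(k) * 1.5,
             category = paste0("cat", k))
}

# Allocate the worst-case result once, with the final column types.
out <- data.table(parent   = rep(NA_integer_, n_files * max_rows),
                  is_PL    = NA,
                  revenue  = NA_real_,
                  category = NA_character_)

used <- 0L
for (k in seq_len(n_files)) {
  piece <- fake_reduce(k)
  rows  <- used + seq_len(nrow(piece))
  # set() writes into the existing columns by reference, without copying out.
  for (col in names(piece)) set(out, i = rows, j = col, value = piece[[col]])
  used <- used + nrow(piece)
}

# Drop the unused tail of the over-allocated table.
out <- out[seq_len(used)]
```

Since the reduced pieces are tiny compared with the files they come from, it seems more likely that the large intermediate objects matter than the final list.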
I've used data.table in a similar strategy to build lists like this
before without issue from the same data. I'm not sure what is different
about this code compared to my other code. Perhaps the merging?
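
For reference, a small sketch of the merge variants on toy data. The tables move_dt, upc_dt and parent_dt and all their values are invented; only the join columns (upc, store, year) come from the script. It assumes a recent data.table (the on= argument needs 1.9.6 or later). The nested call gives the same result as the two-step version and still has to build the intermediate table, so it probably does not save memory on its own. The update join at the end is a different technique: it adds the looked-up columns to the existing table by reference instead of building a new merged table.

```r
library(data.table)

# Toy stand-ins for the real tables (values are made up).
move_dt <- data.table(upc   = c(1L, 1L, 2L, 3L),
                      store = c(10L, 11L, 10L, 11L),
                      year  = c(2012L, 2012L, 2013L, 2013L),
                      price = c(2.5, 2.0, 4.0, 1.0),
                      units = c(3L, 1L, 2L, 5L))
upc_dt    <- data.table(upc = 1:3, is_PL = c(TRUE, FALSE, TRUE))
parent_dt <- data.table(store  = c(10L, 10L, 11L, 11L),
                        year   = c(2012L, 2013L, 2012L, 2013L),
                        parent = c(100L, 100L, 200L, 201L))

# Two-step version, as in the script.
two_step <- merge(move_dt, upc_dt, by = "upc")
two_step <- merge(two_step, parent_dt, by = c("store", "year"))

# Nested one-line version: same joins, the intermediate is just unnamed.
one_line <- merge(merge(move_dt, upc_dt, by = "upc"),
                  parent_dt, by = c("store", "year"))

# Update join: add the lookup columns to a copy of move_dt by reference.
# Unlike merge(), rows without a match are kept and get NA.
by_ref <- copy(move_dt)
by_ref[upc_dt, on = "upc", is_PL := i.is_PL]
by_ref[parent_dt, on = c("store", "year"), parent := i.parent]
```

Note that merge() returns its result sorted by the join columns, while the update join leaves the row order of the original table unchanged.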
The gist of the R code is pretty basic (modified for simplicity). The
action is all happening in the reduction_function and lapply. I keep
reassigning to move to try to indicate to R that it can gc the previous
object referenced by move.
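
The reassignment idea can be sketched in base R as follows. The function reduce_one is a hypothetical stand-in for reduction_function, and numeric(n) plays the role of the large table that load(filename) would bring in. The rm() and gc() calls are optional extras, not something the script does; R would normally collect the unreferenced objects on its own.

```r
# Hypothetical stand-in for reduction_function.
reduce_one <- function(n) {
  big <- numeric(n)     # stand-in for the large object created by load()
  big <- big + 1        # after this, the original vector is no longer referenced by big
  small <- sum(big)     # keep only the reduced result
  rm(big)               # drop the last reference before returning
  invisible(gc())       # optional: request a collection now
  small
}

# Only the small results are kept across iterations.
results <- lapply(c(1e5, 2e5), reduce_one)
```

Anything created inside the function, including objects loaded into its environment, becomes collectable once the function returns and only the reduced value is kept.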
# imports several data.tables, total 730 MiB
load(UPC) # provides PL_flag data.table
load(STORES) # and parent data.table
timevar = 'month'
each.parent <- rbindlist(lapply(sort(list.files(MOVEMENT, full.names=T),
parent=parent, timevar=timevar, by=by))
reduction_function <- function(filename, upc, parent, timevar, by,
load(filename) # imports move, a potentially large data.table
# (memory size 10 MiB-50 GiB)
move[, c(timevar, 'year') := list(floor_date(week_end, unit=timevar),
move <- merge(move, upc, by=c('upc')) # adds is_PL column, a boolean
move <- merge(move, parent, by=c('store', 'year')) # adds parent column, an integer
# this reduces move to a data.table with at most 6000 rows, but
# always 4 columns
move <- move[, list(revenue=sum(price*units),
move[, category := gsub(search, replace, filename)]