[Rd] In memory caching algorithm

Henrik Bengtsson hb at stat.berkeley.edu
Sun Aug 13 02:51:37 CEST 2006


Hi,

quite a few times I have run into problems where I would like to read
data from a data source too large to keep in memory.  Not only that,
but retrieving the data from that source is rather slow.  I wish to
work with only small chunks of the data in memory at any time.

Now, sometimes the same data is retrieved multiple times, calling for
an in-RAM caching mechanism.  To keep it simple, consider only the
case where we read data, that is, no writing/updating.

First of all, the file cache (and the swap) of the OS is not good
enough; the data is spread out over several files, and within each
file the order of the data is such that we can do better than the
file cache.

The data structures that I'm targeting are simple vectors (including
lists), but also matrices (where each column is one data element; I
think it is more efficient to work with columns because of the
column-by-column way matrices are represented in memory).
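As an aside, a quick illustration of the column argument: since a
matrix is stored column by column, extracting a column copies one
contiguous block of memory, whereas extracting a row gathers values
that are nrow(A) elements apart (toy sizes below):

  A <- matrix(rnorm(1e6), nrow=1000)  # 1000 x 1000, stored column by column
  colj <- A[, 17]   # one contiguous block of 1000 doubles
  rowi <- A[17, ]   # 1000 values spaced 1000 elements apart in memory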

I've tried to "scribble" down some thoughts about how such a cache
algorithm could be designed.  Before I get too serious about this, I
would like to check whether someone else has done a similar thing or
knows of references to papers/algorithms that implement this.

Here are my notes/outline.  I would appreciate any kind of feedback.

Setup:
A large dataset lives on a persistent and slow medium.  The data is
too big to keep in memory, forcing us to access it from the
persistent medium.  The data elements are indexed by i=1,2,...,I.
The same data elements might be accessed multiple times.  Using a
cache we wish to minimize the number of times any given element is
read off the persistent medium.
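The code sketches below read cache misses via a function slowData();
to make the sketches self-contained it can be emulated with an
ordinary vector plus an artificial delay (this toy definition is
mine):

  I <- 1e6
  fullData <- rnorm(I)              # stand-in for the slow, persistent source
  slowData <- function(idx) {
    Sys.sleep(0.01 * length(idx))   # emulate slow retrieval
    fullData[idx]
  }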

Requirement:
Thus, the cache, which can hold C elements, has to keep track of
which elements it holds.  We will use a caching strategy that assumes
that the most recently read element is the most likely one to be read
again.  Thus, we wish to have a cache that keeps the most recent
elements and pushes out old ones (i.e. a least-recently-used policy).

Design:
The indices of the data elements currently in the cache are stored
in a vector J of length C, i.e. J = (J_1, ..., J_C), where J_c \in
[1,I], c = 1,...,C.  The elements of J are unique, that is, J_k !=
J_l for all k != l.
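In code, the cache state can be kept in three parallel vectors,
following the notation used here and below: J for the element
indices, cacheData for the cached values, and t for the ages
introduced further down.  A possible initialization, with the extra
assumption that empty slots are marked NA and given an "infinitely
old" age:

  C <- 100                        # cache capacity
  J <- rep(NA_integer_, C)        # J[c]: index of the element held in slot c
  cacheData <- rep(NA_real_, C)   # cacheData[c]: the cached value itself
  t <- rep(Inf, C)                # t[c]: age of slot c (larger = older)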

Assume D data elements are requested to be read.  Let these D
elements be specified by a vector K = (K_1, ..., K_D) where K_d \in
[1,I].  Note that some of these elements may be duplicated.

API:
Say we want to read indices K.  Without cache we would do something like:

  y <- readData(db, K)

and with a cache

  y <- readDataViaCache(db, K)

where 'db' is the "database".  A reference to the actual cache is
held by the 'db' object.
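One way to let the 'db' object hold that reference is to store the
cache state in an environment inside db; since environments are
passed by reference in R, readDataViaCache() can then update the
cache in place.  A minimal sketch with made-up field names, reusing
the C, J, cacheData, t and slowData() sketched above:

  db <- list(read = slowData, cache = new.env())
  db$cache$J         <- J
  db$cache$cacheData <- cacheData
  db$cache$t         <- t

A fuller sketch of readDataViaCache() built on this is given at the
end of these notes.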

Algorithm:
* Create the return structure.  Assume a vector for simplicity.

  x <- vector(mode="numeric", length=D)  # or whatever mode the data has

* Identify the cache hits, that is, which data elements are already
in the cache.  Pull that data from the cache and the rest from the
main data source.  With cacheData denoting the vector of cached
values (indexed by cache slot) and slowData() reading from the slow
source, this can be done as follows:

  c <- match(K, J)  # the cache slot of each requested element, or NA on a miss
  inCache <- !is.na(c)
  x[inCache] <- cacheData[c[inCache]]    # hits: copy straight from the cache
  x[!inCache] <- slowData(K[!inCache])   # misses: read from the slow source

Three cases to worry/think about:

Case 1) all(inCache) is TRUE
Case 2) any(inCache) is FALSE
Case 3) otherwise

I leave this for now, but the steps below can probably be optimized
depending on which case we have.
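For concreteness, a toy run of the hit-detection step with made-up
values (pretend C = 3 and D = 4):

  J <- c(7L, 2L, 9L)       # indices currently in the cache
  K <- c(2L, 5L, 9L, 5L)   # requested indices (note the duplicated 5)
  c <- match(K, J)         # 2 NA 3 NA
  inCache <- !is.na(c)     # TRUE FALSE TRUE FALSE  -> Case 3 above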

* Update the cache.  We have to be careful not to store duplicated
elements in the cache.  We also want to keep track of how "recent" a
certain data element is.  One way to do this is to order the cache by
age, but since we want to minimize the amount of reshuffling in the
cache, it is better to keep track of the age in a separate vector t =
(t_1, ..., t_C).  The ages of the requested data elements are denoted
by u = (u_1, ..., u_D) such that u_D = -D, u_{D-1} = -(D-1), ..., u_1
= -1.  Thus, the last element requested is the most recent one
(because its age value is smaller than any other age value).  The
ages of the elements in the cache are all t_c > 0.  This makes it
easy to tell u_d and t_c apart.
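In code, with D = length(K), the request ages are simply:

  u <- -seq_len(D)   # u[d] = -d: later requests get smaller (more recent) ages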

Update the ages of the elements in the cache that were read from the
cache.  Do it according to their age in the request vector, that is:

  t[c[inCache]] <- u[inCache]

There are E = sum(!inCache) elements that were read from the main
data source and that are not in the cache.  These elements, indexed
by L = K[!inCache], have ages v = u[!inCache].  Note that we might
have duplicated elements here.  We might want to remove the
duplicates already at this point, keeping the most recent occurrence
of each duplicated element:

  nondup <- rev(!duplicated(rev(L)))  # keep the last occurrence of each index
  v <- v[nondup]
  L <- L[nondup]
  Q <- which(!inCache)[nondup]  # the positions in K of these

We have to worry about the mapping back later (which I actually don't
think is a problem).
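A tiny worked example of the rev()/duplicated() trick, with made-up
values, showing that it keeps the last (most recent) occurrence of
each duplicated index:

  L <- c(5L, 8L, 5L)                  # element 5 was requested twice
  v <- c(-2, -3, -5)                  # its second request is more recent
  nondup <- rev(!duplicated(rev(L)))  # FALSE TRUE TRUE
  L[nondup]                           # 8 5
  v[nondup]                           # -3 -5: the age of the last occurrence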

Next, identify the subset of these that are more recent than the
elements currently in the cache (as recorded in t).  To do this:

  w <- sort(c(v, t))  # partial sorting up to C will be enough!  (base::c() is
                      # still found here even though 'c' above is a vector)
  wMax <- w[C]        # the maximum age that still "fits" the cache
  toCache <- (v <= wMax)
  Q <- Q[toCache]
  M <- sum(toCache)   # the number of new elements to put in the cache
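As the comment notes, a full sort is not needed; R's sort() has a
'partial' argument that only guarantees the requested positions end
up in place, so the same quantity can be obtained with:

  wMax <- sort(c(v, t), partial=C)[C]   # the C-th smallest age, no full sort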

Note that this identifies M = sum(toCache) elements that should be
put in the cache.  Three cases exist:

Case M == 0: The cache contents don't have to be updated.  Just
update t and we're done:

  # TO DO: How should t be updated?!?
  # (The ages of the hit slots were already refreshed above, so perhaps all
  #  that remains is to map the ages back to positive values while keeping
  #  their order, e.g.:)
  t <- rank(t)   # most recent slot gets age 1; all t_c > 0 again

Case M == C: The complete cache has to be replaced.  The values to
put in the cache are:

  cacheData <- x[Q]   # the values just read from the slow source
  J <- K[Q]           # their element indices
  t <- v[toCache]     # their (negative) ages
  t <- rank(t)        # renormalize: most recent gets age 1, all t_c > 0 again

Case M < C: Only part of the cache has to be replaced.  The natural
slots to replace are the M oldest ones; the slots holding cache hits
were just given very recent (negative) ages above, so they are only
evicted if they really are older than all of the incoming elements.

  toUpdate <- order(t, decreasing=TRUE)[1:M]   # the M oldest slots
  cacheData[toUpdate] <- x[Q]
  J[toUpdate] <- K[Q]
  t[toUpdate] <- v[toCache]
  t <- rank(t)   # renormalize: all t_c > 0 again
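Finally, to see how the pieces fit together, here is a rough
end-to-end sketch of readDataViaCache() along the lines above,
reusing the environment-based 'db' and the toy slowData() from the
earlier sketches.  It is only meant as a starting point for
discussion; the field names, the NA/Inf initialization of empty
slots, and the eviction rule are my assumptions, not a tested
implementation.

  readDataViaCache <- function(db, K) {
    cache <- db$cache                  # an environment, so updates persist
    C <- length(cache$J)
    D <- length(K)
    u <- -seq_len(D)                   # request ages; later request = more recent
    x <- vector(mode="numeric", length=D)

    # 1. Split the request into cache hits and misses
    pos <- match(K, cache$J)           # cache slot per request, or NA on a miss
    inCache <- !is.na(pos)
    x[inCache]  <- cache$cacheData[pos[inCache]]
    x[!inCache] <- db$read(K[!inCache])

    # 2. Refresh the ages of the slots that were hit
    cache$t[pos[inCache]] <- u[inCache]

    # 3. Deduplicate the misses, keeping the most recent occurrence
    L <- K[!inCache]
    v <- u[!inCache]
    nondup <- rev(!duplicated(rev(L)))
    v <- v[nondup]
    Q <- which(!inCache)[nondup]       # positions in K of the unique misses

    # 4. Which misses are recent enough to enter the cache?
    wMax <- sort(c(v, cache$t), partial=C)[C]
    toCache <- (v <= wMax)
    Q <- Q[toCache]
    M <- sum(toCache)

    # 5. Evict the M oldest slots and insert the new elements
    if (M > 0) {
      toUpdate <- order(cache$t, decreasing=TRUE)[seq_len(M)]
      cache$cacheData[toUpdate] <- x[Q]
      cache$J[toUpdate] <- K[Q]
      cache$t[toUpdate] <- v[toCache]
    }
    cache$t <- rank(cache$t)           # back to positive ages; most recent = 1

    x
  }

  ## Example use:
  y1 <- readDataViaCache(db, c(2, 5, 9, 5))   # all misses the first time
  y2 <- readDataViaCache(db, c(9, 11))        # 9 is now served from the cache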

Cheers

/Henrik


