[R] Hand-crafting an .RData file

Tue Nov 10 00:54:24 CET 2009

If you can manage to write out your data as separate binary files, one per column, another possibility is the ff package. You can link those binary columns into R by defining an ffdf data frame: the columns are memory-mapped, so you can access just the parts you need without importing everything first. This is much faster than a CSV import and also works for files that are too large to import at once.

If all your columns have the same storage.mode (vmode in ff), another alternative is to write all your data into one single binary matrix in row-major order (because that can be written row by row from your program) and link the file into R as a single ff_matrix.
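To make "one binary file per column" concrete, here is a minimal base R sketch of such files. It assumes raw values in native byte order with no header, 4 bytes per integer and 8 bytes per double; the file names are hypothetical, and in practice the files would be written by your own program rather than by R.

```r
# Hypothetical file names in a temporary directory
fa <- file.path(tempdir(), "col_a.bin")
fb <- file.path(tempdir(), "col_b.bin")

a <- 1:9          # integer column
b <- 1:9 + 0.1    # double column

# Write the raw values, one file per column, no header
writeBin(a, fa, size = 4L)
writeBin(b, fb, size = 8L)

# Sanity check: file size should be number of rows * bytes per value
file.size(fa)
file.size(fb)

# Read the columns back to confirm the layout
a2 <- readBin(fa, what = "integer", n = length(a), size = 4L)
b2 <- readBin(fb, what = "double", n = length(b), size = 8L)
```

If the file sizes do not match rows times bytes per value, the writing program is probably using a different value size or adding a header, and the vmode you declare in ff would not fit the file.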

Since ffdf in ff is new, I give a mini-tutorial below.
Let me know how that works for you.

Kind regards

Jens Oehlschlägel

library(ff)

# Create example csv
fnam <- "/tmp/example.csv"
write.csv(data.frame(a=1:9, b=1:9+0.1), file=fnam, row.names=FALSE)

# Create example binary files on disk.
# Reading csv into ffdf actually stores
# each column as a binary file on disk.
# Using a pattern outside fftempdir automatically sets finalizer="close"
# and thus makes those binary files permanent.
path <- "/tmp/example_"
x <- read.csv.ffdf(file=fnam, ff_args=list(pattern=path))
close(x)

# Note that a standard ffdf is made up column by column from simple ff objects.
# More complex mappings from ff objects into ffdf are possible,
# but let's keep it simple for now.
p <- physical(x)
p

# Now let's just create an ffdf from existing binary files.
# Step one: create an ff object for each binary file (without reading them).

# Note that because we open ff files outside fftempdir,
# the default finalizer is "close", not "delete",
# so the files will not be deleted on finalization.
# The files are opened for memory mapping, but not read.
ffcols <- vector("list", length(p))
for (i in seq_along(p)){
  ffcols[[i]] <- ff(filename=filename(p[[i]]), vmode=vmode(p[[i]]))
}
ffcols

# step two: bundle several ff objects into one ffdf data.frame 
# (still without reading data)
ffdafr <- ffdf(a=ffcols[[1]], b=ffcols[[2]])

# now reading rows from this will return a standard data.frame 
# (and only read the required rows)
ffdafr[1:4,]
ffdafr[5:9,]

# As an alternative create an example binary
# (double) matrix in row-major order.
# The file goes to /tmp so that the clean-up at the end finds it.
y <- as.ff(t(ffdafr[,]), filename="/tmp/example_single_matrix.ff")

# Again we can link this existing binary file.
# if we know the size of the matrix we can do
z <- ff(filename=filename(y), vmode="double", dim=c(9,2), dimorder=c(2,1))
z
rm(z)

# If we only know the number of columns we can do
z <- ff(filename=filename(y), vmode="double")
# and set dim later
dim(z) <- c(length(z)/2, 2)
# Note that so far we have interpreted the file in column-major order
z
# To interpret the file in row-major order we set dimorder
# (a generalization for n-way arrays)
dimorder(z) <- c(2,1)
z

# removing the ff objects will trigger finalizer 
# at next garbage collection
rm(x, p, ffcols, ffdafr, y, z)
gc()

# since we carefully selected the "close" finalizer, 
# the files still exist
dir(path="/tmp", pattern="example_")

# now remove them physically
unlink(file.path("/tmp", dir(path="/tmp", pattern="example_")))
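
For comparison, here is a small base R sketch (no ff required) of what a row-major file looks like when it is written row by row, and why the interpretation order matters when reading it back. It assumes native-endian 8-byte doubles with no header; the temporary file name is hypothetical.

```r
fm <- tempfile(fileext = ".bin")
m <- cbind(a = 1:9, b = 1:9 + 0.1)   # 9 x 2 double matrix

# Write one row (record) at a time, as an external program would
con <- file(fm, open = "wb")
for (i in seq_len(nrow(m))) {
  writeBin(as.double(m[i, ]), con)
}
close(con)

# Read the flat vector back
v <- readBin(fm, what = "double", n = length(m))

# Column-major interpretation (R's default) scrambles the values
m_wrong <- matrix(v, nrow = 9, ncol = 2)
# Row-major interpretation recovers the original matrix
m_right <- matrix(v, nrow = 9, ncol = 2, byrow = TRUE)

unlink(fm)
```

Here byrow=TRUE plays the role that dimorder=c(2,1) plays for the ff matrix: the bytes on disk are the same, only the interpretation differs.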
