[R] Large data and space use

Avi Gross avigross at verizon.net
Sat Nov 27 18:56:50 CET 2021


Several recent questions and answers have made me look at some code, and I
realized that some functions may not be great to use when you are dealing
with very large amounts of data that may already be getting close to the
limits of your memory. Does the function you call to do one thing to your
object perhaps overdo it, making multiple copies and not deleting them as
soon as they are no longer needed?


An example was a recent post suggesting a nice set of tools you can use to
convert the columns of your data.frame to integers or dates regardless of
how they were read in from a CSV file or created.


What I noticed is that copies of a sort were often made by trying to
convert the original, say to one date format or another, and then deciding
which, if any, to keep. Sometimes multiple transformations are tried, and
this may be done repeatedly with intermediates left lying around. Yes, the
memory will all be implicitly returned when the function completes. But
often these functions invoke yet other functions which work on their own
copies. You can end up with your original data temporarily using several
times as much actual memory.
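
To make that concrete, here is a made-up sketch (not the code from that
post) of a column converter that tries more than one parse before deciding
which to keep. Both candidate vectors are built in full before either is
discarded, so for a while the character original, the integer attempt and
the date attempt are all alive at once:

convert_column <- function(x) {
  as_int  <- suppressWarnings(as.integer(x))       # first full-size copy
  as_date <- as.Date(x, format = "%Y-%m-%d")       # second full-size copy
  if (!anyNA(as_int))  return(as_int)
  if (!anyNA(as_date)) return(as_date)
  x                                                # give up, keep the original
}

df <- data.frame(a = c("1", "2", "3"),
                 b = c("2021-11-01", "2021-11-27", "2021-12-25"),
                 stringsAsFactors = FALSE)
df[] <- lapply(df, convert_column)
str(df)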


R does have copy-on-modify semantics, so objects are "shared" until one
copy or another is changed. But in the cases I am looking at, changes are
the whole idea.
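
For what it is worth, tracemem() in base R (available when R is built with
memory profiling, as the usual binaries are) shows exactly when that
sharing ends:

x <- runif(1e6)
tracemem(x)

y <- x          # no duplication yet; x and y point at the same memory
y[1] <- 0       # tracemem reports a copy here, at the moment y is modified

untracemem(x)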


What I wonder is whether such functions should explicitly call rm() or the
equivalent as soon as something is no longer needed.
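
Something along these lines, where summarise_big() and its contents are
just made up for illustration:

summarise_big <- function(x) {
  scaled <- x / max(abs(x))   # large intermediate, the same size as x
  result <- sum(scaled ^ 2)
  rm(scaled)                  # release it now rather than at function exit
  gc()                        # optionally nudge the garbage collector
  result
}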


The various kinds of pipelines are another case in point, as they involve
all kinds of hidden temporary variables that eventually need to be cleaned
up. When are they removed? I have seen pipelines with 10 or more steps as
data is read in, has rows or columns removed or re-ordered, is grouped,
merged with other data and used to generate reports. The intermediates are
often of a similar size to the data and, if large, can add up. If writing
the code linearly, using temp1 and temp2 type variables to hold the output
of one stage and the input of the next, I would be tempted to add an
rm(temp1) as soon as it was finished being used, or just reuse the name
temp1 so the previous contents are no longer pointed to and can be taken by
the garbage collector at some point.
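
As a sketch of both styles, with the file and column names made up purely
for illustration:

temp1 <- read.csv("big.csv")
temp2 <- temp1[temp1$keep, c("id", "value", "group")]
rm(temp1)                                 # the raw data is no longer needed

temp3 <- aggregate(value ~ group, data = temp2, FUN = sum)
rm(temp2)

## ...or reuse one name so each previous object loses its last reference
## and can be collected:
dat <- read.csv("big.csv")
dat <- dat[dat$keep, c("id", "value", "group")]
dat <- aggregate(value ~ group, data = dat, FUN = sum)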


So I wonder if some functions should have a note in their manual pages
specifying what may happen to the volume of data as they run. An example
would be a function that takes a matrix and simply squares it using matrix
multiplication. There are various ways to do this, and one of them simply
makes a copy and invokes the built-in R operator that multiplies two
matrices, then returns the result. You end up storing basically three times
the size of the matrix right before it returns. Other methods might do the
actual multiplication in loops operating on subsections of the matrix and,
if done carefully, never keep more than, say, 2.1 times as much data
around.
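
Roughly like this, with both versions written only to illustrate the
bookkeeping, not as serious implementations:

square_copy <- function(m) {
  tmp <- m * 1        # arithmetic forces a genuine full-size copy of m
  tmp %*% m           # the product is a third full-size matrix
}

square_blocked <- function(m, block = 256L) {
  n   <- ncol(m)
  out <- matrix(0, nrow(m), n)
  for (start in seq(1L, n, by = block)) {
    cols <- start:min(start + block - 1L, n)
    out[, cols] <- m %*% m[, cols]   # one block of columns at a time
  }
  out    # peak use: m, out, plus a couple of block-sized temporaries
}

A <- matrix(rnorm(250000), 500, 500)
all.equal(square_copy(A), square_blocked(A))   # TRUE, same result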


Or is this not important often enough? All I know is that data may be
getting larger much faster than the memory in our machines does.
