[R] How to reduce memory demands in a function?

jim holtman jholtman at gmail.com
Wed Sep 9 16:12:10 CEST 2009


If it is taking 10 hours to run, you might want to use Rprof() on a
small subset to see where the time is being spent.  Also, can you use
matrices instead of data frames?  There is a cost in accessing data
frames if you are only using them in a matrix-like way.  Take a look at
the output from Rprof() and that may help identify places to optimize
the code.  Also, how large are the distance matrices that you are
creating?  Depending on your input, these can be large.
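
For example, a minimal sketch of profiling a reduced problem with Rprof()
(the function and problem size here are hypothetical stand-ins for your own):

```r
## Hypothetical stand-in for the real classification function:
slow_fn <- function(n) sum(dist(matrix(runif(n * 7), nrow = n)))

Rprof("profile.out")        # start sampling the call stack
invisible(slow_fn(4000))    # run on a reduced problem size
Rprof(NULL)                 # stop profiling
head(summaryRprof("profile.out")$by.self)  # where the time actually goes
```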

On Wed, Sep 9, 2009 at 9:24 AM, Richard Gunton<r.gunton at talk21.com> wrote:
>
> Dear Jim,
>
> Thanks for telling me about gc() - that may help..
>
> Here's some more detail on my function.  It's to implement a methodology for
> classifying plant communities into functional groups according to the values
> of species traits, and comparing the usefulness of alternative
> classifications by using them to summarise species abundance data and then
> correlating site-differences based on these abundance data with
> site-differences based on environmental variables.  The method is described
> in Pillar et al (2009), J. Vegetation Science 20: 334-348.
>
> First the function produces a set of classifications of species by applying
> the function agnes() from the library "cluster" to all possible combinations
> of the variables of a species-by-traits dataframe Q.  It also prepares a
> distance matrix dR based on environmental variables for the same sites.
> Then there is a loop that takes each ith classification in turn and
> summarises a raw-data dataframe of species abundances into a shorter
> dataframe Xi, by grouping clusters of its rows according to the
> classification. It then calculates a distance matrix (dXi) based on this
> summary of abundances, and another distance matrix (dQi) based on
> corresponding variables of the matrix Q directly.  Finally in the loop,
> mantel.partial() from the library "vegan" is used to run a partial mantel
> test between dXi and dR, conditioned on dQi.  The argument "permutations" is
> set to zero, and only the mantel statistic is stored.
>
> The loop also contains a forward stepwise selection procedure so that not
> all classifications are actually used. After all classifications using
> (e.g.) a single variable have been used, the variable(s) involved in the
> best classification(s) are specified for inclusion in the next cycle of the
> loop. I wonder how lucid that all was...
>
> I began putting together the main parts of the code, but I fear it takes so
> much explanation (not to mention editing for transparency) that it may not
> be worth the effort unless someone is really committed to following it
> through - it's about 130 lines in total.  I could still do this if it's
> likely to be worthwhile...
>
> However, the stepwise procedure was only fully implemented after I sent my
> first email.  Now that none of the iterative output is stored except the
> final mantel statistic and essential records of which classifications were
> used, the memory demand has decreased.  The problem that now presents itself
> is simply that the function still takes a very long time to run (e.g. 10
> hours to go through 7 variables stepwise, with distance matrices of
> dimension 180 or so).
>
> Two parts of the code that feel clumsy to me already are:
> unstack(stack(by(X, f, colSums)))   # to reduce a dataframe X to a dataframe
> with fewer rows by summing within sets of rows defined by the factor f
> and
> V <- list(); for (i in 1:n) V[[i]] <- grep(pattern[i], x); names(V) <- 1:n;
> V <- stack(V); V[,1]   # to get the indices of multiple matches from x
> (which is a vector of variable names, some of which may be repeated) for each
> of several character strings in pattern
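>
> For what it's worth, here is a sketch of two tidier base-R equivalents (the
> data below are made up purely for illustration):
>
> ```r
> X <- data.frame(a = 1:4, b = 5:8)
> f <- factor(c("g1", "g1", "g2", "g2"))
> Xi <- rowsum(X, f)     # sums rows within groups of f; replaces unstack(stack(by(...)))
>
> x <- c("height", "SLA", "height")   # variable names, some possibly repeated
> pattern <- c("height", "SLA")
> V <- stack(setNames(lapply(pattern, grep, x = x), seq_along(pattern)))
> V$values               # indices of all matches, grouped by pattern
> ```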
>
> The code also involves relating columns of dataframes to each other using
> character-matching - e.g. naming columns with paste() -ed strings of all the
> variables used to create them, and then subsequently strsplit() -ing these
> names so that columns can be selected that contain any of the specified
> variable names.
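>
> For instance, a sketch of that name-matching approach (the trait names here
> are hypothetical):
>
> ```r
> cols   <- c("height.SLA", "height.seedmass", "SLA")   # paste()-ed column names
> parts  <- strsplit(cols, ".", fixed = TRUE)           # split back into traits
> wanted <- "SLA"
> keep   <- vapply(parts, function(p) any(p %in% wanted), logical(1))
> cols[keep]   # the columns whose names contain "SLA"
> ```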
>
> I'm grateful for any advice!
>
> Thanks,   Richard.
>
> ________________________________
> From: jim holtman <jholtman at gmail.com>
> To: Richard Gunton <r.gunton at talk21.com>
> Sent: Tuesday, 8 September, 2009 2:08:51 PM
> Subject: Re: [R] How to reduce memory demands in a function?
>
> Can you at least post what the function is doing or, better yet,
> provide commented, minimal, self-contained, reproducible code.  You
> can put in calls to memory.size() to see how large things are growing,
> delete temporary objects when they are not needed, make calls to gc(), ... ,
> but it is hard to tell without an example.
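>
> A minimal sketch of that housekeeping (sizes here are arbitrary; note that
> memory.size() is Windows-only, so object.size() is used instead):
>
> ```r
> m   <- matrix(runif(180 * 7), nrow = 180)
> dXi <- dist(m)              # a temporary distance matrix
> print(object.size(dXi))     # see how large it is
> rm(dXi)                     # delete it when no longer needed
> gc()                        # prompt R to release the memory
> ```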
>
>
> On Mon, Sep 7, 2009 at 4:16 AM, Richard Gunton<r.gunton at talk21.com> wrote:
>>
> I've written a function that regularly throws the "cannot allocate vector of
> size X Kb" error, since it contains a loop that creates large numbers of big
> distance matrices. I'd be very grateful for any simple advice on how to
> reduce the memory demands of my function.  Besides increasing memory.size to
> the maximum available, I've tried rounding my "dist" objects to 3 significant
> figures (not sure if that made any difference), I've tried the distance function
> daisy() from package "cluster" instead of dist(), and I've avoided storing
> unnecessary intermediary objects as far as possible by nesting functions in
> the same command.  I've even tried writing each of my dist() objects to a
> text file, one line for each, and reading them in again one at a time as and
> when required, using scan() - and although this seemed to avoid the memory
> problem, it ran so slowly that it wasn't much use for someone with deadlines
> to meet...
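>
> In case it helps to see it concretely, a sketch of that write-out/scan-back
> scheme (file and object names are hypothetical):
>
> ```r
> d1 <- dist(matrix(runif(30), nrow = 10))
> d2 <- dist(matrix(runif(30), nrow = 10))
> tf <- tempfile()
> for (d in list(d1, d2)) cat(as.vector(d), "\n", file = tf, append = TRUE)
> d2.back <- scan(tf, skip = 1, nlines = 1, quiet = TRUE)  # re-read the 2nd matrix
> ```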
>
> I don't have formal training in programming, so if there's something handy I
> should read, do let me know.
>
> Thanks,
>
> Richard Gunton.
>
> Postdoctoral researcher in arable weed ecology, INRA Dijon.
>
>
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



