[Rd] options(keep.source = TRUE) -- also for "library(.)" ?

Martin Maechler <maechler@stat.math.ethz.ch>
Fri, 28 Apr 2000 18:19:50 +0200 (CEST)


I'm replying to myself once more :

[and this gets more and more involved, please "d" if you're not interested ..]

>>>>> "MM" == Martin Maechler <maechler@stat.math.ethz.ch> writes:

    MM> (and I haven't seen more feedback..)

>>>>> "PD" == Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk> writes:

      ........

    MM> Of course we now could even make 
    MM>	   keep.source = getOption("keep.source") 
    MM> an argument to library(), being propagated to sys.source(..).

 MM>   I'm considering committing the necessary changes and adding the following to
 MM>    NEWS [for "R-devel"]
 MM> 
 MM>    o	library(), require(), and sys.source() have a new argument
 MM> 	` keep.source = getOption("keep.source") '.
 MM> 
 MM>   Hence, by default, functions from all packages (not just base)
 MM>   `keep their source'.
 MM> 
 MM>   Is this okay for everyone ?

Now, I still haven't committed the new code, but I have been using it
myself and have compiled a "big picture" statistic, *using* the new code
and gc(), for many packages (actually for all CRAN packages and more)
to find out how much extra memory keep.source = TRUE costs.
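
A measurement of this kind can be sketched as follows. This is only an
illustration, not the script I used: cells_used() and load_cost() are
hypothetical helper names, the tiny temporary file stands in for a package's
R code, and it assumes an R version in which sys.source() accepts the
keep.source argument proposed above. For a file this small the differences
are noisy; the point is only the before/after bookkeeping with gc().

```r
# Cells currently in use, as reported by gc() (rows Ncells and Vcells)
cells_used <- function() gc()[, "used"]

# A small stand-in for a package's R code (hypothetical content)
src_file <- tempfile(fileext = ".R")
writeLines(c("f <- function(x) {",
             "  # a comment that only survives if the source is kept",
             "  x + 1",
             "}"),
           src_file)

# Difference in cells used before and after sourcing the file
load_cost <- function(keep) {
  env <- new.env()
  before <- cells_used()
  sys.source(src_file, envir = env, keep.source = keep)
  cells_used() - before
}

print(load_cost(TRUE))   # cost with the source kept
print(load_cost(FALSE))  # cost without it
```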

Here are the results :

I show the difference in memory usage {Vcells & Ncells, see ?gc & ?Memory}
for interesting packages, only using R builtin and CRAN (non-Devel) packages:

   Package     Extra bytes      Ncells   Vcells
               (keep.source = TRUE)

   nlme          2'305'364       19023   107659   (actually nlme + nls)
   survival5     1'066'776        8867    49792
   MASS            631'628        5186    29507
   mclust          493'512        4349    22936
   boot            456'944        3833    21314
   ctest           309'288        2406    14502
   ts              297'368        2311    13944
   cluster         244'120        2270    11298
   nls             236'668        1871    11085
   wavethresh      218'624        1878    10180
   mda             215'944        1878    10046
   rpart           203'892        1654     9533
   chron           194'640        1735     9038
   tseries         183'360        1505     8566
   locfit          176'416        1632     8168
   tree            166'844        1248     7843
   modreg          116'752         989     5442
   nnet             98'124         838     4571
   splines          85'112         769     3948
   mva              79'280         710     3680
   lqs              34'116         292     1589
   eda              10'860         105      501
   zmatrix           7'196          82      327
   Devore5               0           0        0   [took this one as a "test"]

I.e., for nlme one needs an extra 2.3 MBytes of memory just for
"keep.source = TRUE".

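The byte figures in the table can be cross-checked against the two cell
counts. The sketch below takes three rows from the table and assumes
per-cell sizes of 8 and 20 bytes; those sizes are inferred from the numbers
themselves rather than stated anywhere here, so see ?Memory for the sizes
that apply to a given build. The column names col1 and col2 are just labels
for the first and second count column.

```r
# Three rows copied from the table above
tab <- data.frame(
  package = c("nlme", "survival5", "MASS"),
  bytes   = c(2305364, 1066776, 631628),
  col1    = c(19023, 8867, 5186),     # first count column
  col2    = c(107659, 49792, 29507)   # second count column
)

# Assumed cell sizes: 8 bytes for the first column, 20 for the second
tab$reconstructed <- 8 * tab$col1 + 20 * tab$col2

print(tab)
all(tab$reconstructed == tab$bytes)   # do the rows add up?
```
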
I investigated a bit further how much memory the "keep.source" of base
costs.
Note that I still don't know how to turn it off easily for base (Peter ?).

However, I just counted how much "source" is in base :
  > length(ob <- ls(pos= match("package:base",search()), all.nam = TRUE))
  [1] 1193
  > length(fns <- ob[sapply(ob, function(n)is.function(get(n)))])
  [1] 1169
  > stem(len.src <- sapply(fns, function(n)sum(nchar(attr(get(n),"source")))))

   The decimal point is 3 digit(s) to the right of the |
  
    0 | 00000000000000000000000000000000000000000000000000000000000000000000+980
    1 | 00000000000112222233333333334444555555666667777778888899
    2 | 00000012333334444555556666777788899
    3 | 12234477
    4 | 15669
    5 | 34
    6 | 
    7 | 35
    8 | 
    9 | 
   10 | 2
		(guess *which* is the  outlier  ;-)

  > sum(len.src)
  [1] 359964

i.e., only ~360'000 characters.

Now compare this with  survival5 which was scoring pretty high above :

  > library(survival5, keep.source = TRUE)
  > length(ob <- ls(pos= match("package:survival5",search()), all.nam = TRUE))
  [1] 117
  > length(fns <- ob[sapply(ob, function(n)is.function(get(n)))])
  [1] 116
  > stem(len.src <- sapply(fns, function(n)sum(nchar(attr(get(n),"source")))))

    The decimal point is 3 digit(s) to the right of the |

     0 | 00000000011111111111111112222222233344444455555789
     1 | 0001122334445567777899
     2 | 0001134555555789
     3 | 02233445567
     4 | 12368
     5 | 2478
     6 | 14799
     7 | 
     8 | 0
     9 | 
    10 | 
    11 | 
    12 | 
    13 | 
    14 | 
    15 | 4
    16 | 
    17 | 3

  > sum(len.src)
  [1] 235633

i.e.  about 2/3 of "base".

(but then base has "source" attributes for many more objects)
A very crude extrapolation suggests that turning off the "keep.source" for
"base" would save about 1.5 MBytes of RAM {I'd guess even more..}

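The extrapolation can be spelled out as plain arithmetic. The sketch below
uses only numbers quoted above and assumes, very crudely, that the memory
cost scales linearly with the number of source characters; the variable
names are made up for the illustration.

```r
chars_base <- 359964    # sum(len.src) for base
chars_surv <- 235633    # sum(len.src) for survival5
bytes_surv <- 1066776   # extra bytes for survival5, from the table

# Size of the survival5 source relative to base ("about 2/3")
ratio <- chars_surv / chars_base

# Bytes per source character in survival5, scaled up to base
est_base_bytes <- bytes_surv / chars_surv * chars_base

print(round(ratio, 2))
print(round(est_base_bytes / 1e6, 2))   # estimate in MBytes
```
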
After all this testing, I think what we really want is
"keep.source = FALSE" (including for "base" !)
WHEN working with large data, working on smallish machines,
    or for all "batch" processing.

Hence I'd propose

1.  
  options(keep.source = interactive())

  in the default profile

2. {as proposed earlier today -- see below}

  provide a command line option to turn it on or off.
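
Point 1 amounts to a single line in a profile. The sketch below shows that
line and a small check of what the option does to a freshly parsed function.
The helper has_source() is hypothetical, and the check looks at the "srcref"
attribute, which is an assumption about the R version in use (the session
above reads a "source" attribute instead).

```r
# Proposed default: keep the source only in interactive sessions
options(keep.source = interactive())

# Does a function parsed under a given setting carry its source?
has_source <- function(keep) {
  old <- options(keep.source = keep)
  on.exit(options(old))   # restore the previous setting on exit
  f <- eval(parse(text = "function(x) x + 1  # trailing comment"))
  !is.null(attr(f, "srcref"))
}

print(has_source(TRUE))
print(has_source(FALSE))
```
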
------------

    PD> The real question is whether we want to have a different mechanism
    PD> for controlling whether keep.source is set or not. 

    MM> right.

    PD> Originally it was FALSE for the base library to save space, and
    PD> accordingly the same setting was used for other libraries since some
    PD> of them are rather large, but later it got flipped to TRUE for
    PD> base,
    MM> (yes, I'm still wondering...)
    PD> and then there is little point in setting it FALSE for packages. 
    PD> Question is whether anyone would want the old behaviour
    PD> back to get more space for analyses?

  would be nice if it *was* configurable for base as well;
  possibly both via cmd line option
  (something like --keepsource / --no-keepsource )
  and a setting in Rprofile..

  MM> From grepping through the source code, I don't see how it was turned off
  MM> for base...

 anyone [R-core] ?
