[Rd] Model object, when generated in a function, saves entire environment when saved

Kenny Bell kmbell56 using gmail.com
Wed Jan 29 20:25:53 CET 2020


Reviving an old thread. I haven't seen this problem for a while when
saving RDS files, which is great. However, I noticed it again when saving
`qs` files (https://github.com/traversc/qs); `qs` is an RDS replacement
with a fast serialization / compression system.

I'd like to get an idea of what change was made within R to address this
issue for `saveRDS`. My thought is that this will help the author of the
`qs` package do something similar. I browsed the release notes for the
last few years (searching for "environment") but couldn't find it.

Many thanks for any help and best wishes to all.

The following code uses R 3.6.2 and requires you to run
install.packages("qs") first:

save_size_qs <- function (object) {
  tf <- tempfile(fileext = ".qs")
  on.exit(unlink(tf))
  qs::qsave(object, file = tf)
  file.size(tf)
}

save_size_rds <- function (object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}

normal_lm <- function(){
  junk <- 1:1e+08
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

normal_ggplot <- function(){
  junk <- 1:1e+08
  ggplot2::ggplot()
}

clean_lm <- function () {
  junk <- 1:1e+08
  # Run the lm in its own environment
  env <- new.env(parent = globalenv())
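  # Carried over from tfun1 below; with no `subset` argument in this
  # function, this line copies base::subset into env
  env$subset <- subset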
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

# The qs save size includes the junk but the rds does not
save_size_qs(normal_lm())
#> [1] 848396
save_size_rds(normal_lm())
#> [1] 4163
save_size_qs(normal_ggplot())
#> [1] 857446
save_size_rds(normal_ggplot())
#> [1] 12895
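
One way to probe the difference above is to vary how `junk` is stored. The sketch below rests on an assumption that this thread does not confirm: since R 3.5.0, `1:n` may be held as a compact sequence (ALTREP) that serializes in a few bytes, so `saveRDS` could still be writing the function's environment, only very cheaply. `rds_size`, `compact_junk_lm` and `dense_junk_lm` are names made up for this illustration.

```r
# Size in bytes of an object written with saveRDS()
rds_size <- function(object) {
  tf <- tempfile(fileext = ".rds")
  on.exit(unlink(tf))
  saveRDS(object, file = tf)
  file.size(tf)
}

# junk is an integer sequence, which R >= 3.5.0 may store compactly
compact_junk_lm <- function() {
  junk <- 1:1e6
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

# junk is a fully materialised double vector of the same length
dense_junk_lm <- function() {
  junk <- rnorm(1e6)
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

size_compact <- rds_size(compact_junk_lm())
size_dense   <- rds_size(dense_junk_lm())

# If size_dense is in the megabytes while size_compact stays small, the
# environment is still being saved and only the cost of junk has changed.
c(compact = size_compact, dense = size_dense)
```

If the two sizes differ by orders of magnitude, that would suggest looking at how `qs` handles such compact vectors rather than at how it handles environments.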


# Both exclude the junk when separating the lm into its own environment
save_size_qs(clean_lm())
#> [1] 6154
save_size_rds(clean_lm())
#> [1] 4255
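
To see what the separate environment changes, one can look at the environment attached to the model's terms, which is what serialization follows. A small sketch, with `junk` shrunk to keep it quick and both function names (`frame_lm`, `isolated_lm`) invented here; it also checks that `predict()` still works on the isolated fit:

```r
# Fit inside the function's own frame, next to junk
frame_lm <- function() {
  junk <- 1:1e6
  lm(Sepal.Length ~ Sepal.Width, data = iris)
}

# Fit inside a fresh environment whose parent is the global environment
isolated_lm <- function() {
  junk <- 1:1e6
  env <- new.env(parent = globalenv())
  with(env, lm(Sepal.Length ~ Sepal.Width, data = iris))
}

fit_frame    <- frame_lm()
fit_isolated <- isolated_lm()

# Objects stored in the environment attached to each fit's terms
vars_frame    <- ls(environment(terms(fit_frame)))
vars_isolated <- ls(environment(terms(fit_isolated)))
vars_frame     # expected to include "junk"
vars_isolated  # expected to be empty

# The terms object is intact, so prediction on new data still works
pred <- predict(fit_isolated, newdata = data.frame(Sepal.Width = 3))
pred
```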


On Thu, Jul 28, 2016 at 7:31 AM Kenny Bell <kmbell56 using gmail.com> wrote:

> Thanks so much for all this.
>
> The first solution is what I'm going with as I want the terms object to
> come along so that predict still works.
>
> On Wed, Jul 27, 2016 at 12:28 PM, William Dunlap via R-devel <
> r-devel using r-project.org> wrote:
>
>> Another solution is to only save the parts of the model object that
>> interest you.  As long as they don't include the formula (which is
>> what drags along the environment it was created in), you will
>> save space.  E.g.,
>>
>> tfun2 <- function(subset) {
>>    junk <- 1:1e6
>>    list(subset=subset,
>>         lm(Sepal.Length ~ Sepal.Width, data=iris, subset=subset)$coef)
>> }
>>
>> saveSize(tfun2(1:4))
>> #[1] 152
>>
>>
>>
>> Bill Dunlap
>> TIBCO Software
>> wdunlap tibco.com
>>
>> On Wed, Jul 27, 2016 at 11:19 AM, William Dunlap <wdunlap using tibco.com>
>> wrote:
>>
>> > One way around this problem is to make a new environment whose
>> > parent environment is .GlobalEnv and which contains only what the
>> > call to lm() requires and to compute lm() in that environment.
>> > E.g.,
>> >
>> > tfun1 <- function (subset)
>> > {
>> >     junk <- 1:1e+06
>> >     env <- new.env(parent = globalenv())
>> >     env$subset <- subset
>> >     with(env, lm(Sepal.Length ~ Sepal.Width, data = iris,
>> >                  subset = subset))
>> > }
>> > Then we get
>> >    > saveSize(tfun1(1:4)) # see below for def. of saveSize
>> >    [1] 910
>> > instead of the 2129743 bytes in the save file when using the naive
>> > method.
>> >
>> > saveSize <- function (object) {
>> >     tf <- tempfile(fileext = ".RData")
>> >     on.exit(unlink(tf))
>> >     save(object, file = tf)
>> >     file.size(tf)
>> > }
>> >
>> >
>> >
>> > Bill Dunlap
>> > TIBCO Software
>> > wdunlap tibco.com
>> >
>> > On Wed, Jul 27, 2016 at 10:48 AM, Kenny Bell <kmb56 using berkeley.edu>
>> > wrote:
>> >
>> >> In the code below, I generate a model from an environment that isn't
>> >> .GlobalEnv with a large object that is unrelated to the model
>> >> generation. It seems to save the irrelevant object unnecessarily. In
>> >> my actual use case, I am running and saving many models in a loop that
>> >> each use a single large data.frame (that gets collapsed into a small
>> >> data.frame for estimation), so removing it isn't an option.
>> >>
>> >> In the case where the model exists in .GlobalEnv, everything is
>> >> peachy. So replicating whatever happens when saving the model that was
>> >> generated in .GlobalEnv at the return() stage of the function call
>> >> would fix this problem.
>> >>
>> >> I was referred to this list from r-bugs. First time r-devel poster.
>> >>
>> >> Hope this helps,
>> >>
>> >> Kendon
>> >>
>> >> ```
>> >> tmp_fun <- function(x){
>> >>   iris_big <- lapply(1:10000, function(x) iris)
>> >>   lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >> }
>> >>
>> >> out <- tmp_fun(1)
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 57196752 - way too big
>> >>
>> >> # Works fine when in .GlobalEnv
>> >> iris_big <- lapply(1:10000, function(x) iris)
>> >> out <- lm(Sepal.Length ~ Sepal.Width, data = iris)
>> >>
>> >> object.size(out)
>> >> # 48008
>> >> save(out, file = "tmp.RData", compress = FALSE)
>> >> file.size("tmp.RData")
>> >> # 16641 - good size.
>> >> ```
>> >>
>> >> ______________________________________________
>> >> R-devel using r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >>