[R] Reading File Sizes: very slow!

Richard O'Keefe
Sun Sep 26 04:49:09 CEST 2021


On a $150 second-hand laptop with 0.9 GB of library, and a single-user
installation of R (so there is only one place to look), the commands

LIBRARY=$HOME/R/x86_64-pc-linux-gnu-library/4.0
cd "$LIBRARY"
echo "kbytes package"
du -sk * | sort -k1n

took 150 msec to report the disc space needed for every package.
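
For comparison inside R, the same per-package totals can be sketched with one recursive listing of the library followed by tapply(), rather than one listing per package. This is only an illustration: the name pkg_sizes_onepass is made up here, no timing is claimed for it, and file.size() reports apparent bytes whereas du reports allocated disc blocks, so the two will not agree exactly.

```r
# Sketch only: pkg_sizes_onepass is a hypothetical name, not from this thread.
# Lists every file under `path` once, then sums sizes per top-level directory.
pkg_sizes_onepass <- function(path = R.home("library")) {
  # Paths relative to `path`, e.g. "stats/R/stats.rdb"
  rel <- list.files(path, all.files = TRUE, recursive = TRUE,
                    full.names = FALSE)
  # Keep only files that live inside some top-level directory
  keep <- grepl("/", rel, fixed = TRUE)
  if (!any(keep)) return(numeric(0))
  rel <- rel[keep]
  size <- file.size(file.path(path, rel))   # bytes; NA if a file vanished
  top <- sub("/.*$", "", rel)               # first path component = package
  totals <- tapply(size, top, sum, na.rm = TRUE)
  sort(totals, decreasing = TRUE)
}

# Usage (not run here): head(pkg_sizes_onepass(), 5)
```

The result is a named vector in bytes, largest first; dividing by 1024 gives figures roughly comparable to the kbytes column above.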

That'

On Sun, 26 Sept 2021 at 06:14, Bill Dunlap <williamwdunlap using gmail.com> wrote:
>
> On my Windows 10 laptop I see evidence of the operating system caching
> information about recently accessed files.  This makes it hard to say how
> the speed might be improved.  Is there a way to clear this cache?
>
> > system.time(L1 <- size.f.pkg(R.home("library")))
>    user  system elapsed
>    0.48    2.81   30.42
> > system.time(L2 <- size.f.pkg(R.home("library")))
>    user  system elapsed
>    0.35    1.10    1.43
> > identical(L1,L2)
> [1] TRUE
> > length(L1)
> [1] 30
> > length(dir(R.home("library"),recursive=TRUE))
> [1] 12949
>
> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help <
> r-help using r-project.org> wrote:
>
> > Dear List Members,
> >
> >
> > I tried to compute the file sizes of each installed package and the
> > process is terribly slow.
> >
> > It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
> >
> >
> > 1.) Package Sizes
> >
> >
> > system.time({
> >          x = size.pkg(file=NULL);
> > })
> > # elapsed time: 509 s !!!
> > # 512 Packages; 1.64 GB;
> > # R 4.1.1 on MS Windows 10
> >
> >
> > The code for the size.pkg() function is below and the latest version is
> > on Github:
> >
> > https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
> >
> >
> > Questions:
> > Is there a way to get the file size faster?
> > It takes a while on Windows as well, but on the order of 10-20 s, not 10
> > minutes.
> > Am I missing something?
> >
> >
> > 1.b.) Alternative
> >
> > It occurred to me to read all file sizes first and then use tapply or
> > aggregate, but I do not see why that should be faster.
> >
> > Would it be meaningful to benchmark each individual package?
> >
> > Although I am not very inclined to wait 10 minutes for each new attempt.
> >
> >
> > 2.) Big Packages
> >
> > Just as a note: there are a few very large packages (in my list of 512
> > packages):
> >
> > 1  123,566,287               BH
> > 2  113,578,391               sf
> > 3  112,252,652            rgdal
> > 4   81,144,868           magick
> > 5   77,791,374 openNLPmodels.en
> >
> > I suspect that sf & rgdal have a lot of duplicated data structures
> > and/or duplicate code and/or duplicated libraries - although I am not an
> > expert in the field and did not check the sources.
> >
> >
> > Sincerely,
> >
> >
> > Leonard
> >
> > =======
> >
> >
> > # Package Size:
> > # Returns a named vector: total file size in bytes per package directory.
> > size.f.pkg = function(path=NULL) {
> >      if(is.null(path)) path = R.home("library");
> >      # Top-level directories only: one per installed package
> >      xd = list.dirs(path = path, full.names = FALSE, recursive = FALSE);
> >      size.f = function(p) {
> >          p = file.path(path, p);
> >          # Sum the sizes of all files below the package directory
> >          sum(file.info(list.files(path=p,
> >              full.names = TRUE, all.files = TRUE, recursive = TRUE))$size);
> >      }
> >      sapply(xd, size.f);
> > }
> >
> > size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
> >      x = size.f.pkg(path=path);
> >      x = as.data.frame(x);
> >      names(x) = "Size"
> >      x$Name = rownames(x);
> >      # Order
> >      if(sort) {
> >          id = order(x$Size, decreasing=TRUE)
> >          x = x[id,];
> >      }
> >      if( ! is.null(file)) {
> >          if( ! is.character(file)) {
> >              # Signal the problem as a warning instead of printing a string
> >              warning("Size NOT written to file!");
> >          } else write.csv(x, file=file, row.names=FALSE);
> >      }
> >      return(x);
> > }
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> > http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> >
>
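
One thing that may be worth measuring is whether file.info(), which the quoted size.f.pkg() calls, costs more than file.size(), a convenience wrapper that returns only the size column. The following is a rough sketch for making that comparison on the same list of files; time_sizes is a hypothetical helper name and no result is claimed here. Because of the operating system caching Bill Dunlap describes, whichever call runs second may look faster regardless, so it is worth repeating the run or swapping the order.

```r
# Sketch only: time_sizes is a hypothetical helper, not from this thread.
# Times file.info() and file.size() on the same vector of file paths.
time_sizes <- function(files) {
  t_info <- system.time(s1 <- file.info(files)$size)[["elapsed"]]
  t_size <- system.time(s2 <- file.size(files))[["elapsed"]]
  list(
    same    = identical(s1, s2),              # both should give the same bytes
    elapsed = c(file.info = t_info, file.size = t_size)
  )
}

# Usage (not run here; this can take a while on a large library):
# files <- list.files(R.home("library"), full.names = TRUE,
#                     all.files = TRUE, recursive = TRUE)
# time_sizes(files)
```

If the two timings differ substantially on a cold cache, swapping file.info(...)$size for file.size(...) inside size.f.pkg() would be the obvious next experiment.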


