[R] Reading File Sizes: very slow!

Rui Barradas ruipbarradas at sapo.pt
Mon Sep 27 09:32:45 CEST 2021


Hello,

R 4.1.1 on Ubuntu 20.04; sessionInfo is at the end.

I'm arriving a bit late to this thread, but here are the timings I'm 
getting on a 10+ year old PC.

1. I am not getting anything even close to 5 or 10 minute running times.
2. Like Bill said, there seems to be a caching effect: the first runs 
are consistently slower. And this is Ubuntu, not Windows, so different 
OSes show the same behavior. It's not unexpected; disk accesses are 
slow operations, and operating systems have been caching them for a 
long time.
3. I am not at all sure whether this is relevant, but to clear the 
Windows File Explorer cache, open a File Explorer window and click

View > Options > (Privacy section) Clear

4. Now for my timings. The cache effect is large, from 23 s down to 
2.5 s, but even on an old PC the timings are nowhere near 300 s or 500 s.

rui@rui:~$ R -q -f rhelp.R
#
# functions size.pkg and size.f.pkg omitted
#
 > R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
 >
 > cat("\nLeonard Mada's code:\n\n")

Leonard Mada's code:

 > system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
    user  system elapsed
   1.700   0.988  23.339
 > system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
    user  system elapsed
   1.578   0.921   2.540
 > system.time({
+         x = size.pkg(path=R_LIBS_USER, file=NULL)
+ })
    user  system elapsed
   1.542   0.949   2.523
 >
 > cat("\nBill Dunlap's code:\n\n")

Bill Dunlap's code:

 > system.time(L1 <- size.f.pkg(R_LIBS_USER))
    user  system elapsed
   1.608   0.887   2.538
 > system.time(L2 <- size.f.pkg(R_LIBS_USER))
    user  system elapsed
   1.515   0.982   2.510
 > identical(L1,L2)
[1] TRUE
 > length(L1)
[1] 1773
 > length(dir(R_LIBS_USER,recursive=TRUE))
[1] 85204
 >
 > cat("\n\nsessionInfo return value:\n\n")


sessionInfo return value:

 > sessionInfo()
R version 4.1.1 (2021-08-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: Ubuntu 20.04.3 LTS

Matrix products: default
BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0

locale:
  [1] LC_CTYPE=pt_PT.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=pt_PT.UTF-8        LC_COLLATE=pt_PT.UTF-8
  [5] LC_MONETARY=pt_PT.UTF-8    LC_MESSAGES=pt_PT.UTF-8
  [7] LC_PAPER=pt_PT.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=pt_PT.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

loaded via a namespace (and not attached):
[1] compiler_4.1.1


And the sapply code, run in two separate R sessions:


rui@rui:~$ R -q -f rhelp2.R
 > R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
 > path <- R_LIBS_USER
 > system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+      function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
    user  system elapsed
   0.802   0.901  15.964
 >
 >
rui@rui:~$ R -q -f rhelp2.R
 > R_LIBS_USER <- Sys.getenv("R_LIBS_USER")
 > path <- R_LIBS_USER
 > system.time({
+ sapply(list.dirs(path=path, full.name=F, recursive=F),
+      function(f) length(list.files(path = file.path(path, f),
+ full.names = FALSE, recursive = TRUE)))
+ })
    user  system elapsed
   0.730   0.528   1.264


Once again, the second run took a fraction of the time of the first.

Leonard, if you are getting those timings, is there another process, 
either running now or run previously, that has evicted the cache?
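
Also, for reproducible cold-cache timings on Linux, you can drop the 
page cache between runs. A minimal sketch, assuming sudo rights (note 
that this clears the cache system-wide, so it affects everything else 
running):

system.time(x <- size.pkg(path=R_LIBS_USER, file=NULL))  # warm cache
# flush pending writes, then drop page cache, dentries and inodes
system("sync; echo 3 | sudo tee /proc/sys/vm/drop_caches > /dev/null")
system.time(x <- size.pkg(path=R_LIBS_USER, file=NULL))  # cold cache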

Hope this helps,

Rui Barradas

At 23:31 on 26/09/21, Leonard Mada via R-help wrote:
> 
> On 9/27/2021 1:06 AM, Leonard Mada wrote:
>>
>> Dear Bill,
>>
>>
>> Does list.files() always sort the results?
>>
>> It seems so. The option full.names = FALSE does not have any effect:
>> the results always seem to be sorted.
>>
>>
>> Maybe it would be better to process the files in an unsorted order,
>> as stored on the disk?
>>
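>> (A quick sanity check suggests the sorting itself is not the
>> bottleneck. A minimal sketch with synthetic names, roughly as many as
>> this library tree holds:
>>
>> x <- replicate(85000, paste(sample(letters, 20, TRUE), collapse=""))
>> system.time(sort(x))  # typically well under a second
>>
>> so the slow part should be the disk metadata access, not the sort.)
>>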
> 
> After some more investigations:
> 
> This took only a few seconds:
> 
> sapply(list.dirs(path=path, full.name=F, recursive=F),
>       function(f) length(list.files(path = paste0(path, "/", f),
> full.names = FALSE, recursive = TRUE)))
> 
> # maybe with caching, but the difference is enormous
> 
> 
> Seems BH contains *by far* the most files: 11701 files.
> 
> But excluding it from processing had only a linear effect: still 377 s.
> 
> 
> I had a look at src/main/platform.c, but do not fully understand it.
> 
> 
> Sincerely,
> 
> 
> Leonard
> 
> 
>>
>> Sincerely,
>>
>>
>> Leonard
>>
>>
>> On 9/25/2021 8:13 PM, Bill Dunlap wrote:
>>> On my Windows 10 laptop I see evidence of the operating system
>>> caching information about recently accessed files.  This makes it
>>> hard to say how the speed might be improved.  Is there a way to clear
>>> this cache?
>>>
>>>> system.time(L1 <- size.f.pkg(R.home("library")))
>>>     user  system elapsed
>>>     0.48    2.81   30.42
>>>> system.time(L2 <- size.f.pkg(R.home("library")))
>>>     user  system elapsed
>>>     0.35    1.10    1.43
>>>> identical(L1,L2)
>>> [1] TRUE
>>>> length(L1)
>>> [1] 30
>>>> length(dir(R.home("library"),recursive=TRUE))
>>> [1] 12949
>>>
>>> On Sat, Sep 25, 2021 at 8:12 AM Leonard Mada via R-help
>>> <r-help at r-project.org> wrote:
>>>
>>>      Dear List Members,
>>>
>>>
>>>      I tried to compute the file sizes of each installed package and the
>>>      process is terribly slow.
>>>
>>>      It took ~ 10 minutes for 512 packages / 1.6 GB total size of files.
>>>
>>>
>>>      1.) Package Sizes
>>>
>>>
>>>      system.time({
>>>               x = size.pkg(file=NULL);
>>>      })
>>>      # elapsed time: 509 s !!!
>>>      # 512 Packages; 1.64 GB;
>>>      # R 4.1.1 on MS Windows 10
>>>
>>>
>>>      The code for the size.pkg() function is below and the latest
>>>      version is
>>>      on Github:
>>>
>>>      https://github.com/discoleo/R/blob/master/Stat/Tools.CRAN.R
>>>
>>>
>>>      Questions:
>>>      Is there a way to get the file size faster?
>>>      It also takes a while on Windows, but on the order of 10-20 s,
>>>      not 10 minutes.
>>>      Am I missing something?
>>>
>>>
>>>      1.b.) Alternative
>>>
>>>      It occurred to me to first read all the file sizes and then use
>>>      tapply or aggregate - but I do not see why it should be faster.
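>>>
>>>      A minimal sketch of that idea (untested): one recursive listing
>>>      of the whole library, then aggregate the sizes by the top-level
>>>      directory, i.e. the package name:
>>>
>>>      size.pkg2 = function(path = R.home("library")) {
>>>          # one directory traversal instead of one per package
>>>          rel = list.files(path, recursive = TRUE, all.files = TRUE);
>>>          pkg = sub("/.*", "", rel);  # top-level dir = package
>>>          tapply(file.size(file.path(path, rel)), pkg, sum);
>>>      }
>>>
>>>      It would at least replace ~500 recursive directory walks with a
>>>      single one.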
>>>
>>>      Would it be meaningful to benchmark each individual package?
>>>
>>>      Although I am not very inclined to wait 10 minutes for each
>>>      new attempt.
>>>
>>>
>>>      2.) Big Packages
>>>
>>>      Just as a note: there are a few very large packages (in my list
>>>      of 512
>>>      packages):
>>>
>>>      1  123,566,287               BH
>>>      2  113,578,391               sf
>>>      3  112,252,652            rgdal
>>>      4   81,144,868           magick
>>>      5   77,791,374 openNLPmodels.en
>>>
>>>      I suspect that sf & rgdal have a lot of duplicated data structures
>>>      and/or duplicate code and/or duplicated libraries - although I am
>>>      not an
>>>      expert in the field and did not check the sources.
>>>
>>>
>>>      Sincerely,
>>>
>>>
>>>      Leonard
>>>
>>>      =======
>>>
>>>
>>>      # Package Size:
>>>      size.f.pkg = function(path=NULL) {
>>>           if(is.null(path)) path = R.home("library");
>>>           xd = list.dirs(path = path, full.names = FALSE,
>>>               recursive = FALSE);
>>>           size.f = function(p) {
>>>               p = paste0(path, "/", p);
>>>               sum(file.info(list.files(path=p, pattern=".",
>>>                   full.names = TRUE, all.files = TRUE,
>>>                   recursive = TRUE))$size);
>>>           }
>>>           sapply(xd, size.f);
>>>      }
>>>
>>>      size.pkg = function(path=NULL, sort=TRUE, file="Packages.Size.csv") {
>>>           x = size.f.pkg(path=path);
>>>           x = as.data.frame(x);
>>>           names(x) = "Size"
>>>           x$Name = rownames(x);
>>>           # Order
>>>           if(sort) {
>>>               id = order(x$Size, decreasing=TRUE)
>>>               x = x[id,];
>>>           }
>>>           if( ! is.null(file)) {
>>>               if( ! is.character(file)) {
>>>                   print("Error: Size NOT written to file!");
>>>               } else write.csv(x, file=file, row.names=FALSE);
>>>           }
>>>           return(x);
>>>      }
>>>
> 


