[Rd] execution time of .packages

Romain Francois romain.francois at dbmail.com
Tue Mar 3 12:59:26 CET 2009


Prof Brian Ripley wrote:
> On Tue, 3 Mar 2009, Romain Francois wrote:
>
>> Prof Brian Ripley wrote:
>>> The caching is in the disc system: you need to find and read the 
>>> package metadata for every package.  AFAIK it is not easy to flush 
>>> the disc cache, but quite easy to overwrite it with later reads.  
>>> (Google for more info.)
>> Thanks for the info, I'll try to find my way with these directions.
>>
>>> If you are not concerned about validity of the installed packages 
>>> you could skip the tests and hence the reads.
>>>
>>> Your times are quite a bit slower than mine, so a faster disc system 
>>> might help.  Since my server has just been rebooted (for a new 
>>> kernel), with all of CRAN and most of BioC I get
>>>
>>>> system.time( packs <- .packages( all = T ) )
>>>    user  system elapsed
>>>   0.518   0.262  25.042
>>>> system.time( packs <- .packages( all = T ) )
>>>    user  system elapsed
>>>   0.442   0.080   0.522
>>>> length(packs)
>>> [1] 2096
>>>
>>> There's a similar issue when installing packages: the Perl code 
>>> reads the indices from every visible package to resolve links, and 
>>> that can be slow the first time.
>>>
>>>
>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>
>>>> Hello,
>>>>
>>>> The first time in a session I call .packages( all.available = T ), 
>>>> it takes a long time (I have many packages installed from CRAN):
>>>>
>>>>> system.time( packs <- .packages( all = T ) )
>>>> user  system elapsed
>>>> 0.738   0.276  43.787
>>>>
>>>> When I call it again, the time is now much reduced, so there must 
>>>> be some caching somewhere. I would like to try to reduce the  time 
>>>> it takes the first time, but I have not been able to identify where 
>>>> the caching takes place, and so how I can remove it to try to 
>>>> improve the running time without the caching. Without this, I have 
>>>> to restart my computer each time to vanish the caching to test a 
>>>> new version of the function (this is not going to happen)
>>>>
>>>> Here is the .packages function, I am suspicious about this part : 
>>>> "ans <- c(ans, nam)" which grows the ans vector each time a 
>>>> suitable package is found, this does not sound right.
>>>
>>> It's OK as there are only going to be ca 2000 packages.  Try 
>>> profiling this: .readRDS and grepl take most of the time.
>
>> I usually do not trust the result of the profiler when a for loop is 
>> involved, as it tends to miss the point (or maybe I am).
>
> Here are the data for the actual example (repeated for this message):
>
>> Rprof()
>> system.time( packs <- .packages( all = T ) )
>    user  system elapsed
>   0.447   0.078   0.525
>> Rprof(NULL)
>> summaryRprof()
> $by.self
>                    self.time self.pct total.time total.pct
> "grepl"                 0.18     34.6       0.18      34.6
> ".readRDS"              0.12     23.1       0.20      38.5
> ".packages"             0.08     15.4       0.50      96.2
> "close.connection"      0.04      7.7       0.04       7.7
> "close"                 0.02      3.8       0.06      11.5
> "file.exists"           0.02      3.8       0.02       3.8
> "gc"                    0.02      3.8       0.02       3.8
> "gzfile"                0.02      3.8       0.02       3.8
> "list"                  0.02      3.8       0.02       3.8
> "system.time"           0.00      0.0       0.52     100.0
> "file.path"             0.00      0.0       0.02       3.8
>
> $by.total
>                    total.time total.pct self.time self.pct
> "system.time"            0.52     100.0      0.00      0.0
> ".packages"              0.50      96.2      0.08     15.4
> ".readRDS"               0.20      38.5      0.12     23.1
> "grepl"                  0.18      34.6      0.18     34.6
> "close"                  0.06      11.5      0.02      3.8
> "close.connection"       0.04       7.7      0.04      7.7
> "file.exists"            0.02       3.8      0.02      3.8
> "gc"                     0.02       3.8      0.02      3.8
> "gzfile"                 0.02       3.8      0.02      3.8
> "list"                   0.02       3.8      0.02      3.8
> "file.path"              0.02       3.8      0.00      0.0
>
> $sampling.time
> [1] 0.52
>
> there is little tiime unaccounted for, and 0.38 sec is going in 
> .readRDS and grepl.  Whereas
>
> system.time({
> ans <- character(0)
> for(i in 1:2096) ans <- c(ans, "foo")
> })
>
> takes 0.024 secs, negligible here (one profiler tick).
Here is what happens to me if I restart the computer:

 > Rprof( )
 > system.time( packs <- .packages( all = T ) )
   user  system elapsed
  0.888   0.342  35.589
 > Rprof(NULL)
 > summaryRprof()
$by.self
                   self.time self.pct total.time total.pct
".readRDS"              0.34     28.8       0.64      54.2
".packages"             0.14     11.9       1.16      98.3
"file.exists"           0.14     11.9       0.14      11.9
"gzfile"                0.12     10.2       0.16      13.6
"close"                 0.10      8.5       0.14      11.9
"grepl"                 0.08      6.8       0.10       8.5
"$"                     0.08      6.8       0.08       6.8
"file.path"             0.06      5.1       0.06       5.1
"close.connection"      0.04      3.4       0.04       3.4
"getOption"             0.02      1.7       0.04       3.4
"as.character"          0.02      1.7       0.02       1.7
"gc"                    0.02      1.7       0.02       1.7
"options"               0.02      1.7       0.02       1.7
"system.time"           0.00      0.0       1.18     100.0

$by.total
                   total.time total.pct self.time self.pct
"system.time"            1.18     100.0      0.00      0.0
".packages"              1.16      98.3      0.14     11.9
".readRDS"               0.64      54.2      0.34     28.8
"gzfile"                 0.16      13.6      0.12     10.2
"file.exists"            0.14      11.9      0.14     11.9
"close"                  0.14      11.9      0.10      8.5
"grepl"                  0.10       8.5      0.08      6.8
"$"                      0.08       6.8      0.08      6.8
"file.path"              0.06       5.1      0.06      5.1
"close.connection"       0.04       3.4      0.04      3.4
"getOption"              0.04       3.4      0.02      1.7
"as.character"           0.02       1.7      0.02      1.7
"gc"                     0.02       1.7      0.02      1.7
"options"                0.02       1.7      0.02      1.7

$sampling.time
[1] 1.18

I'd like to know what is happening in the 35.589  - 1.18 seconds, and 
the profiler won't tell me.

About the time spent by grepl, we could take this down by calling it 
once instead of many times

 > system.time( grepl( valid_package_version_regexp, versions ) )
   user  system elapsed
  0.003   0.000   0.009
 > system.time( for(v in versions) grepl( valid_package_version_regexp, 
v ) )  
  user  system elapsed
  0.100   0.000   0.136

>> Consider this script below,
>
> Whether profiling works in other examples is beside the point here.
I'll make another thread on that.

-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr



More information about the R-devel mailing list