[Rd] execution time of .packages

Romain Francois romain.francois at dbmail.com
Tue Mar 3 13:39:10 CET 2009


Prof Brian Ripley wrote:
> Let me repeat: what is happening for me in the equivalent of your 
> 35.589 - 1.18 seconds is that R is waiting for my OS to read its discs 
> (and they can be heard chuntering away).  As the R process is not 
> runniing at those times, the profiler is not running either (on a 
> Unix-alike: on Windows the profiler does measure elapsed time).  I 
> expect it will be the same explanation for you.
Thank you. I get it this time.

> What I have already suggested is that if you want to save time, do not 
> read and check the package.rds files.   As far as I can see they were 
> checked at installation in any recent version of R.  Just check their 
> existence.
>
> On Tue, 3 Mar 2009, Romain Francois wrote:
>
>> Prof Brian Ripley wrote:
>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>
>>>> Prof Brian Ripley wrote:
>>>>> The caching is in the disc system: you need to find and read the 
>>>>> package metadata for every package.  AFAIK it is not easy to flush 
>>>>> the disc cache, but quite easy to overwrite it with later reads.  
>>>>> (Google for more info.)
>>>> Thanks for the info, I'll try to find my way with these directions.
>>>>
>>>>> If you are not concerned about validity of the installed packages 
>>>>> you could skip the tests and hence the reads.
>>>>>
>>>>> Your times are quite a bit slower than mine, so a faster disc 
>>>>> system might help.  Since my server has just been rebooted (for a 
>>>>> new kernel), with all of CRAN and most of BioC I get
>>>>>
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>>    user  system elapsed
>>>>>   0.518   0.262  25.042
>>>>>> system.time( packs <- .packages( all = T ) )
>>>>>    user  system elapsed
>>>>>   0.442   0.080   0.522
>>>>>> length(packs)
>>>>> [1] 2096
>>>>>
>>>>> There's a similar issue when installing packages: the Perl code 
>>>>> reads the indices from every visible package to resolve links, and 
>>>>> that can be slow the first time.
>>>>>
>>>>>
>>>>> On Tue, 3 Mar 2009, Romain Francois wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> The first time in a session I call .packages( all.available = T 
>>>>>> ), it takes a long time (I have many packages installed from CRAN):
>>>>>>
>>>>>>> system.time( packs <- .packages( all = T ) )
>>>>>> user  system elapsed
>>>>>> 0.738   0.276  43.787
>>>>>>
>>>>>> When I call it again, the time is now much reduced, so there must 
>>>>>> be some caching somewhere. I would like to try to reduce the  
>>>>>> time it takes the first time, but I have not been able to 
>>>>>> identify where the caching takes place, and so how I can remove 
>>>>>> it to try to improve the running time without the caching. 
>>>>>> Without this, I have to restart my computer each time to vanish 
>>>>>> the caching to test a new version of the function (this is not 
>>>>>> going to happen)
>>>>>>
>>>>>> Here is the .packages function, I am suspicious about this part : 
>>>>>> "ans <- c(ans, nam)" which grows the ans vector each time a 
>>>>>> suitable package is found, this does not sound right.
>>>>>
>>>>> It's OK as there are only going to be ca 2000 packages.  Try 
>>>>> profiling this: .readRDS and grepl take most of the time.
>>>
>>>> I usually do not trust the result of the profiler when a for loop 
>>>> is involved, as it tends to miss the point (or maybe I am).
>>>
>>> Here are the data for the actual example (repeated for this message):
>>>
>>>> Rprof()
>>>> system.time( packs <- .packages( all = T ) )
>>>    user  system elapsed
>>>   0.447   0.078   0.525
>>>> Rprof(NULL)
>>>> summaryRprof()
>>> $by.self
>>>                    self.time self.pct total.time total.pct
>>> "grepl"                 0.18     34.6       0.18      34.6
>>> ".readRDS"              0.12     23.1       0.20      38.5
>>> ".packages"             0.08     15.4       0.50      96.2
>>> "close.connection"      0.04      7.7       0.04       7.7
>>> "close"                 0.02      3.8       0.06      11.5
>>> "file.exists"           0.02      3.8       0.02       3.8
>>> "gc"                    0.02      3.8       0.02       3.8
>>> "gzfile"                0.02      3.8       0.02       3.8
>>> "list"                  0.02      3.8       0.02       3.8
>>> "system.time"           0.00      0.0       0.52     100.0
>>> "file.path"             0.00      0.0       0.02       3.8
>>>
>>> $by.total
>>>                    total.time total.pct self.time self.pct
>>> "system.time"            0.52     100.0      0.00      0.0
>>> ".packages"              0.50      96.2      0.08     15.4
>>> ".readRDS"               0.20      38.5      0.12     23.1
>>> "grepl"                  0.18      34.6      0.18     34.6
>>> "close"                  0.06      11.5      0.02      3.8
>>> "close.connection"       0.04       7.7      0.04      7.7
>>> "file.exists"            0.02       3.8      0.02      3.8
>>> "gc"                     0.02       3.8      0.02      3.8
>>> "gzfile"                 0.02       3.8      0.02      3.8
>>> "list"                   0.02       3.8      0.02      3.8
>>> "file.path"              0.02       3.8      0.00      0.0
>>>
>>> $sampling.time
>>> [1] 0.52
>>>
>>> there is little tiime unaccounted for, and 0.38 sec is going in 
>>> .readRDS and grepl.  Whereas
>>>
>>> system.time({
>>> ans <- character(0)
>>> for(i in 1:2096) ans <- c(ans, "foo")
>>> })
>>>
>>> takes 0.024 secs, negligible here (one profiler tick).
>> Here is what happens to me if I restart the computer:
>>
>>> Rprof( )
>>> system.time( packs <- .packages( all = T ) )
>>  user  system elapsed
>> 0.888   0.342  35.589
>>> Rprof(NULL)
>>> summaryRprof()
>> $by.self
>>                  self.time self.pct total.time total.pct
>> ".readRDS"              0.34     28.8       0.64      54.2
>> ".packages"             0.14     11.9       1.16      98.3
>> "file.exists"           0.14     11.9       0.14      11.9
>> "gzfile"                0.12     10.2       0.16      13.6
>> "close"                 0.10      8.5       0.14      11.9
>> "grepl"                 0.08      6.8       0.10       8.5
>> "$"                     0.08      6.8       0.08       6.8
>> "file.path"             0.06      5.1       0.06       5.1
>> "close.connection"      0.04      3.4       0.04       3.4
>> "getOption"             0.02      1.7       0.04       3.4
>> "as.character"          0.02      1.7       0.02       1.7
>> "gc"                    0.02      1.7       0.02       1.7
>> "options"               0.02      1.7       0.02       1.7
>> "system.time"           0.00      0.0       1.18     100.0
>>
>> $by.total
>>                  total.time total.pct self.time self.pct
>> "system.time"            1.18     100.0      0.00      0.0
>> ".packages"              1.16      98.3      0.14     11.9
>> ".readRDS"               0.64      54.2      0.34     28.8
>> "gzfile"                 0.16      13.6      0.12     10.2
>> "file.exists"            0.14      11.9      0.14     11.9
>> "close"                  0.14      11.9      0.10      8.5
>> "grepl"                  0.10       8.5      0.08      6.8
>> "$"                      0.08       6.8      0.08      6.8
>> "file.path"              0.06       5.1      0.06      5.1
>> "close.connection"       0.04       3.4      0.04      3.4
>> "getOption"              0.04       3.4      0.02      1.7
>> "as.character"           0.02       1.7      0.02      1.7
>> "gc"                     0.02       1.7      0.02      1.7
>> "options"                0.02       1.7      0.02      1.7
>>
>> $sampling.time
>> [1] 1.18
>>
>> I'd like to know what is happening in the 35.589  - 1.18 seconds, and 
>> the profiler won't tell me.
>>
>> About the time spent by grepl, we could take this down by calling it 
>> once instead of many times
>>
>>> system.time( grepl( valid_package_version_regexp, versions ) )
>>  user  system elapsed
>> 0.003   0.000   0.009
>>> system.time( for(v in versions) grepl( valid_package_version_regexp, 
>>> v ) ) 
>> user  system elapsed
>> 0.100   0.000   0.136
>>
>>>> Consider this script below,
>>>
>>> Whether profiling works in other examples is beside the point here.
>> I'll make another thread on that.
>>
>> -- 
>> Romain Francois
>> Independent R Consultant
>> +33(0) 6 28 91 30 30
>> http://romainfrancois.blog.free.fr
>>
>>
>


-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr



More information about the R-devel mailing list