[R] interpreting profiling output
Kasper Daniel Hansen
k.hansen at biostat.ku.dk
Fri Jul 16 21:59:57 CEST 2004
I have some trouble interpreting the output from profiling. I have
read the help pages Rprof, summaryRprof and consult the R extensions
manual, but I still have problems understanding the output.
Basically the output consist of self.time and total.time. I have the
understanding that total.time is the time spent in a given function
including any subcalls or child functions or whatever the technical
term for that activity is. In contrasts self.time is the time spent in
the function excluding subcalls.
Now, in my understanding basically everything in R is functions. I
would then guess that for almost any function (except the "atomic
ones") the self.time would be very small as it would spend most of its
time calling other functions (again, since almost everything is a
function). So how do R determine when a subfunction is called?
Looking at a practical example, which basically consists of a model
search, with each model being evaluated by a nnet fit, I get (using
the default value of interval=0.02 - the output is basically similar
for interval = 0.01)
> prof02.out$by.total[1:10,]
total.time total.pct self.time self.pct
cv.risk.dsa2 2373.50 100.0 0.00 0.0
validation.risk.dsa2 2373.50 100.0 0.00 0.0
avg.pred.nnet 2370.98 99.9 4.36 1.5
dsa.2 2337.32 98.5 0.00 0.0
best.move.dsa2 2335.22 98.4 0.06 0.0
avg.nnet 2335.10 98.4 0.04 0.0
nnet 2188.22 92.2 1.74 0.6
nnet.default 2186.48 92.1 33.32 11.2
sub.dsa2 1466.32 61.8 0.18 0.1
del.dsa2 560.36 23.6 0.02 0.0
This tells me that the majority (92%) is spent in the nnet
fit. Therefore it seems that the only real improvement to the code
would be to change the algorithm in order to call nnet fewer
times. The actual search code only amounts to a little amout of the
total time.
Looking at sampling.time, the fourth column (total.time in percent) is
the third column (total.time) divided by sampling.time (which makes
sense).
My trouble really begins when I look at the self.time column.
> prof02.out$by.self[1:10,]
self.time self.pct total.time total.pct
nnet.default 33.32 11.2 2186.48 92.1
FUN 23.42 7.8 46.34 2.0
as.integer 22.28 7.5 30.92 1.3
mean.default 16.14 5.4 22.92 1.0
as.double.default 11.52 3.9 11.52 0.5
apply 10.22 3.4 60.80 2.6
as.double 10.14 3.4 30.24 1.3
rep.default 9.98 3.3 19.96 0.8
matrix 9.62 3.2 75.86 3.2
add.net 8.80 2.9 10.16 0.4
First of, I would guess that the sum of the self.time column
_ought_??? to be equal to the sampling time, which it is not
> sum(prof02.out$by.self[,1])
[1] 298.4
> prof02.out$sampling.time
[1] 2373.5
Why does such a large discrepancy happen?
Secondly, looking at the output for as.integer, there is a big
difference between total.time and self.time - but I would _guess_
as.integer to be a pretty basic function, which does not call any
subfunctions, and in that case I would expect total.time to be equal
to self.time.
In short : I think I grasp the total time concept, but I have a hard
time understanding self.time. Specifically
- Why is the sampling time so different from the sum of the self.time
- Which function calls (or operations) are considered to belong to a
given function's self.time and which are not.
- Why is total.time != self.time for the as.integer function.
I guess some of the answers would be clear if I had a firm grasp of
the inner workings of R :)
Thanks in advance, Kasper
--
Kasper Daniel Hansen, Research Assistant
Department of Biostatistics, University of Copenhagen
More information about the R-help
mailing list