[Rd] Feature request for 'sprintf' optimization (PR#9621)

mwtoews at sfu.ca mwtoews at sfu.ca
Thu Apr 19 05:20:22 CEST 2007


Full_Name: Michael Toews
Version: R-devel and 2.4.1
OS: Debian etch and WindowsXP
Submission from: (NULL) (142.58.206.114)


This is a quick demonstration of how slow 'sprintf' currently is on long
vectors, with a suggestion for a significant optimization.

First, consider a data.frame with numeric (double) values:

dat <- data.frame(year=as.numeric(rep(1970:2000,each=365)),
                  yday=as.numeric(1:365))
nrow(dat)

Now time 'sprintf' in R with and without casting the columns to integer:

wocast <- system.time(with(dat,sprintf("%04i-%03i",year,yday)))
wcast  <- system.time(with(dat,sprintf("%04i-%03i",as.integer(year),
                                                   as.integer(yday))))
100*wocast/wcast # as a percent comparison

My results on a Debian VM with R-devel (r41236) show elapsed ratios of 63408%,
and on Windows XP with R 2.4.1, 23300%. A similar but much longer data frame
(using 1900:2100 for year; nrow=73365) gives ratios of 120775%. Clearly, the
running time of the 'sprintf' wrapper depends not only on processor and
platform but, more significantly, on the data types of the '...' values passed
to the wrapper.

The first and simplest suggestion is to document in 'sprintf' that it is
significantly faster when the values are supplied in the data type that 'fmt'
expects, by casting them first (namely using 'as.integer'). However, to the
user it would seem that they have to specify the format twice (e.g., once with
'%i' and a second time with 'as.integer()').
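As a rough illustration of the first suggestion, the columns can be cast once,
up front, so that later 'sprintf' calls do not need 'as.integer()' each time.
This is only a sketch: 'dat_int' is a name made up for this example, and it
only checks that the output is unchanged, not how much time is saved.

```r
# Same layout as 'dat' above, rebuilt so the snippet is self-contained.
dat <- data.frame(year = as.numeric(rep(1970:2000, each = 365)),
                  yday = as.numeric(1:365))

# Cast the columns once ('dat_int' is a hypothetical name for this example).
dat_int <- transform(dat,
                     year = as.integer(year),
                     yday = as.integer(yday))

# Format the double columns and the integer columns with the same 'fmt'.
wocast_out <- with(dat,     sprintf("%04i-%03i", year, yday))
wcast_out  <- with(dat_int, sprintf("%04i-%03i", year, yday))

# The two character vectors should be the same; only the input type differs.
identical(wocast_out, wcast_out)
```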

A second and more elegant suggestion is for 'sprintf' (or the C code it calls)
to parse 'fmt' for the data types, and cast the values from '...' according to
those types before continuing with the wrapper call.
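The second suggestion could be prototyped at the R level roughly as follows.
This is a minimal sketch, not how 'sprintf' is implemented: 'sprintf_cast' and
'is_whole' are hypothetical names, 'fmt' is assumed to be a single string, and
'*' widths and '%n$' positional specifications are not handled.

```r
# Hypothetical helper: TRUE if every element can be cast to integer losslessly.
is_whole <- function(x) {
  plain_na <- is.na(x) & !is.nan(x)
  whole <- is.finite(x) & x == trunc(x) & abs(x) <= .Machine$integer.max
  all(plain_na | whole)
}

# Hypothetical wrapper: cast double arguments that 'fmt' formats as integers.
sprintf_cast <- function(fmt, ...) {
  args <- list(...)
  # Find each conversion specification; "%%" is matched so it can be dropped.
  specs <- regmatches(fmt, gregexpr("%[-+ #0-9.]*[a-zA-Z%]", fmt))[[1]]
  specs <- specs[specs != "%%"]
  # The last character of a specification is its conversion type.
  types <- substring(specs, nchar(specs))
  for (k in seq_len(min(length(args), length(types)))) {
    x <- args[[k]]
    if (types[k] %in% c("d", "i") && is.double(x) && is_whole(x)) {
      args[[k]] <- as.integer(x)
    }
  }
  do.call(sprintf, c(list(fmt), args))
}

sprintf_cast("%04i-%03i", 1970, 1)
```

The user then states the format only once, in 'fmt'. Whether this is actually
faster than a plain 'sprintf' call depends on the R version in use.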

(I have not looked at the source code, nor am I a good C programmer, so I can't
do more than suggest -- it is possible there could be alternative optimizations
in the wrapper, since the processing time is very dependent on the length of
the '...' vectors, and it might be evaluating the values repeatedly in a 'for'
loop.)
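To see how the time depends on the length of the '...' vectors, something like
the following could be run. It is only a sketch: 'time_sprintf', 'sizes' and
'timings' are names invented here, the sizes are arbitrary, and the cast is
done outside the timed region (unlike the 'system.time' calls above).

```r
# Hypothetical helper: mean elapsed seconds to format n rows.
time_sprintf <- function(n, cast = FALSE, reps = 5) {
  year <- as.numeric(rep(1970:2000, length.out = n))
  yday <- as.numeric(rep(1:365, length.out = n))
  if (cast) {
    # Cast before timing, so only the 'sprintf' call itself is measured.
    year <- as.integer(year)
    yday <- as.integer(yday)
  }
  elapsed <- system.time(
    for (r in seq_len(reps)) sprintf("%04i-%03i", year, yday)
  )[["elapsed"]]
  elapsed / reps
}

sizes <- c(1000, 10000, 100000)  # arbitrary vector lengths
timings <- data.frame(
  n      = sizes,
  wocast = sapply(sizes, time_sprintf, cast = FALSE),
  wcast  = sapply(sizes, time_sprintf, cast = TRUE)
)
timings
```

If the uncast timings grow much faster than linearly in 'n' while the cast
timings do not, that would be consistent with repeated work inside a loop.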

Thanks!
+mt


