[R] Size of R user base
Philippe Grosjean
phgrosjean at sciviews.org
Wed Apr 21 17:27:59 CEST 2004
Thank you, Gabor, for these stats. Here is what I did with them.
Philippe Grosjean
===============
# This is monthly R-help usage as given by the size of gzipped archives over
# the last 7 years
RhelpUsage <- ts(c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50, 55, 41,
49, 50, 28, 53, 42,
81, 54, 99, 60, 84, 80, 76, 75, 78, 61, 83, 97, 141, 122,
96, 144, 173, 153, 226,
202, 131, 165, 183, 175, 168, 187, 240, 272, 262, 195, 236,
244, 285, 249, 326, 345, 392, 268,
455, 320, 418, 453, 468, 422, 447, 400, 323, 516, 478, 327,
450, 487, 535, 658, 573, 606, 659,
543, 655, 722, 677, 567, 519, 703, 886), start = 1997.25,
deltat = 1/12)
time(RhelpUsage)
plot(RhelpUsage)
# OK, log() is probably a good transformation, given heteroscedasticity
# (looks like multiplicative error)
LogRhU <- log(RhelpUsage)
plot(LogRhU)
# Hmm... maybe a monthly sample is not the best interval (i.e., the one that
# optimizes the signal/noise ratio)?
# I can check that using information theory, thanks to the turnogram()
# function in pastecs:
library(pastecs)
# Info according to turning points with different intervals
RhU.turno <- turnogram(RhelpUsage, FUN = sum)
plot(RhU.turno)
summary(RhU.turno)
# Clearly, the signal/noise ratio is optimal for semester archives; let's
# extract such a series...
RhU6 <- extract(RhU.turno)
plot(RhU6)
# ... and
LogRhU6 <- log(RhU6)
plot(LogRhU6)
# Well, we are obviously still in the ascending phase!
LogRhU6.lm <- lm(LogRhU6 ~ time(LogRhU6))
summary(LogRhU6.lm)
plot(LogRhU6)
abline(coef = coef(LogRhU6.lm), col = 2) # Not linear...
# ... I don't much like polynomial regression;
# I prefer to fit a simple asymptotic growth model
df1 <- data.frame(time=as.vector(time(LogRhU6)), load=as.vector(LogRhU6))
LogRhU6.nls <- nls(load ~ SSasympOff(time, Asym, lrc, c0), data = df1)
summary(LogRhU6.nls)
plot(df1$time, df1$load)
lines(df1$time, predict(LogRhU6.nls), col = 2)
# Not too bad! (well, nls ignores autocorrelation, but let's pretend it is
# correct)
# A graph of the estimated number of messages (according to the ratio
# messages / gzip size in March 2004):
plot(df1$time, exp(df1$load) * 1949 / 886, xlab = "time (years)",
     ylab = "nbr. of messages / semester",
     main = "R-help mailing list usage (estimation)")
lines(df1$time, exp(predict(LogRhU6.nls)) * 1949 / 886, col = 2)
abline(v = 1998:2003, col= "gray", lty = 2)
abline(h = (1:9)*1000, col= "gray", lty = 2)
# I like this graph and it makes me feel how fast the activity
# of R-help mailing list increases. In the near future, I should
# remember to switch to a "digest" mode of this list!
# Now... let's be silly and let's do some stupid extrapolations:
# 1) according to the model, the R-help mailing list started in:
coef(LogRhU6.nls)[3] # April 1992
# (this model is notorious for badly estimating the initial date of growth)
# 2) The asymptotic maximum monthly number of messages on the R-help list,
# given the ratio of 1949/886 = 2.2 messages per KB (March 2004), is:
exp(coef(LogRhU6.nls)[1]) / 6 * 1949 / 886
# that is 25,000 monthly messages (OK, really, really stupid... that's just
# for fun!)
# For the rest, I am not good at all,...
# and I am not an "extrapolator", so do not expect me to predict the
# number of messages on the R-help mailing list for the next year or so!
=================
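For readers without pastecs, here is a minimal base-R sketch of two steps from the script above. It assumes that the series extracted from the turnogram is simply the sum over six-month blocks (the post reports a semester interval and uses FUN = sum), and it uses aggregate() as a stand-in for extract(); it is not the turnogram method itself. It also spells out the curve behind SSasympOff() with made-up parameter values, to show why c0 reads as a "start date" (the time at which the fitted log size is zero) and Asym as a ceiling on the log scale. The names kb, msg6 and asymp_off are illustrative only.

```r
# Monthly gzipped archive sizes in KB, April 1997 to March 2004 (from the post)
kb <- c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50, 55, 41,
        49, 50, 28, 53, 42,
        81, 54, 99, 60, 84, 80, 76, 75, 78, 61, 83, 97, 141, 122,
        96, 144, 173, 153, 226,
        202, 131, 165, 183, 175, 168, 187, 240, 272, 262, 195, 236,
        244, 285, 249, 326, 345, 392, 268,
        455, 320, 418, 453, 468, 422, 447, 400, 323, 516, 478, 327,
        450, 487, 535, 658, 573, 606, 659,
        543, 655, 722, 677, 567, 519, 703, 886)
RhelpUsage <- ts(kb, start = 1997.25, frequency = 12)

# Stand-in for extract(RhU.turno): sum each block of six months
# (blocks run April-September and October-March)
RhU6 <- aggregate(RhelpUsage, nfrequency = 2, FUN = sum)
length(RhU6)  # 14 semesters

# Estimated messages per semester, using the March 2004 ratio 1949 / 886
msg6 <- RhU6 * 1949 / 886

# SSasympOff(x, Asym, lrc, c0) computes Asym * (1 - exp(-exp(lrc) * (x - c0))).
# The parameter values below are illustrative, NOT the fitted ones.
asymp_off <- function(x, Asym, lrc, c0) {
  Asym * (1 - exp(-exp(lrc) * (x - c0)))
}
x <- seq(1992, 2010, by = 0.5)
manual  <- asymp_off(x, Asym = 10, lrc = -2, c0 = 1992)
builtin <- as.numeric(SSasympOff(x, 10, -2, 1992))
all.equal(manual, builtin)
```

Because the model is fitted to log(KB per semester), exp(Asym) is the asymptotic semester size in KB, which is why the script divides by 6 and multiplies by 1949 / 886 to get monthly messages.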
-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch]On Behalf Of Gabor Grothendieck
Sent: Wednesday, 21 April, 2004 15:14
To: r-help at stat.math.ethz.ch
Subject: Re: [R] Size of R user base
Philippe Grosjean <phgrosjean <at> sciviews.org> writes:
> We have also the activity in the R-help mailing list, which could be
> representative of the most active users, certainly. Does anyone have a
> figure of the number of messages in R-Help with time since its creation?
> (it is probably available somewhere, but I don't know where).
If you check the r-help archive for last month
https://www.stat.math.ethz.ch/pipermail/r-help/2004-March/date.html
at CRAN it says at the top there were 1949 messages for March.
Looking at
https://www.stat.math.ethz.ch/pipermail/r-help/
it shows the gzipped size of each month's archive, and from that
March had 886 KB of gzipped text, from which we can estimate 1949/886
= 2.2 messages per KB. The gzipped sizes in KB for successive months
over the last 84 months were:
55 19 19 18 19 17 35 27 47 55 32 50 55 41 49 50 28 53 42
81 54 99 60 84 80 76 75 78 61 83 97 141 122 96 144 173 153 226
202 131 165 183 175 168 187 240 272 262 195 236 244 285 249 326 345 392 268
455 320 418 453 468 422 447 400 323 516 478 327 450 487 535 658 573 606 659
543 655 722 677 567 519 703 886
where the last point, 886 KB, is March 2004. Summing those numbers and
using 2.2 messages per KB gives an estimate of about 50,000 messages
over that period of time.
Fitting a log linear model to those numbers gives:
log(KB) = 3.2 + .043 i
where i is the month number which indicates that the archive size
(and hence the number of messages and possibly the user base) is
growing at 4% per month!
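The two estimates above can be reproduced with a short sketch along these lines. It assumes that i runs from 1 (April 1997) to 84 (March 2004), which the message does not state explicitly, and the variable names are illustrative. The fitted coefficients should come out close to the 3.2 and .043 quoted above, though the exact values depend on how i is indexed.

```r
# Gzipped archive sizes in KB for the 84 months ending March 2004
kb <- c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50, 55, 41, 49, 50, 28, 53, 42,
        81, 54, 99, 60, 84, 80, 76, 75, 78, 61, 83, 97, 141, 122, 96, 144, 173, 153, 226,
        202, 131, 165, 183, 175, 168, 187, 240, 272, 262, 195, 236, 244, 285, 249, 326, 345, 392, 268,
        455, 320, 418, 453, 468, 422, 447, 400, 323, 516, 478, 327, 450, 487, 535, 658, 573, 606, 659,
        543, 655, 722, 677, 567, 519, 703, 886)

# Messages per KB, taken from March 2004 only (1949 messages, 886 KB)
ratio <- 1949 / 886

# Rough total number of messages over the whole period
total_msgs <- sum(kb) * ratio

# Log-linear model: log(KB) = a + b * i
i <- seq_along(kb)       # assumed month index, 1 = April 1997
fit <- lm(log(kb) ~ i)
coef(fit)

# Monthly growth rate implied by the slope b
growth <- exp(coef(fit)[["i"]]) - 1
```

Note that a slope of .043 on the log scale corresponds to exp(.043) - 1, or about 4.4% per month, so "4% per month" is a rounded figure.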
P.S. The following web page gives the number of messages per day over
the last few days as a graph:
http://gmane.org/info.php?group=gmane.comp.lang.r.general
______________________________________________
R-help at stat.math.ethz.ch mailing list
https://www.stat.math.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html