[R] Size of R user base

Gabor Grothendieck ggrothendieck at myway.com
Wed Apr 21 20:03:09 CEST 2004


Philippe, intriguing analysis.

Let's pursue this, but assume that instead of the monthly archive size
reaching a plateau, the cumulative archive size reaches a limit at which
point it ceases to grow and the lifetime of R is effectively over.  We
can fit a logistic growth curve to the cumulative KB size as shown
below.  From it we reach a number of interesting conclusions.  The
archive will grow to 40MB, which represents a cumulative total of about
88,000 messages.  Note that this is just about double what we have seen
to date, which means that half the messages that will ever be in the
archive are there now, and we are just about at the point of inflection
in R's lifespan.  Furthermore, the time in months for R to grow from 10%
of its plateau archive size to 90% is log(81)/coef(res)[3], which is 73
months.  That is the period in which R can be expected to be the most
vibrant, i.e. from the time it began with some momentum to the point
where it starts saturating out.

   tt <- seq(84)
   res <- nls(cumsum(RhelpUsage) ~ a/(1+exp(b-c*tt)),
              start=list(a=10000,b=1,c=.04))
   summary(res)

which gives:

Formula: cumsum(RhelpUsage) ~ a/(1 + exp(b - c * tt))

Parameters:
   Estimate Std. Error t value Pr(>|t|)    
a 3.996e+04  1.087e+03   36.77   <2e-16 ***
b 4.960e+00  1.831e-02  270.84   <2e-16 ***
c 6.038e-02  6.766e-04   89.24   <2e-16 ***
---
Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 

Residual standard error: 155.4 on 81 degrees of freedom

Correlation of Parameter Estimates:
         a      b
b -0.04244       
c -0.92249 0.4191
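
As a rough check on the figures above, the plateau, the inflection month
and the 10%-to-90% time can be recomputed directly from the printed
estimates.  This is only a sketch: the names c_rate, logistic, t_at and
msgs_per_kb are illustrative and not part of the original fit, the
coefficients are copied by hand from the summary rather than taken from
res, and 1949/886 is the March 2004 messages-per-KB ratio from the quoted
message.

```r
# Estimates copied from the summary(res) output above (KB and months).
a      <- 3.996e+04   # asymptote of the cumulative archive size, in KB
b      <- 4.960e+00   # offset term of the logistic curve
c_rate <- 6.038e-02   # growth rate per month (coefficient 'c' in the fit)

# Fitted curve: f(t) = a / (1 + exp(b - c*t))
logistic <- function(t) a / (1 + exp(b - c_rate * t))

# Month at which the curve reaches a fraction p of its plateau:
# f(t) = p*a  <=>  t = (b - log(1/p - 1)) / c
t_at <- function(p) (b - log(1 / p - 1)) / c_rate

t_inflection <- t_at(0.5)               # equals b / c
t_10_to_90   <- t_at(0.9) - t_at(0.1)   # equals log(81) / c

# Assumed conversion: March 2004 had 1949 messages in 886 KB.
msgs_per_kb <- 1949 / 886
total_msgs  <- a * msgs_per_kb

print(round(c(inflection_month = t_inflection,
              months_10_to_90  = t_10_to_90,
              total_messages   = total_msgs)))
```

With these estimates the inflection falls near month 82 of the 84
observed, which is why the series appears to be right at its midpoint.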


Philippe Grosjean <phgrosjean <at> sciviews.org> writes:

: 
: Thank you, Gabor, for these stats. Here is what I did with it.
: 
: Philippe Grosjean
: 
: ===============
: # This is monthly R-help usage as given by the size of gzipped archives
: # over the last 7 years
: RhelpUsage <- ts(c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50,
:                    55, 41, 49, 50, 28, 53, 42, 81, 54, 99, 60, 84,
:                    80, 76, 75, 78, 61, 83, 97, 141, 122, 96, 144, 173,
:                    153, 226, 202, 131, 165, 183, 175, 168, 187, 240, 272, 262,
:                    195, 236, 244, 285, 249, 326, 345, 392, 268, 455, 320, 418,
:                    453, 468, 422, 447, 400, 323, 516, 478, 327, 450, 487, 535,
:                    658, 573, 606, 659, 543, 655, 722, 677, 567, 519, 703, 886),
:                  start = 1997.25, deltat = 1/12)
: time(RhelpUsage)
: plot(RhelpUsage)
: 
: # OK, log() is probably a good transformation, given heteroscedasticity
: # (looks like multiplicative error)
: LogRhU <- log(RhelpUsage)
: plot(LogRhU)
: 
: # Hmm... maybe a monthly sample is not the best interval (i.e., the one
: # that optimizes the signal/noise ratio)?
: # I can check that using information theory, thanks to the turnogram()
: # function in pastecs:
: library(pastecs)
: # Info according to turning points with different intervals
: RhU.turno <- turnogram(RhelpUsage, FUN = sum)
: plot(RhU.turno)
: summary(RhU.turno)
: # Clearly, the signal/noise ratio is optimal for semester archives, let's
: # extract such a series...
: RhU6 <- extract(RhU.turno)
: plot(RhU6)
: # ... and
: LogRhU6 <- log(RhU6)
: plot(LogRhU6)
: 
: # Well, we are obviously still in the ascending phase!
: LogRhU6.lm <- lm(LogRhU6 ~ time(LogRhU6))
: summary(LogRhU6.lm)
: abline(coef = coef(LogRhU6.lm), col = 2)
: 
: plot(LogRhU6)	# Not linear...
: 
: # ... I don't much like polynomial regression
: # I prefer to fit a simple asymptotic growth model
: df1 <- data.frame(time=as.vector(time(LogRhU6)), load=as.vector(LogRhU6))
: LogRhU6.nls <- nls(load ~ SSasympOff(time, Asym, lrc, c0), data = df1)
: summary(LogRhU6.nls)
: 
: plot(df1$time, df1$load)
: lines(df1$time, predict(LogRhU6.nls), col = 2)
: # Not too bad! (well, nls ignores autocorrelation, but let's pretend it is
: # correct)
: 
: # A graph in estimated number of messages (according to ratio messages /
: # gzip size in March 2004):
: plot(df1$time, exp(df1$load) * 1949 / 886, xlab = "time (years)",
:      ylab = "nbr. of messages / semester",
:      main = "R-help mailing list usage (estimation)")
: lines(df1$time, exp(predict(LogRhU6.nls)) * 1949 / 886, col = 2)
: abline(v = 1998:2003, col= "gray", lty = 2)
: abline(h = (1:9)*1000, col= "gray", lty = 2)
: # I like this graph and it makes me feel how fast the activity
: # of R-help mailing list increases. In the near future, I should
: # remember to switch to a "digest" mode of this list!
: 
: # Now... let's be silly and let's do some stupid extrapolations:
: # 1) according to the model, the R-help mailing list started in:
: coef(LogRhU6.nls)[3] # April 1992 (this model is notorious for bad
: # estimates of the initial date for growth)
: 
: # 2) Asymptotic maximum monthly number of messages in R-help list,
: #    using the ratio 1949/886 = 2.2 messages per KB (March 2004)
: exp(coef(LogRhU6.nls)[1]) / 6 * 1949 / 886
: # that is 25,000 monthly messages (OK, really, really stupid... that's just
: # for fun!)
: 
: # For the rest, I am not good at all,...
: # and I am not an "extrapolator", so do not expect me to predict the
: # number of messages in R-help mailing list for the next year or so!
: =================
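
For readers without pastecs, the semester series used in the quoted code
can be approximated in base R.  This is a sketch under assumptions:
aggregate() simply sums consecutive 6-month blocks starting at the first
observation, which may not match the intervals chosen by
extract(RhU.turno), and the names RhU6_base, semester_kb, semester_time
and yearly_growth are illustrative.

```r
# Monthly gzipped archive sizes in KB, April 1997 to March 2004
# (one row per 12 months), as given in the quoted message.
RhelpUsage <- ts(c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50,
                   55, 41, 49, 50, 28, 53, 42, 81, 54, 99, 60, 84,
                   80, 76, 75, 78, 61, 83, 97, 141, 122, 96, 144, 173,
                   153, 226, 202, 131, 165, 183, 175, 168, 187, 240, 272, 262,
                   195, 236, 244, 285, 249, 326, 345, 392, 268, 455, 320, 418,
                   453, 468, 422, 447, 400, 323, 516, 478, 327, 450, 487, 535,
                   658, 573, 606, 659, 543, 655, 722, 677, 567, 519, 703, 886),
                 start = 1997.25, deltat = 1/12)

# Sum each block of 6 consecutive months (base-R stand-in for the
# pastecs extraction; 84 months give 14 semesters).
RhU6_base <- aggregate(RhelpUsage, nfrequency = 2, FUN = sum)

# Log-linear trend on the semester series, mirroring the quoted lm() step.
semester_kb   <- as.numeric(RhU6_base)
semester_time <- as.numeric(time(RhU6_base))
fit <- lm(log(semester_kb) ~ semester_time)

# Implied relative growth per year under the log-linear model.
yearly_growth <- exp(coef(fit)[["semester_time"]]) - 1
print(RhU6_base)
print(yearly_growth)
```

Summing rather than averaging keeps the units in KB per semester, which
is why the quoted code later divides by 6 to get back to a monthly figure.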
: 
: -----Original Message-----
: From: r-help-bounces <at> stat.math.ethz.ch
: [mailto:r-help-bounces <at> stat.math.ethz.ch]On Behalf Of Gabor Grothendieck
: Sent: Wednesday, 21 April, 2004 15:14
: To: r-help <at> stat.math.ethz.ch
: Subject: Re: [R] Size of R user base
: 
: 
: Philippe Grosjean <phgrosjean <at> sciviews.org> writes:
: > We have also the activity in the R-help mailing list, which could be
: > representative of the most active users, certainly. Does anyone have a
: > figure of the number of messages in R-Help with time since its creation?
: > (it is probably available somewhere, but I don't know where).
: 
: If you check the r-help archive for last month
: 
: https://www.stat.math.ethz.ch/pipermail/r-help/2004-March/date.html
: 
: at CRAN it says at the top there were 1949 messages for March.
: 
: Looking at
: 
: 	https://www.stat.math.ethz.ch/pipermail/r-help/
: 
: it shows the gzipped size of each month's archive and from that
: March had 886 KB of gzipped text from which we can estimate 1949/886
: = 2.2 messages per KB.  The gzipped KB for each successive month over
: the last 84 months were:
: 
: 55  19  19  18  19  17  35  27  47  55  32  50  55  41  49  50  28  53  42
: 81  54  99  60  84  80  76  75  78  61  83  97 141 122  96 144 173 153 226
: 202 131 165 183 175 168 187 240 272 262 195 236 244 285 249 326 345 392 268
: 455 320 418 453 468 422 447 400 323 516 478 327 450 487 535 658 573 606 659
: 543 655 722 677 567 519 703 886
: 
: where the last point, 886 KB, is March 2004.  Summing those numbers and
: using 2.2 messages per KB gives an estimate of about 50,000 messages
: over that period of time.
: 
: Fitting a log linear model to those numbers gives:
: 
: 	log(KB) = 3.2 + .043 i
: 
: where i is the month number which indicates that the archive size
: (and hence the number of messages and possibly the user base) is
: growing at 4% per month!
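
The estimates in the quoted message (messages per KB, the cumulative
total, and the monthly growth rate) can be rechecked from the 84 monthly
sizes.  A minimal sketch, assuming the month index i runs from 1 to 84
and that the log-linear model was fitted by ordinary least squares; the
variable names are illustrative:

```r
# Monthly gzipped archive sizes in KB, oldest first; the last value
# (886) is March 2004.
kb <- c(55, 19, 19, 18, 19, 17, 35, 27, 47, 55, 32, 50,
        55, 41, 49, 50, 28, 53, 42, 81, 54, 99, 60, 84,
        80, 76, 75, 78, 61, 83, 97, 141, 122, 96, 144, 173,
        153, 226, 202, 131, 165, 183, 175, 168, 187, 240, 272, 262,
        195, 236, 244, 285, 249, 326, 345, 392, 268, 455, 320, 418,
        453, 468, 422, 447, 400, 323, 516, 478, 327, 450, 487, 535,
        658, 573, 606, 659, 543, 655, 722, 677, 567, 519, 703, 886)
i <- seq_along(kb)            # assumed month index, 1..84

# Messages per KB, from 1949 messages in 886 KB for March 2004.
msgs_per_kb <- 1949 / 886

# Cumulative archive size and the implied message count.
total_kb <- sum(kb)
est_msgs <- total_kb * msgs_per_kb

# Log-linear model: log(KB) = intercept + slope * i
fit <- lm(log(kb) ~ i)
slope <- coef(fit)[["i"]]

# A slope on the log scale corresponds to this relative growth per month.
monthly_growth <- exp(slope) - 1

print(c(total_kb = total_kb, est_msgs = round(est_msgs)))
print(coef(fit))
print(monthly_growth)
```

Note that exp(slope) - 1 is slightly larger than the slope itself, so a
coefficient of .043 corresponds to a little over 4% growth per month.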
: 
: P.S.  The following web page gives the number of messages per day over
: the last few days as a graph:
: 
: 	http://gmane.org/info.php?group=gmane.comp.lang.r.general



