[R] RFC for package PopCon: a popularity contest for R and packages

Jeffrey Horner jeff.horner at vanderbilt.edu
Thu Feb 14 15:57:47 CET 2008


(I posted this to the R-devel list yesterday, but I thought others on 
this list would be interested, so sorry for those who get it twice.)

Hello all,

I've developed a prototype package called PopCon (short for popularity 
contest), a package for tracking the popularity of R and its packages. 
I'd like this work to be similar in spirit to the Debian package 
popularity-contest: http://popcon.debian.org/.

Once PopCon is loaded, it captures two kinds of information from the 
user and stores them in a cache: the names of the packages he/she 
loads, and the names of symbols requested from his/her code. Once the 
cache is full, the goal is to flush the data to a central server for 
storage, free for anyone to download and analyze. That's it. It's 
simple to use and works behind the scenes. You can get the prototype here:

http://biostat.mc.vanderbilt.edu/twiki/pub/Main/JeffreyHorner/PopCon_0.1.tar.gz

And note that flushing of the cache is NOT TURNED ON and IT WON'T 
FORWARD ANY DATA ANYWHERE! It only gets deleted.
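
The fill-then-flush behaviour described above can be sketched in a few lines of R. This is only an illustration and not PopCon's actual code: make_cache, its capacity argument, and the on_full callback are hypothetical names. The default callback does nothing, which mirrors the prototype's behaviour of deleting the data instead of forwarding it.

```r
# Hypothetical sketch of a bounded name cache; not PopCon's actual code.
make_cache <- function(capacity = 100L,
                       on_full = function(entries) invisible(NULL)) {
  buf <- character(0)
  list(
    add = function(name) {
      buf <<- c(buf, name)
      if (length(buf) >= capacity) {
        on_full(buf)            # the default callback sends nothing anywhere
        buf <<- character(0)    # the cached names are then discarded
      }
      invisible(NULL)
    },
    get = function() buf
  )
}

cache <- make_cache(capacity = 3L)
cache$add("search")
cache$add("::")
cache$add("package:cluster")    # the third entry fills the cache and empties it
length(cache$get())
```

A real flush would replace on_full with a function that uploads the names, which is exactly the part the prototype leaves switched off.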

So, I envision all the software and data generated and stored to be 
licensed under a GPL and a Creative Commons license, or even public domain.

Thoughts? I'm looking for volunteers, because there are many issues to 
hash out. Here are a few of them:

1. Obviously storing IP addresses or any bit of personal information is 
out, but I'm interested in generating a permanent random key of some 
sort so that data from the same R install can be tracked. I'm wondering 
if just MD5 hashing the combination of R version, platform, and IP 
address would be appropriate and reproducible per R install. The Debian 
package popularity-contest has the benefit of installing an '/etc' 
config file and generating the key once, while I'd like PopCon users to 
just call 'library(PopCon)' and do nothing else.

2. I'm willing to maintain the central server and work on the 
infrastructure, but help will definitely be needed. Also, if there's 
significant interest, maybe R core would be interested in this.

3. What exactly is PopCon tracking as far as symbol names go? It 
currently uses an R_ObjectTable object attached to the search path to 
capture names, but is this the best way? See 
http://www.omegahat.org/RObjectTables/. It's also replacing 
base::getHook to trap library loads.

4. What else would be interesting to track? Some folks have suggested 
various bits of R.Version() output.
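
As a rough illustration of the key idea in item 1, here is a minimal sketch that hashes the R version string and platform with tools::md5sum. The function name install_key and the "|" separator are assumptions made for this example, and the IP address is deliberately left out because base R has no portable way to obtain it.

```r
# Hypothetical sketch: derive a key from R version and platform only.
# The IP address mentioned in item 1 is intentionally not included.
install_key <- function(fields = c(R.version$version.string,
                                   R.version$platform)) {
  tmp <- tempfile()
  on.exit(unlink(tmp))
  # tools::md5sum() hashes files, so write the joined fields out first.
  writeBin(charToRaw(paste(fields, collapse = "|")), tmp)
  unname(tools::md5sum(tmp))
}

key <- install_key()
nchar(key)                       # an MD5 digest is 32 hex characters
identical(key, install_key())    # the same inputs give the same key
```

Note that a key built this way identifies a configuration, not a single install: two machines with the same R version and platform would produce the same value, and the key changes when R is upgraded.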

Here's what PopCon can currently do:

 > library(PopCon)
 > search()
  [1] ".GlobalEnv"        "package:PopCon"    ".pcUDB"
  [4] "package:stats"     "package:graphics"  "package:grDevices"
  [7] "package:utils"     "package:datasets"  "package:methods"
 [10] "Autoloads"         "package:base"

# Notice the search entry .pcUDB above. That's the R Object Table.

 > typeof(PopCon::getCache())
[1] "character"
 > PopCon::getCache()
[1] ".conflicts.OK" "search"        "::"

# Now the cache contains the name 'search', which I called above,
# and the double colon operator.

 > library(cluster)
 > any(PopCon::getCache() == 'package:cluster')
[1] TRUE

# Package names are represented in the PopCon cache just like
# their name on the search path.
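
For comparison, package attaches can also be recorded with R's documented setHook() and packageEvent() interface. This is a sketch of an alternative, not PopCon's own approach of replacing base::getHook. The names attached and record_attach are hypothetical, and splines is used only because it ships with R and is not attached by default.

```r
# Hypothetical sketch using setHook()/packageEvent(),
# not PopCon's approach of replacing base::getHook.
attached <- character(0)

record_attach <- function(pkgname, ...) {
  # Store the name in the same form it takes on the search path.
  attached <<- c(attached, paste0("package:", pkgname))
}

# Hooks are registered per package; "splines" is only an example.
setHook(packageEvent("splines", "attach"), record_attach)

library(splines)
any(attached == "package:splines")
```

The drawback is that a hook has to be registered for each package name in advance, which may be why the prototype intercepts getHook instead.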


 > PopCon::getCache()
   [1] ".conflicts.OK"            "search"
   [3] "::"                       "$.data.frame"
   [5] "$.default"                "$.data.frame"
   [7] "$.default"                "unique.integer"
   [9] "unique.numeric"           "$.data.frame"
  [11] "$.default"                "unique.integer"
  [13] "unique.numeric"           "unique.character"
  [15] "unique.integer"           "unique.numeric"
  [17] "close.gzfile"             "$.packageDescription2"
  [19] "$.default"                "$.data.frame"
  [21] "$.default"                "unique.integer"
  [23] "unique.numeric"           "unique.character"
  [25] "unique.integer"           "unique.numeric"
  [27] "close.gzfile"             "$.packageDescription2"
  [29] "$.default"                "unique.integer"
  [31] "unique.numeric"           "close.gzfile"
  [33] "names.simple.list"        "names.default"
  [35] "[.default"                "as.character.simple.list"
  [37] "as.vector.simple.list"    "as.vector.default"
  [39] "unique.character"         "$.packageDescription2"
  [41] "$.default"                ">=.R_system_version"
  [43] "Ops.R_system_version"     ">=.package_version"
  [45] "Ops.package_version"      ">=.numeric_version"
  [47] ">=.package_version"       "Ops.package_version"
  [49] ">=.numeric_version"       "unlist.R_system_version"
  [51] "unlist.package_version"   "unlist.numeric_version"
  [53] "unlist.default"           "unlist.package_version"
  [55] "unlist.numeric_version"   "unlist.default"
  [57] "as.list.R_system_version" "as.list.package_version"
  [59] "unique.integer"           "unique.numeric"
  [61] "as.list.R_system_version" "as.list.package_version"
  [63] "unique.integer"           "unique.numeric"
  [65] "as.list.package_version"  "unique.integer"
  [67] "unique.numeric"           "as.list.package_version"
  [69] "unique.integer"           "unique.numeric"
  [71] ">=.default"               "$.packageDescription2"
  [73] "$.default"                "<.R_system_version"
  [75] "Ops.R_system_version"     "<.package_version"
  [77] "Ops.package_version"      "<.numeric_version"
  [79] "unique.character"         "unlist.R_system_version"
  [81] "unlist.package_version"   "unlist.numeric_version"
  [83] "unlist.default"           "unlist.numeric_version"
  [85] "unlist.default"           "as.list.R_system_version"
...
# I've truncated the output here.
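
Since the cache is an ordinary character vector, it can be summarized with base R. The sketch below uses the first nine entries of the output above. The variable names are made up for the example.

```r
# The first nine entries copied from the cache output above.
names_seen <- c(".conflicts.OK", "search", "::", "$.data.frame", "$.default",
                "$.data.frame", "$.default", "unique.integer", "unique.numeric")

# Count how often each name was requested, most frequent first.
counts <- sort(table(names_seen), decreasing = TRUE)
head(counts, 3)

# Package loads can be told apart from symbol names by their prefix.
is_pkg <- grepl("^package:", names_seen)
```
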

But you get the idea. Any and all comments welcome.

Jeff
-- 
http://biostat.mc.vanderbilt.edu/JeffreyHorner


