[R] Long-tail model in R ... anyone?

ocelma at iua.upf.edu
Wed Jul 4 19:25:45 CEST 2007


Dear all,

First, I would like to tell you that I've been using R for two days... (so
you can predict my knowledge of the language!).

Still, I managed to implement some stuff related to the Long-Tail model [1].
I did some tests with the data in table 1 (from [1]), and plotted figure 2
(from [1]). (See R code and CSV file at the end of the email)

Now, I'm stuck on the nonlinear regression fit of F(x). I got a nice error:
"
Error in nls(~F(r, N50, beta, alfa), data = dataset, start = list(N50 =
N50,  : singular gradient
"

And, yes, I've been looking for how to solve this (via this mailing list +
some googling), but I could not find a proper solution. That's why
I am asking the experts to help me! :-)

So, any help would be much appreciated...

Cheers, Oscar
[1] http://www.firstmonday.org/issues/issue12_5/kilkki/

PS: R code and CSV file

FILE: "data.R" (data taken from [1] Table 1, columns 1 and 2)
--8=<-------------------
"rank","cum_value"
10,     17396510
32,     31194809
96,     53447300
420,    100379331
1187,   152238166
24234,  432238757
91242,  581332371
294180, 650880870
1242185,665227287
-->=8-------------------

R CODE:

#
# F(x). The long-tail model
# Reference: http://www.firstmonday.org/issues/issue12_5/kilkki/
# Params:
#       x   :   Rank (either a single integer or a vector of ranks)
#       N50 :   the number of objects that cover half of the whole volume
#       beta:   total volume
#       alfa:   the factor that defines the form of the function
F <- function (x, N50, beta=1.0, alfa=0.49)
{
        xx <- as.numeric(x) # as.numeric() prevents overflow
        Fx = beta / ( (N50/xx)^alfa + 1 )
        Fx
}

# Read CSV file (rank, cum_value)
lt <- read.csv(file="data.R",head=TRUE,sep=",")

r <- lt$rank
v <- lt$cum_value
pcnt <- v/v[length(v)] *100 # get cumulative percentage
plot(r, pcnt, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="The long-tail effect", col='blue')

# Set some default values to be used by F(x)...
alfa = 0.49
beta = 1.38
N50 = 30714

# Start using F(x). Results are in 'f' ...
f <- c(0) # oops! is this the best initialization for 'f'?
for (i in 1:24234) f[i] <- F(i, N50, beta, alfa)*100

# Plot some estimated values from F(x) (N50, beta, and alfa values come
# from the paper. See ref. [1])
plot(f, log="x", type='l', xlab='Ranking',
     ylab='Cumulative percentage of sales', main="Books Popularity",
     sub="Plotting first values of F(x) and some real points")
points(r, pcnt, col="blue") # adding the "real" points

# Create a dataset to be used by nls()
dataset <- data.frame(r, pcnt)

# Verifying that F(x) works fine... (comparing with the "real" values
# contained in the dataset)

dataset
F(10, N50, beta, alfa) * 100
F(32, N50, beta, alfa) * 100
F(96, N50, beta, alfa) * 100
F(420, N50, beta, alfa) * 100
F(1187, N50, beta, alfa) * 100
F(24234, N50, beta, alfa) * 100
F(91242, N50, beta, alfa) * 100
F(294180, N50, beta, alfa) * 100
F(1242185, N50, beta, alfa) * 100

#dataset <- data.frame(pcnt) # which dataset should I use? Should I
# include the ranks in it?
nls( ~ F(r, N50, beta, alfa), data = dataset, start = list(N50=N50,
beta=beta, alfa=alfa), trace = TRUE )
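
A note on the nls() call above, offered as a sketch rather than a confirmed fix: the formula has no left-hand side, and as far as I understand nls() then treats the expression itself as the residual (response taken as zero), which may be what leads to the singular gradient. The example below uses a two-sided formula with the percentages as response. The names lt_model, pred_start and fit are hypothetical (lt_model is the same function as F(), renamed so it does not mask R's built-in F), the data frame just repeats the CSV values inline, and convergence from these starting values is assumed, not verified, hence the tryCatch().

```r
# Data from the post (Table 1 of [1], columns 1 and 2), inlined so the
# sketch runs without the CSV file
lt <- data.frame(
  rank = c(10, 32, 96, 420, 1187, 24234, 91242, 294180, 1242185),
  cum_value = c(17396510, 31194809, 53447300, 100379331, 152238166,
                432238757, 581332371, 650880870, 665227287)
)
lt$pcnt <- lt$cum_value / max(lt$cum_value) * 100  # cumulative percentage

# Same model as F() in the post, under a hypothetical name
lt_model <- function(x, N50, beta, alfa) beta / ((N50 / x)^alfa + 1)

# Starting values: the paper's parameters, as used in the post
start <- list(N50 = 30714, beta = 1.38, alfa = 0.49)

# The model is vectorised, so no loop or pre-allocated 'f' is needed
pred_start <- 100 * lt_model(lt$rank, start$N50, start$beta, start$alfa)

# Two-sided formula: response on the left, model on the right.
# tryCatch() because convergence is not guaranteed for this data.
fit <- tryCatch(
  nls(pcnt ~ 100 * lt_model(rank, N50, beta, alfa),
      data = lt, start = start),
  error = function(e) {
    message("nls failed: ", conditionMessage(e))
    NULL
  }
)
if (!is.null(fit)) print(coef(fit))
```

If the fit succeeds, predict(fit) gives the fitted percentages at the observed ranks, which can be added to the plot with lines(lt$rank, predict(fit)).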


