[R] fitting a lognormal distribution using cumulative probabilities

Sat Feb 23 10:24:21 CET 2008

Your example code is asserting that the events occurred at the times in 
'obs.time', not before those times.

Surv(event = 1) means uncensored.  If you try event = 0, the fitter 
diverges towards an exact fit, as I said it should.

Yes, you will get a good fit to the ECDF, but you are modelling the 
observation times, not the event times (and asserting that a continuous 
distribution fits discrete data).

I suggest you seek local statistical help: these are conceptual rather 
than R issues.

(BTW, using a .com address suggests you are a COMpany and in the absence 
of a signature block that will influence how much free help you are 
offered.)

On Sat, 23 Feb 2008, ahimsa campos-arceiz wrote:

> Dear Prof Ripley and Dimitris (cc R-list),
>
> thank you very much for your very insightful responses.
>
> I've been checking how to use survreg{survival} to fit a left-censored
> lognormal distribution, and I was surprised to find that results are exactly
> the same as with fitdistr{MASS}. Here is an example with a larger dataset:
>
> #====================
> # data:
> data1 <- data.frame(obs.time=c(30,31.98,35.95,37.2,46.4,50.5,54,56,60,62.95,
>                  69.4,71.5,74,76,78,79.25,81.1,84.68,90,95.37,100),
>                  reps=c(5,47,80,18,20,32,29,8,29,2,7,8,1,4,3,2,3,3,1,2,1))
>
> data2 <- with(data1, data.frame(obs.time=rep(obs.time, reps),
>                                event=rep(1,sum(data1$reps))))
>
> # 1. using fitdistr to estimate lognormal parameters
> library(MASS)
> fit1 <- fitdistr(data2$obs.time, "lognormal"); fit1
> #     meanlog       sdlog
> #   3.80552970   0.29093294
> #  (0.01665877) (0.01177953)
>
> # 2. using survreg for left-censored estimation
> library(survival)
> fm <- survreg(Surv(obs.time,event,type="left")~1,
>      data=data2, dist="lognormal"); summary(fm)
> fm$coefficients[[1]]; fm$scale
> # [1] 3.80553
> # [1] 0.2909329
> # results are the identical !!
>
> # plotting ecdf and the fitted curve  =========
> # parameters
> fit1.mean <- fit1$estimate[[1]]
> fit1.sd <-  fit1$estimate[[2]]
>
> plot(ecdf(data2$obs.time), verticals=TRUE, do.p=FALSE, lwd=1.5,
>    xlim=c(0,116), ylab="pb", xlab="time")
> curve(plnorm(x, meanlog=fit1.mean, sdlog=fit1.sd), add=TRUE, col='red',
> lwd=3)
> # the fit looks good to me
>
> #============= end script =========================
>
> I guess this relates to the following sentences from Prof Ripley:
>
> ML fitting can be done for censored data.  However, I don't think
> you have a valid description here: it seems you never recorded a time at
> which the event had not happened, and the most likely fit is a probability
> mass at zero (since this is a perfect explanation for your data).
>
> (which I don't fully understand) and
>
> If you have an ECDF, the jumps give you the data so you can just use
> fitdistr().
>
> I have the ECDF and the number of observations is ~150-300. My final
> understanding is therefore that fitdistr can be used to fit distributions by
> ML over left-censored data, which is equivalent to fit the distribution
> using cumulative probabilities (to go back to my original terminology).
>
> Is this understanding correct? I would highly appreciate any feedback,
> especially if I am misunderstanding the way to estimate the function
> parameters for this type of data.
>
> Thank you very much!
>
> Ahimsa
>
>
>
>
> On Sat, Feb 23, 2008 at 2:35 AM, Prof Brian Ripley <ripley at stats.ox.ac.uk>
> wrote:
>
>> On Sat, 23 Feb 2008, ahimsa campos-arceiz wrote:
>>
>>> Dear all,
>>>
>>> I'm trying to estimate the parameters of a lognormal distribution fitted
>>> from some data.
>>>
>>> The tricky thing is that my data represent the time at which I recorded
>>> certain events. However, in many cases I don't really know when the
>> event
>>> happened. I' only know the time at which I recorded it as already
>> happened.
>>
>> So this is a rather extreme form of censoring.
>>
>>> Therefore I want to fit the lognormal from the cumulative distribution
>>> function (cdf) rather than from the probability distribution function
>> (pdf).
>>>
>>> My understanding is that methods based on Maximum Likelihood (e.g.
>> fitdistr
>>> {MASS}) are based on the pdf. Nonlinear least-squares methods seem to be
>>> based on the cdf... however I was unable to use nls{stat} for lognormal.
>>
>> Not so: ML fitting can be done for censored data.  However, I don't think
>> you have a valid description here: it seems you never recorded a time at
>> which the event had not happened, and the most likely fit is a probability
>> mass at zero (since this is a perfect explanation for your data).
>>
>> To make any progress with censoring, you need to see both positive and
>> negative events.  If you told us that none of these events happened before
>> t=15, it would be possible to fit the model (although you would need far
>> more data to get a good fit).
>>
>> Generally code to handle censoring is in survival analysis: e.g. survreg()
>> in package survival.  In the terminiology of the latter, all your
>> observations are left-censored.
>>
>>> I found a website that explains how to fit univariate distribution
>> functions
>>> based on cumulative probabilities, including a lognormal example, in
>> Matlab:
>>>
>> http://www.mathworks.com/products/statistics/demos.html?file=/products/demos/shipping/stats/cdffitdemo.html
>>>
>>> and other programs like TableCurve 2D seem to do this too.
>>
>> Maybe, but that is a different problem.  If you have an ECDF, the jumps
>> give you the data so you can just use fitdistr().  (And you will see
>> comparing observed and fitted CDFs in MASS, the book.)
>>
>>
>>> There must be a straightforward method in R which I have overlooked. Any
>>> suggestion on how can I estimate these parameters in R or helpful
>> references
>>> are very much appreciated.
>>>
>>> (not sure if it helps but) here is an example of my type of data:
>>>
>>> treat.1 <- c(21.67, 21.67, 43.38, 35.50, 32.08, 32.08, 21.67, 21.67,
>> 41.33,
>>>        41.33, 41.33, 32.08, 21.67, 22.48, 23.25, 30.00, 26.00, 19.37,
>> 26.00
>>> ,
>>>        32.08, 21.67, 26.00, 26.00, 43.38, 26.00, 21.67, 22.48, 35.50,
>> 38.30,
>>>
>>>        32.08)
>>>
>>> treat.2 <- c(35.92, 12.08, 12.08, 30.00, 33.cy73, 35.92, 12.08, 30.00,
>> 56.00,
>>>        30.00, 35.92, 33.73, 12.08, 26.00, 54.00, 12.08, 12.08, 35.92,
>> 35.92
>>> ,
>>>        12.08, 33.73, 35.92, 63.20, 30.00, 26.00, 33.73, 23.50, 30.00,
>> 35.92
>>> ,
>>>        30.00)
>>>
>>> Thank you very much!
>>>
>>> Ahimsa
>>>
>>>
>>> --
>>> ahimsa campos-arceiz
>>> www.camposarceiz.com
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>> --
>> Brian D. Ripley,                  ripley at stats.ox.ac.uk
>> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/<http://www.stats.ox.ac.uk/%7Eripley/>
>> University of Oxford,             Tel:  +44 1865 272861 (self)
>> 1 South Parks Road,                     +44 1865 272866 (PA)
>> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>>
>
>
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595