[R] Basic general statistical problem.

Sat Feb 28 12:21:33 CET 2004

On Sat, 28 Feb 2004, Fulvio Copex wrote:

> Hello everyone,
> I'd like to have suggestions about a common basic statistical approach, 
> hope to be useful also for other R beginners.

Isn't this about statistics beginners?  You seem to be assuming univariate 
data and a very specific distribution form.  There are standard 
procedures, but yours seems erroneous.

> When you first get some data (i.e length of river) you may want to look at its distribution.
> Then you probably want to find which law follows this distribution, 
> and to test the goodness of the fit.
>  
> For doing this simple analysis I am writing some code schematically represented below: 
> 1)calculate the histogram:
> myhist<-hist(myData, breaks=....)
> Question: how to calculate the standard deviation of each class of the hist? I have not seen it in the output of hist

The counts are jointly multinomial.  Approximately each count is 
Poisson-distributed, so the standard deviation is approximately the square 
root of the count.  (You seem to be assuming that at 3).)

> 2)Looking at the graph it seems to follow a linear model:

What does that mean?  The histogram is a density estimate, and 
theoretically a pdf cannot be linear unless you restrict the range, and 
even then you would want to constrain the estimation.  You want to the fit 
the pdf to the actual data, not grouped data, if you can.

> I plot the points: points(myhist$mids,myhist$counts)
> Question: How to plot also the weights (vertical segments)?

> 3)I Calculate the linear equation using "lm" (in the case of linear
> model) knowing the weights computed in points 2).

> 4)To test the goodness of the fit, a simply way is to use the reduced
> chi squared test which I haven't found on the base package. But it is
> simple to calculate like this
> chisq.reduced<-(1/N)*sum((e-o)/w^2)
> where e=expected values from fit
> o=observed values
> w=weights

That is not a correct test, as the counts are not independent (they must 
sum to one).  I presume you have a typo: it is (o-e)^2/e, where I think e 
and w are probably the same thing.

You can use chisq.test to do the correct test.

> 5) Conclusion: If my chisq is lower than 1 I can conclude the model
> approximate well my data distribution.

No, you need to refer the correct statistic to a proper chi-squared
distribution.  If you fit parameters of the distribution, the theory
assumes that you fitted them by maximum-likelihood (to the grouped data).

> Is it a good analysys of the problem?

No.

> Any answers to my questions or a better standard procedure (or package)
> where this work can be done easily, for the basic kind of distribution
> types?

?fitdistr (in MASS) for how to fit univariate distributions, and ?density
for how to find non-parametric density estimates.  ?chisq.test for
Chi-squared test of goodness of fit.  library(stepfun); ?ecdf for other
ways to examine data, and ?ks.test for other fit statistics which may be
more appropriate for continuous univariate measurements.

> Any kind of answer should be appreciated, including documentations or
> tutorial.

This sort of thing is covered in most introductory statistics classes and 
texts.   I think you need to seek the advice of a local statistical 
consultant.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595