[R] sciplot question
spencer.graves at prodsyse.com
Mon May 25 16:01:05 CEST 2009
Frank E Harrell Jr wrote:
> spencerg wrote:
>> Dear Frank, et al.:
>> Frank E Harrell Jr wrote:
>>> Yes; I do see a normal distribution about once every 10 years.
>> To what do you attribute the nonnormality you see in most cases?
>> (1) Unmodeled components of variance that can generate
>> errors in interpretation if ignored, even with bootstrapping?
>> (2) Honest outliers that do not relate to the phenomena of
>> interest and would better be removed through improved checks on data
>> quality, but where bootstrapping is appropriate (provided the data
>> are not also contaminated with (1))?
>> (3) Situations where the physical application dictates a
>> different distribution such as binomial, lognormal, gamma, etc.,
>> possibly also contaminated with (1) and (2)?
>> I've fit mixtures of normals to data before, but one needs to be
>> careful about not carrying that to extremes, as the mixture may be a
>> result of (1) and therefore not replicable.
>> George Box once remarked that he thought most designed
>> experiments included split plotting that had been ignored in the
>> analysis. That is only a special case of (1).
>> Spencer Graves
> Those are all important reasons for non-normality of marginal
> distributions. But the biggest reason of all is that the underlying
> process did not know about the normal distribution. Normality in raw
> data is usually an accident.
Might there be a difference between the physical and social
sciences on this issue?
The central limit effect works pretty well with many kinds of
manufacturing data, except that it is often masked by between-lot
components of variance. The first differences in log(prices) are often
long-tailed and negatively skewed. Standard GARCH and similar models
handle the long tails well but miss the skewness, at least in what I've
seen. I think that can be fixed, but I have not yet seen it done.
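
The GARCH remark above can be checked with a small simulation. What follows is only a minimal sketch, assuming a textbook GARCH(1,1) recursion with standard normal innovations; the parameter values (omega, alpha, beta), the seed, and the function names are illustrative assumptions, not taken from any fitted model or package. Because the innovations are symmetric, the simulated returns should show positive excess kurtosis (long tails) while the sample skewness stays close to zero.

```python
import math
import random


def simulate_garch11(n, omega=0.1, alpha=0.1, beta=0.8, seed=1):
    """Simulate n returns from a GARCH(1,1) process with N(0, 1) innovations.

    The parameter values are illustrative assumptions, not estimates.
    """
    rng = random.Random(seed)
    var = omega / (1.0 - alpha - beta)  # start at the unconditional variance
    returns = []
    for _ in range(n):
        r = math.sqrt(var) * rng.gauss(0.0, 1.0)
        returns.append(r)
        # Conditional variance for the next step
        var = omega + alpha * r * r + beta * var
    return returns


def skew_and_excess_kurtosis(x):
    """Return (sample skewness, sample excess kurtosis) from central moments."""
    n = len(x)
    mean = sum(x) / n
    m2 = sum((v - mean) ** 2 for v in x) / n
    m3 = sum((v - mean) ** 3 for v in x) / n
    m4 = sum((v - mean) ** 4 for v in x) / n
    return m3 / m2 ** 1.5, m4 / m2 ** 2 - 3.0


returns = simulate_garch11(100_000)
skew, excess_kurt = skew_and_excess_kurtosis(returns)
print(f"skewness = {skew:.3f}, excess kurtosis = {excess_kurt:.3f}")
```

Reproducing negatively skewed returns in this setup would require changing the model, for example by using a skewed innovation distribution or an asymmetric variance equation; neither is shown here.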
Social science data, however, often involve discrete scales where
the raters' interpretations of the scales rarely match any standard
distribution. Transforming to latent variables, e.g., via factor
analysis, may help but does not eliminate the problem.
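
The point about raters' interpretations of a discrete scale can be illustrated the same way. This is a rough sketch under stated assumptions: a single standard normal latent trait, a 1 to 5 scale, and two hypothetical raters ("strict" and "lenient") whose cutpoints are invented for the example. Even though the latent variable is exactly normal, each rater's observed ratings pile up in a different part of the scale, and the pooled ratings follow no standard distribution.

```python
import random
from collections import Counter


def rate(latent, cutpoints):
    """Map a latent score to a 1..5 rating using one rater's cutpoints."""
    return 1 + sum(latent > c for c in cutpoints)


rng = random.Random(2)
latent = [rng.gauss(0.0, 1.0) for _ in range(20_000)]

# Hypothetical cutpoints: each rater reads the same scale differently.
strict = [-0.5, 0.5, 1.5, 2.5]
lenient = [-2.5, -1.5, -0.5, 0.5]

strict_ratings = [rate(z, strict) for z in latent]
lenient_ratings = [rate(z, lenient) for z in latent]
pooled = strict_ratings + lenient_ratings

for name, ratings in [("strict", strict_ratings),
                      ("lenient", lenient_ratings),
                      ("pooled", pooled)]:
    counts = Counter(ratings)
    # Frequency of each rating 1..5
    print(name, [counts[k] for k in range(1, 6)])
```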
Thanks for your comments.