[R] t test problem?

(Ted Harding) Ted.Harding at nessie.mcc.ac.uk
Wed Sep 22 13:07:07 CEST 2004


On 22-Sep-04 kan Liu wrote:
> Hi, Many thanks for your helpful comments and suggestions. The attached
> are the data in both log10 scale and original scale. It would be very
> grateful if you could suggest which version of test should be used. 
>  
> By the way, how to check whether the variation is additive (natural
> scale) or multiplicative (log scale) in R? How to check whether the
> distribution of the data is normal? 

As for additive vs multiplicative, this can only be judged in terms
of the process by which the values are created in the real world.
As for normality vs non-normality, an appraisal can often be made
simply by looking at a histogram of the data.

In your case, the commands
  hist(x,breaks=10000*(0:100))
  hist(y,breaks=10000*(0:100))
indicate that the distributions of x and y do not look at all
"normal", since they both have considerable positive skewness
(i.e. long upper tails relative to the main mass of the distribution).

This does strongly suggest that a logarithmic transformation would
give data which are more nearly normally distributed, as indeed
is confirmed by the commands
  hist(log(x))
  hist(log(y))
though in both cases the histograms show some irregularity compared
with what you would expect from a sample from a normal distribution:
the commands
  hist(log(x),breaks=0.2*(40:80))
  hist(log(y),breaks=0.2*(40:80))
show that log(x) has an excessive peak at around 11.7,
while log(y) has holes at around 11.1 and 12.1.

Nevertheless, this inspection of the data shows that the use of
log(x) and log(y) will come much closer to fulfilling the conditions
of validity of the t test than using the raw data x and y.
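
As a rough illustration of that comparison, here is a sketch on simulated data. The vector x_sim is a hypothetical lognormal stand-in, not your x, and the parameters are chosen only to give a similar scale. The mean-minus-median gap is merely a crude numerical companion to what the histograms show.

```r
# Simulated stand-in for the real data (assumption: lognormal; NOT the poster's x)
set.seed(1)
x_sim <- rlnorm(500, meanlog = 11.5, sdlog = 0.9)

# Side-by-side histograms on the raw and the log scale
op <- par(mfrow = c(1, 2))
hist(x_sim, breaks = 50, main = "raw scale")
hist(log(x_sim), breaks = 50, main = "log scale")
par(op)

# With a long upper tail the mean sits well above the median;
# after taking logs the two should nearly coincide.
raw_gap <- (mean(x_sim) - median(x_sim)) / sd(x_sim)
log_gap <- (mean(log(x_sim)) - median(log(x_sim))) / sd(log(x_sim))
c(raw = raw_gap, log = log_gap)
```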

However, it is not merely the *normality* of each which is needed:
the conditions for the usual t test also require that the two
populations sampled for log(x) and log(y) should have the same
standard deviations. In your case, this also turns out to be
nearly enough true:

  > sd(log(x))
  [1] 0.902579
  > sd(log(y))
  [1] 0.9314807
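
With SDs this close, the usual (pooled-variance) t test on the log scale would be the natural next step. A minimal sketch, again on simulated values: lx and ly are hypothetical stand-ins for log(x) and log(y), with sample sizes and means invented for the illustration.

```r
# Hypothetical stand-ins for log(x) and log(y); not the real data
set.seed(2)
lx <- log(rlnorm(200, meanlog = 11.6, sdlog = 0.90))
ly <- log(rlnorm(200, meanlog = 11.4, sdlog = 0.93))

# A ratio of SDs near 1 supports pooling the two variances
sd_ratio <- sd(lx) / sd(ly)

# var.equal = TRUE requests the classical two-sample t test
pooled <- t.test(lx, ly, var.equal = TRUE)
pooled
```

Note that t.test() does not pool the variances unless var.equal = TRUE is given explicitly.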

> PS, Can I confirm that do your suggestions mean that in order to check
> whether there is a difference between x and y in terms of mean I need
> check the distribution of x and that of y in both natural and log scales
> and to see which present normal distribution?

See above for an approach to this: the answer to your question is,
in effect, "yes". It could of course have happened that neither the
raw nor the log scale would be satisfactory, in which case you would
need to consider other possibilities. And, if the SDs had turned out
to be very different, you should not use the standard t test but
a variant which is adapted to the situation (e.g. the Welch test).
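
For what it is worth, the Welch variant is what t.test() gives by default. A small sketch with two invented samples, a and b, whose SDs are deliberately very different (these are made-up normal data, unrelated to yours):

```r
# Invented samples with clearly unequal spread (assumption for illustration)
set.seed(3)
a <- rnorm(40, mean = 0,   sd = 1)
b <- rnorm(60, mean = 0.5, sd = 3)

welch   <- t.test(a, b)                    # var.equal = FALSE is the default
student <- t.test(a, b, var.equal = TRUE)  # classical pooled test, for contrast

# Welch uses an approximate, generally fractional, number of degrees of freedom
c(welch = unname(welch$parameter), student = unname(student$parameter))
```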

You can, of course, also perform formal tests for skewness, for
normality, and for equality of variances.
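
A sketch of what such checks might look like, once more on simulated stand-ins (x_sim and y_sim are hypothetical). shapiro.test() and var.test() are in the standard stats package. Base R does not, as far as I know, provide a skewness function, so the helper skew() below is my own and only computes the moment coefficient of skewness rather than performing a test. Bear in mind that the F test in var.test() itself assumes normality.

```r
# Simulated stand-ins for the two samples (assumption: lognormal)
set.seed(4)
x_sim <- rlnorm(150, meanlog = 11.6, sdlog = 0.9)
y_sim <- rlnorm(150, meanlog = 11.4, sdlog = 0.9)

# Moment coefficient of skewness (hypothetical helper, not a base R function)
skew <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}
skew_raw <- skew(x_sim)       # expected to be clearly positive
skew_log <- skew(log(x_sim))  # expected to be near zero

# Shapiro-Wilk test of normality on each scale
sw_raw <- shapiro.test(x_sim)
sw_log <- shapiro.test(log(x_sim))

# F test for equality of the two variances on the log scale
vt <- var.test(log(x_sim), log(y_sim))

c(skew_raw = skew_raw, skew_log = skew_log,
  p_shapiro_raw = sw_raw$p.value, p_shapiro_log = sw_log$p.value,
  p_var = vt$p.value)
```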

Best wishes,
Ted.





