[R] Surprising results from summary(lm()) on data with NO variation

Warnes, Gregory R gregory_r_warnes at groton.pfizer.com
Fri Dec 13 21:34:03 CET 2002


I have some data (from Affymetrix experiments) where I fit an aov() model to
a large number of outcome variables.  A reasonable fraction of the outcome
variables have 0 variability because values below a cutoff have been
replaced with the cutoff value (often 20) .  

In this case, the overall p-value from summary(lm(..))  is misleadingly
small:

For example (equivalent results in R 1.6.1 on both Solaris and for Windows
2000):


	> x <- rep(20, 4+5+5)  # no variation
	> Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) ))  #
3 treatment levels
	> summary( lm( x ~ Treatment)  )
	
	Call:
	lm(formula = x ~ Treatment)
	
	Residuals:
	       Min         1Q     Median         3Q        Max 
	-1.994e-14  7.889e-31  7.889e-31  7.889e-31  6.647e-15 
	
	Coefficients:
	               Estimate Std. Error   t value Pr(>|t|)    
	(Intercept)   2.000e+01  3.471e-15 5.762e+15   <2e-16 ***
	TreatmentHigh 6.647e-15  4.657e-15 1.427e+00    0.181    
	TreatmentLow  6.647e-15  4.657e-15 1.427e+00    0.181    
	---
	Signif. codes:  0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1 
	
	Residual standard error: 6.942e-15 on 11 degrees of freedom
	Multiple R-Squared: 0.5333,     Adjusted R-squared: 0.4485 
	F-statistic: 6.286 on 2 and 11 DF,  p-value: 0.01512
                                          ^^^^^^^^^^^^^^^^

Note F statistic and the overall p-value.     

For some values (eg 0 and 7),  fitting this model will generate NA's for the
estimates.  For most values, we get similar results.  I presume that this is
due to numerical problems in the fitting algorithm.  Still the result is
quite surprising.

It appears that summary.aov doesn't have the same problem:

	> x <- rep(20, 4+5+5)  # no variation
	> Treatment <- factor(rep( c("Control","Low","High"), c(4,5,5) ))  #
3 treatment levels
	> summary( aov( x ~ Treatment)  )
	            Df     Sum Sq    Mean Sq F value Pr(>F)
	Treatment    2 1.2622e-28 6.3110e-29  1.3095 0.3089
	Residuals   11 5.3011e-28 4.8190e-29               

Perhaps the mechanism for computing the F statistic that summary.aov uses is
less susceptible to the 'no variability' problem than the computation in
summary.lm.

I'm posting this here  to raise a warning flag.  For my particular
application, I'll just check for the no variability case and handle it
separately.  I'll leave it to the R core folks to decide if this is a big
enough problem to do something about in the code.

-Greg


LEGAL NOTICE\ Unless expressly stated otherwise, this message is ... [[dropped]]




More information about the R-help mailing list