[R] ordering of factor levels in regression changes result

Tulinsky, Thomas TTulinsky at DIRECTV.com
Fri Feb 3 22:16:54 CET 2012


I was surprised to find that just changing the base level of a factor variable changed the number of significant coefficients in the solution.

I was surprised at this and want to know how I should choose the order of the factors, if the order affects the result.

Here is the small example. It is taken from 'The R Book', Crawley  p. 365. 

The data is at 
http://www.bio.ic.ac.uk/research/mjcraw/therbook/data/competition.txt

In R

   > comp<-read.table("C:\\Temp\\competition.txt", header=T)

   > attach(comp)

Data has dependent variable 'biomass' and different types of 'clipping' that were done:
Control (none), n25, n50, r10, r5:

   >  summary(comp)
       biomass         clipping
   Min.   :415.0   control:6  
    1st Qu.:508.8   n25    :6  
    Median :568.0   n50    :6  
    Mean   :561.8   r10    :6  
    3rd Qu.:631.8   r5     :6  
    Max.   :731.0              

List mean Biomass of each type of Clipping:

   > aggregate (comp$biomass, list (comp$clipping) , mean)
     Group.1        x
    control 465.1667
        n25 553.3333
        n50 569.3333
        r10 610.6667
         r5 610.5000

do regression - get same result as book p. 365
Clipping type 'control' is not in list of coefficients because it is first alphabetically so it is folded into Intercept

   > model<-lm(biomass ~ clipping)
   > summary(model)

   Call:
   lm(formula = biomass ~ clipping)

   Residuals:
        Min       1Q   Median       3Q      Max 
   -103.333  -49.667    3.417   43.375  177.667 

   Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
   (Intercept)   465.17      28.75  16.177  9.4e-15 ***
   clippingn25    88.17      40.66   2.168  0.03987 *  
   clippingn50   104.17      40.66   2.562  0.01683 *  
   clippingr10   145.50      40.66   3.578  0.00145 ** 
   clippingr5    145.33      40.66   3.574  0.00147 ** 
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

   Residual standard error: 70.43 on 25 degrees of freedom
   Multiple R-squared: 0.4077,     Adjusted R-squared: 0.3129 
   F-statistic: 4.302 on 4 and 25 DF,  p-value: 0.008752 


Relevel - make 'n25' the base level of Clipping:

   > comp$clipping <- relevel (comp$clipping, ref="n25")

   > summary(comp)
       biomass         clipping
    Min.   :415.0   n25    :6  
    1st Qu.:508.8   control:6  
    Median :568.0   n50    :6  
    Mean   :561.8   r10    :6  
    3rd Qu.:631.8   r5     :6  
    Max.   :731.0              

Redo LM with releveled data

   > modelRelev<-lm(biomass~clipping, data=comp)

Different results. (Some parts, Residuals and Std Errors,  are the same)
Especially note the Pr and Signifcance columns are different.

   > summary(modelRelev)

   Call:
   lm(formula = biomass ~ clipping, data = comp)

   Residuals:  
        Min       1Q   Median       3Q      Max 
   -103.333  -49.667    3.417   43.375  177.667 

   Coefficients:
                   Estimate Std. Error t value Pr(>|t|)    
   (Intercept)       553.33      28.75  19.244   <2e-16 ***
   clippingcontrol   -88.17      40.66  -2.168   0.0399 *  
   clippingn50        16.00      40.66   0.393   0.6973    
   clippingr10        57.33      40.66   1.410   0.1709    
   clippingr5         57.17      40.66   1.406   0.1721    
   ---
   Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 

   Residual standard error: 70.43 on 25 degrees of freedom
   Multiple R-squared: 0.4077,     Adjusted R-squared: 0.3129 
   F-statistic: 4.302 on 4 and 25 DF,  p-value: 0.008752



More information about the R-help mailing list