[Rd] R and Gnumeric

Peter Dalgaard P.Dalgaard at biostat.ku.dk
Mon Jun 9 16:34:53 CEST 2008


Jean Bréfort wrote:
> One other totally unrelated thing. We got recently a bug report about an
> incorrect R squared in gnumeric regression code
> (http://bugzilla.gnome.org/show_bug.cgi?id=534659). R (version 2.7.0)
> gives the same result as Gnumeric, as can be seen below:
>
>   
>> mydata <- read.csv(file="data.csv",sep=",")
>> mydata
>>     
>   X  Y
> 1 1  2
> 2 2  4
> 3 3  5
> 4 4  8
> 5 5  0
> 6 6  7
> 7 7  8
> 8 8  9
> 9 9 10
>   
>> summary(lm(mydata$Y~mydata$X))
>>     
>
> Call:
> lm(formula = mydata$Y ~ mydata$X)
>
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -5.8889  0.2444  0.5111  0.7111  2.9778 
>
> Coefficients:
>             Estimate Std. Error t value Pr(>|t|)  
> (Intercept)   1.5556     1.8587   0.837   0.4303  
> mydata$X      0.8667     0.3303   2.624   0.0342 *
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
>
> Residual standard error: 2.559 on 7 degrees of freedom
> Multiple R-squared: 0.4958,	Adjusted R-squared: 0.4238 
> F-statistic: 6.885 on 1 and 7 DF,  p-value: 0.03422 
>
>   
>> summary(lm(mydata$Y~mydata$X-1))
>>     
>
> Call:
> lm(formula = mydata$Y ~ mydata$X - 1)
>
> Residuals:
>     Min      1Q  Median      3Q     Max 
> -5.5614  0.1018  0.3263  1.6632  3.5509 
>
> Coefficients:
>          Estimate Std. Error t value Pr(>|t|)    
> mydata$X   1.1123     0.1487   7.481 7.06e-05 ***
> ---
> Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
>
> Residual standard error: 2.51 on 8 degrees of freedom
> Multiple R-squared: 0.8749,	Adjusted R-squared: 0.8593 
> F-statistic: 55.96 on 1 and 8 DF,  p-value: 7.056e-05 
>
> I am unable to figure out what this 0.8749 value might represent. If it
> is intended to be the squared Pearson (product-moment) correlation, it
> should be 0.4958, and if it is the coefficient of determination, I think
> the correct value would be 0.4454, as given by Excel. It's of course nice
> to have the same result in R and Gnumeric, but it would be better if this
> result were accurate (if it is, we need some documentation fix). Btw, I
> am not a statistics expert at all.
>   
This horse has been flogged multiple times on the list.

It is of course mainly a matter of convention, but the convention used
by R has been around at least since Genstat in the mid-1970s. In the
no-intercept case, you get the _uncentered_ version of R-squared; that
is, the proportion of the total sum of squares, taken about zero,
explained by the model (as opposed to the sum of squares of _deviations_
from the mean, as in the usual case). The rationale is that R^2 should
measure the reduction in residual variation between two nested models,
and if there's no intercept, the only well-determined nested submodel is
the one in which mydata$Y has mean zero for all x, i.e. the one with all
regression coefficients set to zero. The resulting R^2 is directly
related to the F statistic, which you'll see is also larger and more
significant when the intercept is removed.
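To make the two conventions concrete, here is a minimal sketch in R
using the data from the report above; the numbers check against both
summaries:

x <- 1:9
y <- c(2, 4, 5, 8, 0, 7, 8, 9, 10)

fit0 <- lm(y ~ x - 1)                # no-intercept fit
rss  <- sum(residuals(fit0)^2)       # residual sum of squares

## Uncentered R^2: total SS taken about zero, i.e. relative to the
## nested model "y has mean zero". This is what R reports.
1 - rss / sum(y^2)                   # 0.8749

## Centered R^2: total SS taken about mean(y). This is Excel's value.
1 - rss / sum((y - mean(y))^2)       # 0.4454

## The uncentered R^2 reproduces the reported F statistic:
## F = R^2 / (1 - R^2) * df.residual / df.model
r2 <- 1 - rss / sum(y^2)
r2 / (1 - r2) * 8 / 1                # 55.96, as in summary(fit0)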

BTW: lm(mydata$Y~mydata$X) is bad practice; use lm(Y~X, data=mydata)
instead. Trying to use predict() with new data will demonstrate why, as
sketched below.
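A minimal sketch of the failure mode (the data frame just mirrors the
one above):

mydata <- data.frame(X = 1:9, Y = c(2, 4, 5, 8, 0, 7, 8, 9, 10))

good <- lm(Y ~ X, data = mydata)
predict(good, newdata = data.frame(X = 10))  # one prediction at X = 10

bad <- lm(mydata$Y ~ mydata$X)
predict(bad, newdata = data.frame(X = 10))
## The term in the formula is literally "mydata$X", which is not a
## column of newdata, so it is evaluated in the enclosing environment
## instead -- you typically just get back the fitted values for the
## original nine observations rather than a prediction at X = 10.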

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907


