# [Rd] discrepancy between lm and MASS:rlm

Mon Mar 14 18:36:42 CET 2011

Dear R-devel,

There seems to be a discrepancy in the order in which lm and rlm evaluate their arguments. This causes rlm to sometimes produce an error where lm is just fine.

Here is a little script that illustrate the issue:

> library(MASS)
> ## create data
> n <- 100
> dat <- data.frame(x=rep(c(-1,0,1), n), y=rnorm(3*n))
>
> ## call lm, works fine
> summary(lm(y ~ as.factor(x), data=dat, subset=x!=0))

Call:
lm(formula = y ~ as.factor(x), data = dat, subset = x != 0)

Residuals:
Min       1Q   Median       3Q      Max
-2.60619 -0.82160  0.06307  0.65501  2.56677

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept)   0.061010   0.100027   0.610    0.543
as.factor(x)1 0.001332   0.141459   0.009    0.992

Residual standard error: 1 on 198 degrees of freedom
Multiple R-squared: 4.479e-07,  Adjusted R-squared: -0.00505
F-statistic: 8.868e-05 on 1 and 198 DF,  p-value: 0.9925

> ## call rlm, error
> summary(rlm(y ~ as.factor(x), data=dat, subset=x!=0))
Error in rlm.default(x, y, weights, method = method, wt.method = wt.method,  :
'x' is singular: singular fits are not implemented in rlm
>

My guess is that rlm first converts x to a factor, which becomes a three-level factor, then subsets on x!=0, which effectively eliminates a level, and then creates a "regression" matrix, which becomes singular due to the absence of data for a level.

Is there a simple way to work around it. The simplest I could think of is
with(subset(dat, x!=0), rlm(y ~ as.factor(x))
which is ok, but most of my scripts make use of data arg to regressions and I'd like to stay consistent as much as practical.

Thanks,