[Rd] predict (PR#2686)

Tue Apr 1 10:48:49 MEST 2003

>> <Bravington wrote:>
>>>> `predict' complains about new factor levels, even if the "new" levels are
>>>> merely levels in the original that didn't occur in the original fit and were
>>>> sensibly dropped, and that don't occur in the prediction data either.

>> <Ripley replied:>
>>> This is intentional.  The coding for factors is based on the full set of
>>> levels, and should be comparable for different prediction sets.
>>>
>>> If you are using factors with fictitious levels the fix is obvious:
>>> improve the design.

>> <Bravington again:>
>> There is still an inconsistency bug between `lm' and `predict.lm', though.
>> `lm' intentionally overlooks inactive levels of a factor,

> <Ripley again:>
> Only if an argument is set, and originally lm did not do so.

<Bravington again:>
But `lm' always does this now, doesn't it? -- even if it didn't originally.
As far as I can see, there is no way to stop `lm' dropping unused levels,
even if you wanted to.

>> but `predict.lm' doesn't, even when it legitimately could.
>> In particular, it is a bit odd to
>> have no problem predicting without a `newdata' argument even when the
>> original data had inactive factor levels, but then to get an error if
>> `newdata=<<original data>>' is supplied explicitly! (See example.)
>
> <Ripley:>
> Read again.  predict.lm is consistent across its inputs: unlike lm it can
> take variable `newdata'.  As I said the intention is to be consistent
> across *prediction sets*.  Omitting newdata is not giving a prediction
> set.

<Bravington again:>
Mmm-- that's getting a bit metaphysical for me-- when is a prediction not a
prediction, and what is ``predict'' actually doing if it is not predicting?!

Anyhow, according to the help page for `predict.lm':

     If the fit is rank-deficient, some of the columns of the design
     matrix will have been dropped.  Prediction from such a fit only
     makes sense if `newdata' is contained in the same subspace as the
     original data. That cannot be checked accurately, so a warning is
     issued.

The subspace condition is obviously satisfied if the prediction data is the
same as the original data-- so prediction does "make sense" in that context
according to the documentation (as well as common sense. Normally I am no
fan of slavish adherence to documentation, but in my own interests I'll make
an exception...). And yet there's an error message, not even a warning.
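
A minimal sketch of the situation, for concreteness. This is not the example
from the original report; the data and the names `d', `fit' and `outcome' are
made up for illustration. Whether the explicit call stops, warns, or runs
cleanly may depend on the version of R, so the outcome is captured rather than
assumed:

```r
# Toy data: factor `f' declares a level "c" that no row actually uses.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  f = factor(rep(c("a", "b"), each = 3), levels = c("a", "b", "c"))
)

fit <- lm(y ~ f, data = d)

# Implicit prediction set: no `newdata', so the fitted model's own data is used.
p_implicit <- predict(fit)

# Explicit prediction set: the very same data, unused level "c" included.
# Record what happens instead of assuming a particular result.
outcome <- tryCatch(
  {
    p_explicit <- predict(fit, newdata = d)
    "ok"
  },
  warning = function(w) paste("warning:", conditionMessage(w)),
  error   = function(e) paste("error:", conditionMessage(e))
)
print(outcome)
```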

Prediction from the original data was just an example, of course; my general
proposal is that inactive factor levels in the prediction set should be
dropped. I don't see how this could ever cause inconsistent behaviour across
prediction sets-- have I missed something?
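
In the meantime, the proposal can be approximated on the user side along the
following lines. `drop_unused' is a hypothetical helper, not part of R, and the
data are invented; it simply re-creates each factor in the prediction data so
that only the levels that actually occur are kept:

```r
# Toy data: factor `f' declares a level "c" that no row actually uses.
d <- data.frame(
  y = c(1.0, 2.1, 2.9, 4.2, 5.1, 5.8),
  f = factor(rep(c("a", "b"), each = 3), levels = c("a", "b", "c"))
)

fit <- lm(y ~ f, data = d)

# Levels that `lm' recorded for the fit (the unused one has been dropped).
fit$xlevels$f

# Hypothetical helper: drop inactive levels from every factor column.
# Calling factor() on a factor keeps only the levels that are present.
drop_unused <- function(df) {
  df[] <- lapply(df, function(col) if (is.factor(col)) factor(col) else col)
  df
}

nd <- drop_unused(d)
levels(nd$f)

# Predicting from the cleaned copy of the original data should reproduce
# the predictions obtained without any `newdata' at all.
p_new <- predict(fit, newdata = nd)
p_old <- predict(fit)
all.equal(p_new, p_old)
```

A level that really is new, i.e. one that occurs in the prediction data but not
in the fit, survives the helper and so would still be rejected by `predict',
which is the behaviour one wants.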

cheers
Mark

*******************************

Mark Bravington
CSIRO (CMIS)
PO Box 1538
Castray Esplanade
Hobart
TAS 7001

phone (61) 3 6232 5118
fax (61) 3 6232 5012
Mark.Bravington at csiro.au