[R] step, leaps, lasso, LSE or what?

Fri Mar 1 12:22:08 CET 2002

On Fri, 1 Mar 2002, Jari Oksanen wrote:

> ripley at stats.ox.ac.uk said:
> > A second difference is the purpose of selecting a model.  AIC is
> > intended to select a model which is large enough to include the `true'
> > model, and hence to give good predictions.  There over-fitting is not
> > a real problem. (There are variations on AIC which do not assume some
> > model considered is true.)   This is a different aim from trying to
> > find the `true' model or trying to find the smallest adequate model,
> > both aims for explanation not prediction.
>
> This may be a stupid question, but perhaps I won't be lashed if I
> confess my stupidity as a preventive measure. About minimal adequate
> model*s*:  Murray Aitkin et al. have a book called "Statistical
> Modelling in GLIM" (Ox UP, 1989) where they tell how to find a set of
> adequate models in glm (with GLIM), and how one or *several* of these
> adequate models may be minimal.  When I read the book, I found this
> an attractive concept since it showed that you may have several about
> equally good models with different terms, although usual selection
> procedures (including best subsets) find only one.  I have quite often
> seen people use automatic selection in several subsets and then
> say that the subsets are different because different regressors were
> selected -- although the same regressors could have been about as good,
> but they were never evaluated.

The concept is well-known: Cox for example stresses finding sets of small
adequate models.  That's yet a different aim, as often only one explanation
is required (or accepted).  There is a lot on sets of adequate models:
Raftery's Occam's window for example (see the reference in my first post).
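
The idea of several about equally good models with different terms is
easy to illustrate in R.  What follows is only a rough sketch: it is
neither Aitkin's simultaneous-test procedure nor Raftery's Occam's window,
just an all-subsets enumeration ranked by AIC.  The built-in `swiss' data,
the response Fertility and the cut-off of 2 AIC units are all arbitrary
choices made for illustration.

```r
# Fit every subset of the candidate regressors with lm() and keep all
# models whose AIC lies within `window` of the smallest AIC found.
data(swiss)                        # built-in data set, used only as an example
response   <- "Fertility"          # illustrative choice of response
candidates <- setdiff(names(swiss), response)
window     <- 2                    # arbitrary cut-off (an assumption)

# All subsets of the candidates, starting with the empty (intercept-only) one
subsets <- list(character(0))
for (k in seq_along(candidates)) {
  subsets <- c(subsets, combn(candidates, k, simplify = FALSE))
}

# One lm() fit per subset
fits <- lapply(subsets, function(vars) {
  rhs <- if (length(vars)) vars else "1"
  lm(reformulate(rhs, response = response), data = swiss)
})

# Collect AIC values and their distance from the best model
results <- data.frame(
  model = sapply(subsets, function(v) {
    if (length(v)) paste(v, collapse = " + ") else "1"
  }),
  size  = lengths(subsets),
  aic   = sapply(fits, AIC),
  stringsAsFactors = FALSE
)
results$delta <- results$aic - min(results$aic)

# Every model within the window, best first
near_best <- results[results$delta <= window, ]
near_best <- near_best[order(near_best$delta), ]
print(near_best, row.names = FALSE)
```

Rows of the printed table that contain different regressors are the
"about equally good" alternatives.  Whether any of them is adequate is a
separate question that AIC differences alone do not answer.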

> Now the question: Aitkin's procedure would be very easy to perform in R
> (well, it was easy even in dear old GLIM!), but I have hardly seen it

`dear' as in `expensive' is my memory.

> used. Is there a reason for this? Is there something dubious in minimal
> adequate models that makes them a no-no, an Erlkönig that catches us
> innocent children?

Not in general, but the lack of adoption of the method is a fair indication
of how it was respected.  I've now forgotten the technical flaws.

> Bibliographic note: I know the procedure from the Aitkin et al. book,
> and haven't checked the original references. These sources are cited in
> the book:
>
> Aitkin, M. A. 1974. Simultaneous inference and choice of variable
> subsets in multiple regression. Technometrics 16, 221--227.
>
> Aitkin, M. A. 1978. The analysis of unbalanced cross-classifications
> (with Discussion). J. Roy. Stat. Soc. A 141, 195--223.

I suggest you do read that paper, especially the discussion.  I use it as a
case study in my MSc class on how *not* to do model selection.  It's a very
good illustration of many of the points of my first posting against fully
automated procedures.

There are several analyses of that example in MASS, which select
alternative models and spot many things that Aitkin overlooked.  Do read
Bill Venables' commentaries in MASS too.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
