[R] ppr, number of terms, and data ordering

Wed Jun 6 18:33:44 CEST 2001

On Wed, 6 Jun 2001 david.beede at mail.doc.gov wrote:

>
> Dear R listers --
>
> I have several questions about using the ppr command in the modreg module.
>
> I discovered -- quite by accident -- that if I re-order the data, I obtain different
> results.  The output below shows what I mean.  I have two datasets (dataset1 and dataset2)
> that are identical (tested using proc compare in SAS) except for the fact that the records
> are in different order.  Below I have pasted in the results from running ppr on the two
> data sets in their original order (first and third sets of results below) and running
> ppr after sorting the datasets by idnum (second and fourth sets of results below).

Because there can be multiple local minima in the fit criterion, this is
not at all unusual.  Your mistake is assuming that there is a single
possible answer.  The same holds for nnet, and ppr is a more general form
of neural net.  It is not so widely known, but it is true of gams too.
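
A minimal sketch of how one might check this on one's own data, assuming
simulated data and made-up variable names (id, x1, x2, y), none of which come
from the original post.  In current versions of R, ppr lives in package stats
rather than modreg.  The sketch fits the same records in two row orders and
compares predictions on identical rows; it makes no claim about how large the
difference will be.

```r
set.seed(1)                         # simulated data, purely illustrative
n <- 200
d <- data.frame(id = seq_len(n), x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2^2 + rnorm(n, sd = 0.2)

shuffled <- d[sample(n), ]          # same records, different row order

fit_a <- ppr(y ~ x1 + x2, data = d,        nterms = 2, max.terms = 4)
fit_b <- ppr(y ~ x1 + x2, data = shuffled, nterms = 2, max.terms = 4)

# Predict on the same rows, in the same order, so the comparison is like for like.
pred_a <- predict(fit_a, newdata = d)
pred_b <- predict(fit_b, newdata = d)

# Near zero if both runs converged to essentially the same fit.
max_diff <- max(abs(pred_a - pred_b))
max_diff
```

Comparing predictions rather than the projection directions avoids being
misled by fits that differ only in the sign or order of their terms.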

Having said that, I think you are vastly over-fitting.  The usual idea is
to have fewer terms than explanatory variables, or not many more.

> At first I thought that perhaps the regression parameters differ while the underlying
> results are equivalent, but the predicted values are significantly different.  I tried
> increasing the bass parameter, thinking that perhaps I was overfitting the data,
> but the differences in the regression parameters remained.  Finally, I originally had lots
> of other RHS variables, including indicator variables; stripping those variables out
> did not change my findings, as shown below.
>
> My first question is:  is there a recommended way to sort the data before running ppr?
> In the meantime, I'll try sorting by my two continuous RHS variables to see if it makes
> a difference -- not a definitive answer but it may be suggestive.

No.  But using a smoother other than supsmu (like smoothing splines)
is usually a better idea.  If there is a sensible model then most
orderings (or machines or compiler levels of optimization ...) will give
similar fits.  Seeing instability is probably a sign that the model is
not good.
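
As a rough illustration of trying the alternative smoothers, the sketch below
uses the sm.method argument of ppr ("supsmu", "spline" or "gcvspline") and, for
each, measures how far the predictions move when the rows are permuted.  The
data, the permutation and the helper order_gap are assumptions made up for the
example, and the numbers it prints say nothing about any particular real
dataset.

```r
set.seed(2)                         # simulated data, purely illustrative
n <- 200
d <- data.frame(x1 = runif(n), x2 = runif(n))
d$y <- sin(2 * pi * d$x1) + d$x2^2 + rnorm(n, sd = 0.2)
perm <- sample(n)                   # one arbitrary re-ordering of the rows

# Hypothetical helper: largest change in prediction caused by re-ordering,
# for a given smoother.
order_gap <- function(method) {
  f1 <- ppr(y ~ x1 + x2, data = d,          nterms = 2, sm.method = method)
  f2 <- ppr(y ~ x1 + x2, data = d[perm, ],  nterms = 2, sm.method = method)
  max(abs(predict(f1, newdata = d) - predict(f2, newdata = d)))
}

gaps <- sapply(c("supsmu", "spline", "gcvspline"), order_gap)
gaps
```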

> My second question is:  is my method for deciding on the number of terms in the regression
> okay?  What I am doing is first running ppr with a large maximum number of terms, then
> finding the number of terms that minimizes the goodness-of-fit statistic.  Looking at the
> cpus example in the section of MASS that deals with ppr (pp. 293-294), it is unclear why
> eight terms were finally chosen, when using ten terms yields a lower goodness-of-fit
> statistic.

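Because in theory the goodness-of-fit statistic decreases (numerically) as
the number of terms increases, so its minimum will tend to sit at the
largest model tried; the optimization is inexact, so the decrease need not
be perfectly monotone.  One is looking for a break in slope in the decrease.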
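
One way to look for such a break, sketched under assumptions: the data below
are simulated, and the choice of max.terms = 8 is arbitrary.  The gofn
component of a fitted ppr object holds the goodness-of-fit values for the
numbers of terms that were fitted, with zeros elsewhere.

```r
set.seed(3)                         # simulated data, purely illustrative
n <- 300
d <- data.frame(x1 = runif(n), x2 = runif(n), x3 = runif(n))
d$y <- sin(2 * pi * d$x1) + (d$x2 - d$x3)^2 + rnorm(n, sd = 0.2)

fit <- ppr(y ~ x1 + x2 + x3, data = d, nterms = 1, max.terms = 8)

terms_fitted <- which(fit$gofn > 0) # zero entries mean no fit was recorded
gof   <- fit$gofn[terms_fitted]
drops <- -diff(gof)                 # improvement from each additional term

data.frame(terms = terms_fitted, gof = gof)
drops
```

A number of terms after which the entries of drops become small relative to
the earlier ones is a candidate; picking the overall minimum of gof would
usually just select the largest model.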

> Finally, in the same example in MASS, where does the test.cpus() function come from?  I
> couldn't find it in the MASS table of contents on CRAN.

Because it is in MASS the book, not MASS the package.  Try the
scripts in the MASS package, specifically ch06.R or ch09.R.
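
A small sketch for locating the definition in an installed copy of MASS.  It
assumes the package ships its book scripts in a "scripts" directory, and the
chapter file names may differ between editions of the book, so it searches all
of them rather than relying on a particular file.

```r
# Returns "" if MASS (or its scripts directory) is not installed.
script_dir <- system.file("scripts", package = "MASS")
scripts <- list.files(script_dir, pattern = "\\.R$", full.names = TRUE)

# Keep the scripts that mention test.cpus anywhere in their text.
mentions <- function(f) {
  any(grepl("test.cpus", readLines(f, warn = FALSE), fixed = TRUE))
}
hits <- Filter(mentions, scripts)
basename(hits)
```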

Probably you need to read a lot more about the background, although the
papers are scattered and often inaccessible.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
