[R] Neural Nets (nnet) - evaluating success rate of predictions

hadley wickham h.wickham at gmail.com
Tue May 8 08:31:07 CEST 2007


On 5/7/07, Bert Gunter <gunter.berton at gene.com> wrote:
> Folks:
>
> If I understand correctly, the following may be pertinent.
>
> Note that the procedure:
>
> min.nnet = nnet[k] such that error rate of nnet[k] = min[i] {error
> rate(nnet(training data) from ith random start) }
>
> does not guarantee a classifier with a lower error rate on **new** data than
> any single one of the random starts. That is because you are using the same
> training set to choose the model (= nnet parameters) as you are using to
> determine the error rate. I know it's tempting to think that choosing the
> best among many random starts always gets you a better classifier, but it
> need not. The error rate on the training set for any classifier -- be it a
> single one or one derived in some way from many -- is a biased estimate of
> the true error rate, so that choosing a classifier on this basis does not
> assure better performance for future data. In particular, I would guess that
> choosing the best among many (hundreds/thousands) random starts is probably
> almost guaranteed to produce a poor predictor (ergo the importance of
> parsimony/penalization).  I would appreciate comments from anyone, pro or
> con, with knowledge and experience of these things, however, as I'm rather
> limited on both.

I agree - it's never a good idea to use the same data for creating
your classifier and determining its effectiveness (I meant to say
"pick the one with the lowest error rate on your TEST data").

The reason to choose from many random starts is that fitting a given
neural network _model_ (i.e. fixed inputs x, n hidden nodes, ...) is
very hard due to the large overparameterisation of the problem space.
For example, the parameters for one node in a given layer can be
exchanged with the parameters of another node (along with the
parameters that use those nodes in the next layer) without changing
the overall model.  This makes the objective very hard to optimise,
and nnet in R often gets stuck in local minima.
Looking at what individual nodes are doing, you often see examples
where some nodes contribute nothing to the overall classification.
The random starts aren't there to find different models, but to find
the parameter values for the given model that fit best.  Following
this line of argument, you would probably want to select among the
random starts using the internal criterion value (the fitting
criterion nnet itself minimises), rather than some external measure
of accuracy.
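
To make that concrete, here's a rough sketch of what I mean (the data
frame train, its factor response y, and the size/decay/maxit settings
are just placeholders, not a recommendation):

library(nnet)

## Fit the same model from a number of random starts; each call to
## nnet() initialises the weights randomly.
fits <- lapply(1:20, function(i)
  nnet(y ~ ., data = train, size = 5, decay = 0.01,
       maxit = 500, trace = FALSE))

## Keep the start with the smallest internal fitting criterion (the
## value component of the fit), rather than re-scoring the training
## data with an external error rate.
best <- fits[[which.min(sapply(fits, function(f) f$value))]]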

> The simple answer to the question of obtaining the error rate using
> validation data is: Do whatever you like to choose/fit a classifier on the
> training set. **Once you are done,** the estimate of your error rate is the
> error rate you get on applying that classifier to the validation set. But
> you can do this only once! If you don't like that error rate and go back to
> finding a better predictor in some way, then your validation data have now
> been used to derive the classifier and thus have become part of the training
> data, so any further assessment of the error rate of a new classifier on those
> data is now also a biased estimate. You need yet new validation data for that.
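
For what it's worth, a minimal sketch of the one-shot validation
workflow you describe (the data frame dat, its factor response y, and
the 70/30 split are illustrative assumptions only):

library(nnet)

set.seed(1)
## Hold out a validation set before any model fitting or selection.
in_train <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[in_train, ]
valid <- dat[-in_train, ]

## Choose/fit the classifier using the training data only.
fit <- nnet(y ~ ., data = train, size = 5, decay = 0.01,
            maxit = 500, trace = FALSE)

## Apply it to the validation data exactly once to estimate the
## error rate.
pred <- predict(fit, newdata = valid, type = "class")
mean(pred != valid$y)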

Understanding that the estimate from re-used validation data is
biased is important, but in practice, do people really care that
much?  If you have looked at a single plot of your data and used that
to inform your choice of classifier, your estimates will already be
biased (although if you have used other knowledge of the data or
subject area, you might expect them to be biased in a positive
direction).  Are the estimates of model error really the most
important thing?  Surely an understanding of the problem/data is what
you are really trying to gain.

Hadley


