[R] goodness of "prediction" using a model (lm, glm, gam, brt, regression tree .... )

Corrado ct529 at york.ac.uk
Thu Sep 3 07:56:59 CEST 2009


Dear R-friends,

How do you test the goodness of prediction of a model when you predict on a
data set DIFFERENT from the training set?

Let me explain: you train your model M (e.g. glm, gam, regression tree, brt)
on a data set A with a response variable Y. You then predict that same
response variable Y on a different data set B (e.g. with predict.glm,
predict.gam, and so on). Data sets A and B are different in the sense that
they contain the same variable, for example temperature, but measured at
different sites, or over a different interval (e.g. B is a subinterval of A,
for interpolation, or a disjoint interval, for extrapolation). If you have
the measured values of Y on the new data B, how do you measure how good the
prediction is, that is, how well the model fits Y on B (in other words, how
well it predicts)?

In other words:

Y ~ T, data = A   for training
Y ~ T, data = B   for predicting

I have devised a couple of methods, based on 1) the standard deviation and
2) R^2, but I am unhappy with them.
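For concreteness, here is a minimal sketch in R of the setup I mean, with
out-of-sample error measures one could compute (the data, coefficients, and
variable names below are invented purely for illustration; RMSE, MAE, and a
"predictive R^2" are just candidate measures, not a recommendation):

set.seed(1)
## hypothetical training data A and new data B, sharing the variable T
A <- data.frame(T = runif(100, 0, 30))
A$Y <- 2 + 0.5 * A$T + rnorm(100)
B <- data.frame(T = runif(50, 10, 20))   # e.g. a subinterval of A (interpolation)
B$Y <- 2 + 0.5 * B$T + rnorm(50)

M    <- lm(Y ~ T, data = A)              # could equally be glm(), gam(), ...
pred <- predict(M, newdata = B)          # predictions on the new data B

rmse <- sqrt(mean((B$Y - pred)^2))       # root mean squared prediction error
mae  <- mean(abs(B$Y - pred))            # mean absolute prediction error
## "predictive R^2": 1 - SSE/SST evaluated on B; can be negative when the
## model predicts worse on B than the mean of B's observed Y
r2p  <- 1 - sum((B$Y - pred)^2) / sum((B$Y - mean(B$Y))^2)

It is measures along these lines (and their shortcomings on data outside the
training range) that I would like opinions on.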

Regards 
-- 
Corrado Topi

Global Climate Change & Biodiversity Indicators
Area 18, Department of Biology
University of York, York, YO10 5YW, UK
Phone: + 44 (0) 1904 328645, E-mail: ct529 at york.ac.uk
