[R] How to estimate whether overfitting?
Frank E Harrell Jr
f.harrell at Vanderbilt.Edu
Mon May 10 14:58:51 CEST 2010
On 05/10/2010 12:32 AM, bbslover wrote:
> Many thanks. I can try to use a test set with 100 samples.
> Another question: how can I rationally split my data into a training set
> and a test set (training set with 108 samples, test set with 100 samples)?
> As far as I know, the test set should have the same distribution as the
> training set. What method can produce such a split, and which R packages
> can split data into training and test sets rationally?
> If the split is random, it seems that many splits are needed, with the
> average of the results taken as the final result.
> However, I want methods that perform the split and give a fixed
> training set and test set instead of a random split.
> The training set and test set should look like this: ideally, the division
> must be performed such that points representing both the training and test
> sets are distributed within the whole feature space occupied by the entire
> dataset, and each point of the test set is close to at least one point of
> the training set. This approach ensures that the similarity principle can
> be employed for the output prediction of the test set. Certainly, this
> condition cannot always be satisfied.
> So, in general, which algorithms are commonly used to split the data, and
> which are more rational? Papers often say they split the data set
> randomly. What does "randomly" mean here? Just random selection, or is
> there some well-defined method (e.g. ordering by output)? I really want
> to know which package can split the data.
> Also, if one wants to get better results, some "tips" can be used. E.g.
> one can select the test set again and again, use the test set with the
> best results as the final test set, and say that the test set was selected
> randomly. But that is not truly random; it is false.
> Thank you, and sorry for so many questions, but this has always puzzled me.
> Up to now, I have no good method to rationally split my data into a
> training set and a test set.
> Lastly, the split into training and test sets should be done before
> modeling, and it seems that this can be done from the features alone
> (SOM), from the features and the output (the SPXY algorithm; paper: "A
> method for calibration and validation subset partitioning"), or from the
> output alone (output ordering)?
> But often there are many features to be calculated, and some features are
> all zero or have a low standard deviation (sd < 0.5). Should we delete
> these features before splitting the whole data set, then use the
> remaining features to split the data, use only the training set to build
> the regression model, perform feature selection, and do cross-validation,
> and use the independent test set only to test the built model?
> Maybe my thinking about the whole modeling process is not clear, but I
> think it goes like this:
> 1) Get samples.
> 2) Calculate features.
> 3) Preprocess the calculated features (e.g. remove zero features).
> 4) Rationally split the data into training and test sets (this always
> puzzles me: how on earth should I split?).
> 5) Build the model and at the same time tune its parameters using
> resampling methods on the training set only, and get the final model.
> 6) Test the model performance using the independent test set (unseen
> samples).
> 7) Assess the model. Good or bad? Overfitting? (In general, what counts
> as overfitting? Can you give me an example? As far as I know, a model is
> overfitted when it fits the training set well but predicts the
> independent test set badly. But what is good, and what is bad? With
> r2=0.94 in the training set and r2=0.70 in the test set, is the model
> overfitted? Can the model be accepted? And in general, what kind of model
> is acceptable?)
> 8) Conclusion: how good is the model?
> Above is my thinking, and many questions are waiting for answers.
Kevin: I'm sorry I don't have time to deal with such a long note, but
briefly, data splitting is not a good idea no matter how you do it unless
N > perhaps 20,000. I suggest resampling, e.g., either the bootstrap
with 300 resamples or 50 repeats of 10-fold cross-validation.
Among other places, these are implemented in my rms package.
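
As a rough illustration of the two resampling schemes mentioned above, here
is a minimal base-R sketch. It is hand-rolled and is not the rms
implementation. The simulated data (208 samples, 20 features), the plain
lm() model, and the helper names rsq() and cv_once() are all assumptions
made for the example.

```r
# Simulated stand-in for the real data (assumption: 208 samples, 20 features).
set.seed(1)
n <- 208
p <- 20
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n)
dat <- data.frame(y = y, X)   # columns: y, X1, ..., X20

# Hypothetical helper: R^2 as 1 - SSE/SST (can be negative out of sample).
rsq <- function(obs, pred) {
  1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)
}

# Apparent R^2: model fitted and evaluated on the same data.
fit_all  <- lm(y ~ ., data = dat)
apparent <- rsq(dat$y, fitted(fit_all))

# Optimism bootstrap with 300 resamples: refit on each resample, then
# compare its R^2 on the resample with its R^2 on the original data.
B <- 300
optimism <- replicate(B, {
  idx   <- sample(n, replace = TRUE)
  boot  <- dat[idx, ]
  fit_b <- lm(y ~ ., data = boot)
  rsq(boot$y, fitted(fit_b)) - rsq(dat$y, predict(fit_b, newdata = dat))
})
corrected <- apparent - mean(optimism)

# Hypothetical helper: one pass of k-fold cross-validation, returning the
# R^2 of the pooled out-of-fold predictions.
cv_once <- function(k = 10) {
  fold <- sample(rep(seq_len(k), length.out = n))
  pred <- numeric(n)
  for (j in seq_len(k)) {
    test  <- fold == j
    fit_j <- lm(y ~ ., data = dat[!test, ])
    pred[test] <- predict(fit_j, newdata = dat[test, ])
  }
  rsq(dat$y, pred)
}

# 50 repeats of 10-fold cross-validation, averaged.
cv_rsq <- mean(replicate(50, cv_once()))

print(c(apparent = apparent, boot_corrected = corrected, cv = cv_rsq))
```

The gap between the apparent R^2 and the corrected values is one way to
judge how much overfitting there is. If feature selection or parameter
tuning is part of the modeling, it would need to be repeated inside every
resample and every fold for the estimates to be honest.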
Frank E Harrell Jr Professor and Chairman School of Medicine
Department of Biostatistics Vanderbilt University