[R] Fw: Variable selection based on both training and testing data

Jin Minming jminming at yahoo.com
Mon Jan 30 17:30:54 CET 2012


Dear Scott,

I am sorry; I think I just sent you an empty email.
Thanks a lot for your advice.

The problem is that we do not have sufficient prior knowledge of the form of the regression, or even of the appropriate inputs. We need to try to find some possible regression equations and then add our explanation to them, so we need to explore a lot of options. The two input datasets are very different in nature and come from two locations. Hence, one of them can be used for testing, although it may turn out that no appropriate regression exists because of the intrinsic difference between the two datasets.

In fact, if I could extract the models visited by the stepAIC function (not only the final model), it would be easy to add some simple scripts to calculate R2 or RMSE for both datasets.
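
One possible way to get at those intermediate models is the keep argument of stepAIC, a function that is called with the fitted model and its AIC at each step. The sketch below is only an illustration under stated assumptions: the data are simulated stand-ins for the two real datasets, and the names rmse, r2, make_data, store and record are hypothetical helpers, not part of MASS. It records only the model accepted at each step, not every candidate term that was considered and rejected.

```r
library(MASS)  # provides stepAIC

# Illustrative metrics (hypothetical helper names, not from any package).
rmse <- function(obs, pred) sqrt(mean((obs - pred)^2))
r2   <- function(obs, pred) 1 - sum((obs - pred)^2) / sum((obs - mean(obs))^2)

# Simulated stand-ins for the two real datasets (assumption).
set.seed(1)
make_data <- function(n) {
  d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
  d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
  d
}
train <- make_data(100)
test  <- make_data(100)

# Environment used to collect one row per model on the selection path.
store <- new.env()
store$path <- list()

# stepAIC calls this with the current model and its AIC at each step.
record <- function(model, aic) {
  pred_test <- predict(model, newdata = test)
  store$path[[length(store$path) + 1]] <- data.frame(
    formula    = paste(deparse(formula(model)), collapse = " "),
    aic        = aic,
    rmse_train = rmse(train$y, fitted(model)),
    rmse_test  = rmse(test$y, pred_test),
    r2_train   = r2(train$y, fitted(model)),
    r2_test    = r2(test$y, pred_test),
    stringsAsFactors = FALSE
  )
  list(aic = aic)  # non-NULL return value for stepAIC's own bookkeeping
}

full  <- lm(y ~ x1 + x2 + x3, data = train)
final <- stepAIC(full, direction = "backward", trace = 0, keep = record)

# One row per model on the path, starting with the full model.
results <- do.call(rbind, store$path)
print(results)
```

Choosing a model by its test-set RMSE or R2 from such a table would make the test data part of the selection process, so those figures would no longer be an independent check.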

Thanks,

Jim


--- On Mon, 30/1/12, SR Millis <aa3379 at wayne.edu> wrote:

> From: SR Millis <aa3379 at wayne.edu>
> Subject: [R] Fw: Variable selection based on both training and testing data
> To: "r-help at r-project.org" <r-help at r-project.org>
> Date: Monday, 30 January, 2012, 14:57
> 
> 
> From: SR Millis <srmillis at yahoo.com>
> To: Jin Minming <jminming at yahoo.com>
> 
> Sent: Monday, January 30, 2012 9:25 AM
> Subject: Re: [R] Variable selection based on both training
> and testing data
>  
> 
> Jim,
> 
> First, stepwise methods for variable selection should be
> avoided.  Frank Harrell (in Regression Modeling Strategies)
> discusses this at length.
> 
> Second, splitting a dataset into training and validation
> sets is generally not a good idea unless you have a really
> large sample, eg, > 20,000.  As Harrell has discussed,
> split-sample validation does not provide external
> validation, is terribly inefficient, and is arbitrary. 
> It's better to specify your model a priori and use the
> bootstrap to obtain an estimate of your model's
> over-optimism.  Bootstrapping can be implemented with
> Harrell's rms package in R.
> 
> Scott
>  
> ~~~~~~~~~~~
> Scott R Millis, PhD, ABPP, CStat, PStat®
> Professor
> Wayne State University School of Medicine
> Email:  aa3379 at wayne.edu
> Email:  srmillis at yahoo.com
> Tel: 313-993-8085
> 
> 
> ________________________________
> 
> To: r-help at r-project.org
> 
> Sent: Monday, January 30, 2012 8:14 AM
> Subject: [R] Variable selection based on both training and
> testing data
> 
> Dear all,
> 
> The variable selection in regression is usually determined
> by the training data using AIC or F value, such as stepAIC.
> Is there some R package that can consider both the training
> and test dataset? For example, I have two separate training
> data and test data. Firstly, a regression model is obtained
> by using training data, and then this model is tested by
> using test data. This process continues in order to find
> some possible optimal models in terms of RMSE or R2 for both
> training and test data. 
> 
> Thanks,
> 
> Jim
> 
> ______________________________________________
> R-help at r-project.org
> mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained,
>  reproducible code.
> 
> 
>


