[R] scientific (statistical) foundation for Y-RANDOMIZATION in regression analysis

Mon Mar 8 15:44:38 CET 2010

That sounds like a particular form of permutation test.  If the
"scrambling" is replaced by sampling with replacement (i.e., some data
points can be sampled more than once while others can be left out),
that's the simple (or nonparametric) bootstrap.  The goal is to generate
the distribution of the statistic of interest (R^2 or q^2) under the
null hypothesis that there's no relationship between the activity (or
property) and the structure.
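
To illustrate the distinction between the two resampling schemes, here is
a minimal sketch in Python with NumPy. The activity values in `y` are made
up for illustration and are not from any real data set.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical activity values, for illustration only.
y = np.array([1.2, 3.4, 2.2, 5.1, 4.0])

# Permutation ("scrambling"): every value is kept exactly once,
# only the pairing with the rows of the descriptor matrix changes.
y_perm = rng.permutation(y)

# Simple (nonparametric) bootstrap: draw n values with replacement,
# so some values can repeat while others are left out.
y_boot = rng.choice(y, size=y.size, replace=True)
```

In both cases the descriptor matrix is left untouched; only the response
vector is resampled.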

To make the "test" valid, one needs to ensure that the entire
model-building process, including any feature selection, is rerun from
scratch on each of the resampled data sets.
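
A rough sketch of what this means in practice, again in Python with NumPy.
Everything here is an assumption made for illustration: the function names
`select_and_fit_r2` and `y_randomization` are hypothetical, the "model
building process" is reduced to picking the `k` descriptors most correlated
with y followed by ordinary least squares, and plain R^2 is used as the
statistic (a leave-one-out q^2 could be substituted). The key point is that
the selection step sits inside the function that is called for every
permuted y.

```python
import numpy as np

def select_and_fit_r2(X, y, k):
    """Hypothetical pipeline: select k descriptors, fit OLS, return R^2.

    Assumes no column of X is constant (otherwise the correlation is undefined).
    """
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    # Feature selection: absolute Pearson correlation of each column with y.
    corr = np.abs(Xc.T @ yc) / (np.linalg.norm(Xc, axis=0) * np.linalg.norm(yc))
    keep = np.argsort(corr)[-k:]
    # Ordinary least squares with an intercept on the selected columns.
    A = np.column_stack([np.ones(len(y)), X[:, keep]])
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    return 1.0 - (resid @ resid) / (yc @ yc)

def y_randomization(X, y, k, n_perm=200, seed=0):
    """Permutation test: rerun the whole pipeline on each shuffled y."""
    rng = np.random.default_rng(seed)
    r2_obs = select_and_fit_r2(X, y, k)
    r2_null = np.array(
        [select_and_fit_r2(X, rng.permutation(y), k) for _ in range(n_perm)]
    )
    # Permutation p-value with the usual +1 correction.
    p_value = (1 + np.sum(r2_null >= r2_obs)) / (n_perm + 1)
    return r2_obs, r2_null, p_value
```

If the descriptors were instead selected once on the real y and only the
final fit repeated on the shuffled y, the null R^2 values would not reflect
the optimism introduced by the selection step, and the comparison would be
too favourable to the original model.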

Andy

From: Damjan Krstajic
> 
> Dear all,
> 
> I am a statistician doing research in QSAR, building 
> regression models where the dependent variable is a numerical 
> expression of some chemical activity and input variables are 
> chemical descriptors, e.g. molecular weight, number of carbon 
> atoms, etc.
> 
> I am building regression models and I am confronted with a 
> widely used technique called Y-RANDOMIZATION for which I have 
> difficulties in finding references in general statistical 
> literature regarding regression analysis. I would be grateful 
> if someone could point me to papers/literature in statistical 
> regression analysis which give scientific (statistical) 
> foundation for using Y-RANDOMIZATION.
> 
> Y-RANDOMIZATION is a widely used technique in the QSAR community 
> to ensure the robustness of a QSPR (regression) model. It is 
> used after the "best" regression model is selected, to 
> make sure that there are no chance correlations. Here is a 
> short description. The dependent variable vector (Y-vector) 
> is randomly shuffled and a new QSPR (regression) model is 
> fitted using the original independent variable matrix. By 
> repeating this a number of times, say 100 times, one will get 
> a hundred R2 and q2 (leave-one-out cross-validation R2) values 
> based on a hundred shuffled Y vectors. It is expected that the 
> resulting regression models should generally have low R2 and low 
> q2 values. However, if the majority of the hundred regression 
> models obtained in the Y-randomization have relatively high R2 
> and high q2, then it implies that an acceptable regression model 
> cannot be obtained for the given data set by the current 
> modelling method.
> 
> I cannot find any references to Y-randomization or 
> Y-scrambling anywhere in the literature outside 
> chemometrics/QSAR. Any links or references would be much appreciated.
> 
> Thanks in advance.
> 
> DK
> ----------------------------------------------
> Damjan Krstajic
> Director
> Research Centre for Cheminformatics
> Belgrade, Serbia
> 
> ----------------------------------------------
> 