[R] party for prediction [REPOST]

Sun Oct 14 17:51:24 CEST 2012

Ed:

> I'm experiencing some problems using the party package (specifically 
> mob) for prediction. I have a real scalar y I want to predict from a 
> real valued vector x and an integral vector z. mob seemed the ideal 
> choice from the documentation.

I'm not sure what you mean by "integral vector". If you want to apply the 
approach to hundreds of thousands of observations, I guess that these are 
integer-valued and effectively categorical (maybe even binary?) but maybe 
not...

> The first problem I had was at some nodes in a partitioning tree, the 
> components of x may be extremely highly correlated or effectively 
> constant (that is x are not independent for all choices of components of 
> z). When the resulting fit is fed into predict() the result is NA - this 
> is not the same behaviour as models returned by say lm which ignore 
> missing coefficients. I have fixed this by defining my own statsModel 
> (myLinearModel - imaginative) which also ignores such coefficients when 
> predicting.

If I recall correctly, we kept linearModel as simple as we did to save as 
much time as possible. This can be particularly important when one of the 
partitioning variables has many possible splits and the linearModel has to 
be fitted thousands of times.

Also, mob() assesses the stability of all coefficients of the model in all 
nodes during partitioning. If any of the coefficients is not identified, 
it would have to be excluded from all subsequent parameter stability 
tests in that node (and its child nodes). This is currently not provided 
for in mob().
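
For illustration, here is a minimal base R sketch of the workaround Ed 
describes: skipping unidentified (NA) coefficients at prediction time. It 
uses plain lm() on made-up collinear data, not the statsModel interface 
that mob() expects; the data and variable names are assumptions for the 
example only.

```r
# Simulated data where x2 is an exact multiple of x1, so the model
# y ~ x1 + x2 is rank deficient and one coefficient is not identified.
set.seed(42)
n <- 50
d <- data.frame(x1 = rnorm(n))
d$x2 <- 2 * d$x1
d$y  <- 1 + 3 * d$x1 + rnorm(n, sd = 0.1)

fit <- lm(y ~ x1 + x2, data = d)
b   <- coef(fit)          # the coefficient of x2 is NA

# Predict "by hand", using only the identified coefficients.
ok <- !is.na(b)
X  <- model.matrix(~ x1 + x2, data = d)
pred_manual <- drop(X[, ok, drop = FALSE] %*% b[ok])

# predict.lm() does the same thing (possibly with a warning about the
# rank-deficient fit, suppressed here).
pred_lm <- suppressWarnings(predict(fit, newdata = d))
```

A custom model for mob() would need the same kind of masking in its 
predict method, and the stability tests would still see the unidentified 
coefficient, which is the limitation described above.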

> The second problem I have is that I get "Cholesky not positive definite" 
> errors at some nodes. I guess this is because of numerical error and 
> degeneracy in the covariance matrix? Any thoughts on how to avoid having 
> this happen would be welcome; it is ignorable though for now.

This comes from the parameter stability tests and might be a result of an 
unidentified (or close to unidentified) model fit.

> The third and really big problem I have is that when I apply mob to
> large datasets (say hundreds of thousands of elements) I get a
> "logical subscript too long" error inside mob_fit_fluctests. It's
> caught in a try(), and mob just gives up and treats the node as
> terminal. This is really hurting me though; with 1% of my data I can
> get a good fit and a worthwhile tree, but with the whole dataset I get
> a very stunted tree with a pretty useless prediction ability.

With hundreds of thousands of observations, you would need some additional 
pruning strategy anyway. Significance test-based splitting will probably 
overfit because tiny differences in the coefficients will be picked up at 
such large sample sizes.
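
A small simulation can make this concrete. The sketch below uses only 
base R and made-up data: two groups whose slopes differ by a practically 
negligible amount (0.05, an arbitrary choice). The function name 
slope_diff_p is hypothetical.

```r
# p-value for a difference in slopes between two equally sized groups.
slope_diff_p <- function(n, delta = 0.05) {
  g <- rep(c(0, 1), length.out = n)      # group indicator
  x <- rnorm(n)
  y <- 1 + (1 + delta * g) * x + rnorm(n)
  fit <- lm(y ~ x * g)
  summary(fit)$coefficients["x:g", "Pr(>|t|)"]
}

set.seed(123)
p_full <- slope_diff_p(200000)   # sample size similar to the full data
p_sub  <- slope_diff_p(2000)     # roughly a 1% subsample
```

At n = 200000 the standard error of the slope difference is about 0.0045, 
so even a difference of 0.05 is detected with near certainty, whereas at 
n = 2000 the same difference is usually well within the noise.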

Furthermore, the exhaustive search over all possible splits might be 
computationally too burdensome with this many observations.

Hence, using some subsampling strategy might not be the worst thing.
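
One simple way to do this, sketched in base R on simulated data: draw a 
random subsample for growing the tree and keep the rest for checking 
predictions. The 1% fraction mirrors Ed's description; the data frame and 
its columns are assumptions for the example, and the mob() call is shown 
only as a comment because it needs the party package.

```r
set.seed(2012)
n <- 200000
d <- data.frame(x = rnorm(n), z = sample(0:3, n, replace = TRUE))
d$y <- 1 + d$z * d$x + rnorm(n)

# Draw a 1% random subsample of row indices without replacement.
frac <- 0.01
idx  <- sample.int(nrow(d), size = ceiling(frac * nrow(d)))

train   <- d[idx, ]    # used to grow the tree
holdout <- d[-idx, ]   # used to assess predictions

# The tree would then be fitted on the subsample only, e.g.:
#   library(party)
#   fit  <- mob(y ~ x | z, data = train, model = linearModel)
#   pred <- predict(fit, newdata = holdout)
```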

> I guess what I really want to know is:
> (a) has anyone else had this problem, and if so how did they overcome it?

We have had non-identified model fits in binary GLMs (with quasi-complete 
separation) where we then set estfun() to all zero so that partitioning 
stops. But I don't think that such a strategy helps here.

> (b) is there any way to get a line or stack trace out of a try()
> without source modification?

Not sure, I don't know any off the top of my head.
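
For reference, a commonly used base R pattern for recording the call 
stack of an error that try() swallows is a calling handler placed inside 
the try(). This is only a sketch with hypothetical functions f and g, and 
note the limitation: the handler has to sit inside the try(), so for a 
try() buried in package code it does not by itself avoid touching that 
code.

```r
f <- function(x) g(x)
g <- function(x) stop("boom")

call_stack <- character(0)

res <- try(
  withCallingHandlers(
    f(1),
    error = function(e) {
      # Runs before the stack is unwound, so sys.calls() still
      # contains the frames that led to the error.
      call_stack <<- unlist(lapply(
        sys.calls(),
        function(cl) paste(deparse(cl), collapse = " ")
      ))
    }
  ),
  silent = TRUE
)

# call_stack now holds the deparsed calls, including f(1) and g(x).
```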

> (c) failing all of that, does anyone know of an alternative to mob
> that does the same thing; for better or worse I'm now committed to
> recursive partitioning over linear models, as per mob?

If your partitioning variables are particularly simple (e.g., all binary) 
you could exploit that and it may be easier to write a custom function for 
your particular data. Then likelihood-ratio tests (rather than LM-type 
tests) would also be easier to apply in case of unidentified parameters.
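
As a rough sketch of what such a custom function could test for a single 
binary partitioning variable, the following base R code compares a pooled 
linear model with one that has separate coefficients in the two groups, 
using a likelihood-ratio test. The data are simulated, and taking the 
degrees of freedom from the fitted ranks (so that unidentified 
coefficients are not counted) is an assumption of this example, not 
something mob() does.

```r
set.seed(1)
n <- 500
d <- data.frame(x = rnorm(n), z = rbinom(n, 1, 0.5))
d$y <- 1 + ifelse(d$z == 1, 2, 0.5) * d$x + rnorm(n)

pooled    <- lm(y ~ x, data = d)              # no split
separated <- lm(y ~ x * factor(z), data = d)  # own intercept and slope per group

# Likelihood-ratio statistic for "split on z" vs. "no split".
lr <- as.numeric(2 * (logLik(separated) - logLik(pooled)))

# Degrees of freedom = number of additional identified coefficients.
df <- separated$rank - pooled$rank
p  <- pchisq(lr, df = df, lower.tail = FALSE)
```

Applied recursively, the split with the smallest p-value (after a 
multiplicity adjustment) would be chosen in each node.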

But if there are partitioning variables with different measurement scales, 
then this will not be that simple...

> (d) failing all of this, does anyone have a link to a way to rebuild, or 
> locally modify, an R package (preferably windows, but anything would 
> do)?

Have a look at the "Writing R Extensions" manual and the R for Windows 
FAQ.

Best,
Z

> Sorry for the length of this post. If I should RTFM, please point me
> at any relevant manual by all means. I've spent a few days on this as
> you can maybe tell, but I'm far from being an R expert.
>
> Thanks for any help you can give.
>
> Best wishes,
>
> Ed