[R] party for prediction [REPOST]
Ed
icelus2k5 at gmail.com
Mon Oct 15 00:43:16 CEST 2012
This was an exceptionally helpful answer, I can only thank you again.
I have plenty of avenues ahead where I was worried before I was
getting trapped in a dead end. If all else fails, the idea of using
anova is brilliant. Thank you!
Ed
On 14 October 2012 18:36, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
> On Sun, 14 Oct 2012, Ed wrote:
>
>> First up, thanks hugely for your response. I've been beating my head
>> against this!
>>
>> On 14 October 2012 16:51, Achim Zeileis <Achim.Zeileis at uibk.ac.at> wrote:
>>>
>>> I'm not sure what you mean by "integral vector". If you want to apply the
>>> approach to hundreds of thousands of observations, I gues that these are
>>> categorical (maybe even binary?) but maybe not...
>>
>>
>> I'm sorry I can't go into the details of the data, I would if I could.
>> z are categorical variables represented as integers, mostly ordered,
>> but not all. I've tried fitting them as integers, as well as ordered,
>> but O don't think it made a huge difference.
>
>
> The tests performed for categorical partitioning variables are rather
> different from the tests for numerical partitioning variables. If all of the
> variables are categorical, this may not be immediately obvious, but the
> factor coding should be more appropriate (especially if the number of levels
> is small or moderate).
>
>
>>> If I recall correctly, we kept linearModel as simple as we did to save as
>>> much time as possible. This can be particularly important when one of the
>>> partitioning variables has many possible splits and the linearModel has
>>> to
>>> be fitted thousands of times.
>>
>>
>> I can appreciate that, but maybe having an alternative linearModel which
>> will predict when the fit is degenerate would be worth including? I'm happy
>> to contribute what I have, although it's pretty obvious stuff (and probably
>> done suboptimally since I'm not much of an R coder at this point). For me at
>> least, even with huge datasets, the speed of party is quite good; it's
>> getting a better result that's the problem.
>
>
> As I explained in my last e-mail. In your situation this does not solve the
> problem completely because subsequent the tests are also not adapted to
> this. Setting the empirical estimating functions to zero for non-identified
> coefficients might alleviate the problem but is not really a clean solution.
>
>
>>> Also, mob() assesses the stability of all coefficients of the model in
>>> all
>>> nodes during partitioning. If any of the coefficients is not identified,
>>> this would have to be excluded from all subsequent parameter stability
>>> tests
>>> in that node (and its child nodes). This is currently not provided for in
>>> mob().
>>
>>
>> Would pretending the coefficients were fit at 0 fool mob into doing
>> something moderately meaningful here?
>
>
> The coefficients are not looked at during fitting, only the estfun(). This
> would have to be set to 0.
>
>
>> If not, I would try to hack the code, but I'm honestly at something of
>> a loss as to how to modify it and feed the results back into my
>> interpreter. I have bytecode installed; I downloaded the source, but I
>> haven't squared the circle of modifying the source and installing the
>> result. I will check out the docs on writing extensions you suggest.
>
>
> Writing (or modifying) R packages and installing them under Windows is
> pretty standard and well documented. The pointers I gave you should
> hopefully get you started.
>
>
>>>> The second problem I have is that I get "Cholesky not positive definite"
>>>> errors at some nodes. I guess this is because of numerical error and
>>>> degeneracy in the covariance matrix? Any thoughts on how to avoid having
>>>> this happen would be welcome; it is ignorable though for now.
>>>
>>>
>>> This comes from the parameter stability tests and might be a result of an
>>> unidentified (or close to unidentified) model fit.
>>
>>
>> This is a great help to know. I improved my results quite considerably
>> with aggressive scaling of everything (scaling the response and all
>> the predictors to lie between 0 an 1). That deepened my tree by a
>> factor of two or so (say depth 3 to 7) and improved the quality of fit
>> substantially. Is there any way I can engage a more numerically robust
>> Cholesky in mob?
>
>
> No, I don't think that this is conceivable with the way this is implemented
> at the moment. Instead of the currently implemented tests, one could in
> principle use the likelihood ratio test which is invariant against parameter
> transformations and doesn't need the covariance matrix.
>
>
>>> With hundreds of thousands of observations, you would need some
>>> additional
>>> pruning strategy anyway. Significance test-based splitting will probably
>>> overfit because tiny differences in the coefficients will be picked up at
>>> such large sample sizes.
>>
>>
>> I'm okay with overfitting, honestly. At the moment it is underfitting
>> by quite a large amount I think (the quality of the predictions on the
>> training set is not very high). The problem really is there is so much
>> going on the data, but the "noise" level is probably very low. I
>> wouldn't be surprised if my data was accurate to 5 or 6 s.f.
>
>
> OK. Maybe with so little noise some other splitting strategy (not based on
> significance tests) would be better?
>
>
>>> Furthermore, computationally the extensive search over all possible
>>> splits
>>> might be too burdensome with this many observations.
>>
>>
>> I have plenty of compute time/power and RAM, though R seems to be
>> running single threaded. But even on a few million observations, it's
>> still pretty fast and doesn't use more than 30 or so gig of memory. If
>> it takes a day and requires 150gig of RAM, that is absolutely fine,
>> even over that would be viable though less optimal.
>
>
> With this setup, you may consider writing your own partitioning algorithm
> using the same type of ideas as MOB. Instead of using the parameter
> stability tests, you could use plain likelihood ratio tests or ANOVAs to
> decide about splitting further.
>
> If d is the data in the current node, y and x are response/regressors, and
> f1 to fn are your categorical partitioning variables, you could do
>
> m0 <- lm(y ~ x, data = d)
> m1 <- lm(y ~ f1 * x, data = d)
> ...
> mn <- lm(y ~ fn * x, data = d)
>
> and then
>
> anova(m0, m1)
> ...
> anova(m0, mn)
>
> and then choose the lowest p-value for splitting. You could then also
> parallelize the computations in the daughter nodes.
>
> None of this is readily coded in party though...
>
>
>>> Hence, using some subsampling strategy might not be the worst thing.
>>
>>
>> I've tried this at various degrees, but the data is really very
>> complicated with not a lot of error. I'm trying to encourage party to
>> fit more closely, which I thought more data might encourage. At the
>> moment I'm a long way from a clean fit. I have subsampled at various
>> levels down to 1%, and although that increases the depth of the tree
>> and quality of fit, it still doesn't give a very good quality fit and
>> can encourage it to overlook obvious aspects of the training set.
>>
>>>> I guess what I really want to know is:
>>>> (a) has anyone else had this problem, and if so how did they overcome
>>>> it?
>>>
>>>
>>>
>>> We have had non-identified model fits in binary GLMs (with quasi-complete
>>> separation) where we then set estfun() to all zero so that partitioning
>>> stops. But I don't think that such a strategy helps here.
>>
>>
>> I've considered using rpart() to partition into cells of constant
>> gradient, then fitting linear models myself to the cells. This is my
>> next thought. I'm pretty sure partitioning over linear regression is
>> the way forward for the data we have. I tried mars and glm but there
>> are good reasons to think they're less reasonable, even though the fit
>> wasn't particularly poor. I'm not particularly wedded to party's
>> approach except that it looked like it immediately returned what we
>> needed, and with some degree of "optimality" into the bargain.
>
>
> Possibly boosting linear models may be another route to go.
>
>
>>>> (b) is there any way to get a line or stack trace out of a try()
>>>> without source modification?
>>>
>>>
>>> Not sure, I don't know any off the top off my head.
>>
>>
>> I guess I really will have to bite the bullet and try to figure out
>> how to install modified libraries. Thanks.
>
>
> I think that's easier compared to solving the conceptual problems (thinking
> about how to best partition degenerate linear models etc.).
> ;-)
>
> Good luck!
> Best wishes,
>
> Z
>
>>>> (c) failing all of that, does anyone know of an alternative to mob
>>>> that does the same thing; for better or worse I'm now committed to
>>>> recursive partitioning over linear models, as per mob?
>>>
>>>
>>> If your partitioning variables are particularly simple (e.g., all binary)
>>> you could exploit that and it may be easier to write a custom function
>>> for
>>> your particular data. Then likelihood-ratio tests (rather than LM-type
>>> tests) would also be easier to apply in case of unidentified parameters.
>>>
>>> But if there are partitioning variables with different measurement
>>> scales,
>>> then this will not be that simple...
>>
>>
>> Unfortunately each partitioning variable is essentially a state
>> indicator, taking values say 0,...,R where R is different for each
>> component. I'm not a stats expert either; I've spent some time with
>> the party manuals and papers, but I wouldn't be confident of
>> implementing something like it in the time available to me (though if
>> I have to I will, but that wouldn't be a good situation to be in).
>>
>>>> (d) failing all of this, does anyone have a link to a way to rebuild, or
>>>> locally modify, an R package (preferably windows, but anything would
>>>> do)?
>>>
>>>
>>>
>>> Have a look at the "Writing R Extensions" manual and the R for Windows
>>> FAQ.
>>
>>
>> Will do.
>>
>> Thank you very much for your responses, I really appreciate it.
>>
>> Best wishes,
>>
>> Ed
>>
>
More information about the R-help
mailing list