[Rd] application to mentor syrfr package development for Google Summer of Code 2010

James Salsman jsalsman at talknicer.com
Mon Mar 8 08:49:36 CET 2010


Chillu, I meant that development of a syrfr R package capable of
using either F statistics or parametric derivatives should proceed in
parallel with your work on such a derivatives package. You are right
that genetic algorithm search (and general best-first search --
http://en.wikipedia.org/wiki/Best-first_search -- of which genetic
algorithms are various special cases) can be very effectively
parallelized, too.
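
As a rough illustration of why the evaluation step parallelizes so
easily, here is a minimal sketch in Python (chosen only for brevity;
it is not part of any syrfr design). The candidate expressions, the
toy data, and the function names are all made up for this example.
Each candidate is scored independently, so the scoring can be handed
to a pool of workers; a thread pool is used here to keep the sketch
simple, and a process pool would be the usual choice for CPU-bound
fitness functions.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical candidate expressions, keyed by a readable name.
CANDIDATES = {
    "x": lambda x: x,
    "2*x + 1": lambda x: 2 * x + 1,
    "x**2": lambda x: x ** 2,
}

# Toy data generated from y = 2x + 1 (invented for this sketch).
XS = [0.0, 1.0, 2.0, 3.0, 4.0]
YS = [2 * x + 1 for x in XS]


def sum_squared_error(name):
    """Score one candidate; no candidate depends on any other."""
    f = CANDIDATES[name]
    return sum((f(x) - y) ** 2 for x, y in zip(XS, YS))


def evaluate_population(names, workers=4):
    """Evaluate every candidate concurrently and return name -> score."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(sum_squared_error, names))
    return dict(zip(names, scores))


scores = evaluate_population(list(CANDIDATES))
best = min(scores, key=scores.get)
print(best, scores[best])
```

The selection and reproduction steps would then run on the collected
scores before the next round of evaluation.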

In any case, thank you for pointing out Eureqa --
http://ccsl.mae.cornell.edu/eureqa -- but I can see no evidence there
or in the user manual or user forums that Eureqa considers
degrees of freedom in its goodness-of-fit estimation.  That is a
serious problem which will typically result in invalid symbolic
regression.  I am also sending this message to Michael Schmidt so that
he might be able to comment on the extent to which Eureqa adjusts for
degrees of freedom in its fit evaluations.
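
To make the concern concrete, here is a minimal sketch in Python
(used only for illustration) of two common degree-of-freedom
adjustments: the residual standard error and the adjusted R-squared.
The sample size, sums of squares, and parameter counts below are
invented, and p is assumed to count every fitted parameter including
the intercept. Nothing here describes what Eureqa actually computes.

```python
import math


def residual_standard_error(rss, n, p):
    """sqrt(RSS / (n - p)): residual spread per residual degree of freedom."""
    if n <= p:
        raise ValueError("need more observations than fitted parameters")
    return math.sqrt(rss / (n - p))


def adjusted_r_squared(rss, tss, n, p):
    """1 - (RSS / (n - p)) / (TSS / (n - 1)), with p including the intercept."""
    if n <= p:
        raise ValueError("need more observations than fitted parameters")
    return 1.0 - (rss / (n - p)) / (tss / (n - 1))


# Invented numbers: 10 observations, total sum of squares 100.
n, tss = 10, 100.0
rss_a, p_a = 20.0, 2   # simpler model
rss_b, p_b = 18.0, 6   # more parameters, slightly smaller RSS

raw_a = 1.0 - rss_a / tss
raw_b = 1.0 - rss_b / tss
adj_a = adjusted_r_squared(rss_a, tss, n, p_a)
adj_b = adjusted_r_squared(rss_b, tss, n, p_b)
rse_a = residual_standard_error(rss_a, n, p_a)
rse_b = residual_standard_error(rss_b, n, p_b)

print(raw_a, raw_b)   # the unadjusted fit favours the larger model
print(adj_a, adj_b)   # the adjusted fit favours the simpler one
print(rse_a, rse_b)
```

With these numbers the larger model wins on raw R-squared but loses on
both adjusted measures, which is the kind of reversal a search that
ignores degrees of freedom cannot see.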

Best regards,
James Salsman

On Sun, Mar 7, 2010 at 10:39 PM, Chidambaram Annamalai
<quantumelixir at gmail.com> wrote:
>
>> If I understand your concern, you want to lay the foundation for
>> derivatives so that you can implement the search strategies described
>> in Schmidt and Lipson (2010) --
>> http://www.springerlink.com/content/l79v2183725413w0/ -- is that
>> right?
>
> Yes. Basically traditional "naive" error estimators or fitness functions
> fail miserably when used in SR with implicit equations because they
> immediately close in on "best" fits like f(x) = x - x and other trivial
> solutions. In such cases no amount of regularization or complexity
> penalization will help, since x - x is fairly simple by most measures
> of complexity and it does have zero error. So the paper outlines such
> problems associated with "direct" error estimators and thus they infer the
> "triviality" of the fit by probing its estimates around nearby points and
> seeing if it does follow the pattern dictated by the data points -- ergo
> derivatives.
>
> As a side benefit, this method also enables us to perform
> regression on closed loops and other implicit equations, since the fitness
> functions are based only on derivatives. The specific form of the error is
> equation 1.2, which I believe makes up the internals of the
> evaluation procedure used in Eureqa.
>
> You are correct in pointing out that there is no reason to not work in
> parallel, since GAs generally have a more or less fixed form
> (evaluate-reproduce cycle) which is quite easily parallelized. I have used
> OpenMP in the past, in which it is fairly trivial to parallelize well-formed
> for loops.
>
> Chillu
>
>> It is not clear to me how well this generalized approach will
>> work in practice, but there is no reason not to proceed in parallel to
>> establish a framework under which you could implement the metrics
>> proposed by Schmidt and Lipson in the contemplated syrfr package.
>>
>> I have expanded the test I proposed with two more questions -- at
>> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
>> -- specifically:
>>
>> 5. Critique http://sites.google.com/site/gptips4matlab/
>>
>> 6. Use anova to compare the goodness-of-fit of a SSfpl nls fit with a
>> linear model of your choice. How can you characterize the
>> degree-of-freedom-adjusted goodness of fit of nonlinear models?
>>
>> I believe pairwise anova.nls is the optimal comparison for nonlinear
>> models, but there are several good choices for approximations,
>> including the residual standard error, which I believe can be adjusted
>> for degrees of freedom, as can the F statistic which TableCurve uses;
>> see: http://en.wikipedia.org/wiki/F-test#Regression_problems
>>
>> Best regards,
>> James Salsman
>>
>>
>> On Sun, Mar 7, 2010 at 7:35 PM, Chidambaram Annamalai
>> <quantumelixir at gmail.com> wrote:
>> > It's been a while since I proposed syrfr. I have been in constant contact
>> > with many people in the R community, but I wasn't able to find a mentor
>> > for the project. I later got interested in the Automatic Differentiation
>> > proposal (adinr) and, on consulting with a few others within the R
>> > community, I mailed John Nash (who proposed adinr in the first place) to
>> > ask whether he'd be willing to take me on for the project. I got a
>> > positive reply only a few hours ago, and it was my mistake not to have
>> > removed the syrfr proposal from the wiki in time, where it was still
>> > listed under proposals looking for mentors.
>> >
>> > While I appreciate your interest in the syrfr proposal, I am afraid my
>> > allegiances have shifted towards the adinr proposal, as I became convinced
>> > that it might interest a larger group of people and that it has wider
>> > scope in general.
>> >
>> > I apologize for having caused this trouble.
>> >
>> > Best Regards,
>> > Chillu
>> >
>> > On Mon, Mar 8, 2010 at 6:41 AM, James Salsman <jsalsman at talknicer.com>
>> > wrote:
>> >>
>> >> Per http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010
>> >> -- and
>> >>
>> >> http://rwiki.sciviews.org/doku.php?id=developers:projects:gsoc2010:syrfr
>> >> -- I am applying to mentor the "Symbolic Regression for R" (syrfr)
>> >> package for the Google Summer of Code 2010.
>> >>
>> >> I propose the following test which an applicant would have to pass in
>> >> order to qualify for the topic:
>> >>
>> >> 1. Describe each of the following terms as they relate to statistical
>> >> regression: categorical, periodic, modular, continuous, bimodal,
>> >> log-normal, logistic, Gompertz, and nonlinear.
>> >>
>> >> 2. Explain which parts of http://bit.ly/tablecurve were adopted in
>> >> SigmaPlot and which weren't.
>> >>
>> >> 3. Use the 'outliers' package to improve a regression fit while
>> >> maintaining correct extrapolation confidence intervals, which should
>> >> fall between those obtained with and without outlier exclusions, in
>> >> proportion to the confidence that the outliers were reasonably
>> >> excluded.  (Show your R transcript.)
>> >>
>> >> 4. Explain the relationship between degrees of freedom and correlated
>> >> independent variables.
>> >>
>> >> Best regards,
>> >>
>> >> James Salsman
>> >> jsalsman at talknicer.com
>> >> http://talknicer.com
>> >>
>> >> ______________________________________________
>> >> R-devel at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-devel
>> >
>> >
>
>


