[R] lokern package

Ravi Varadhan RVaradhan at jhmi.edu
Thu Jul 2 19:00:03 CEST 2009


Dear Martin,

Thank you for the prompt and highly informative reply!

In my experiments, I compared (a) smooth.spline() (with both GCV and CV for
automatic bandwidth selection), (b) spm() in "SemiPar", which implements
penalized spline smoothing with REML estimation of the penalty, (c) glkerns()
in "lokern", which implements the Gasser-Mueller kernel methodology, and (d)
locpoly() in "KernSmooth" with the direct plug-in method of Ruppert and Wand
to estimate the bandwidth.  All of these smoothers also provide derivative
estimation.

The key point you identified as the reason why glkerns() cannot be made more
efficient, without making it inferior, is that the bandwidth that is "optimal"
for function estimation is not necessarily optimal for derivative estimation.
This is explicitly accounted for in the derivative estimation in glkerns().
Other smoothers, however, do not recognize this, i.e. they use the same
bandwidth (and the same function estimate, of course) that is optimal only for
function estimation to estimate derivatives.  In fact, this is one of the main
points of my work.  Not surprisingly, glkerns() does a nice job of estimating
f, f', and f'', whereas the other methods, with the exception of spm(), get
progressively worse with increasing order of derivative (I use RMISE - root
mean integrated squared error - to judge the quality of the smoothers).  A
slight plug - I will be presenting these results at useR! 2009 next week.

I knew that glkerns(x, y, deriv=1) is a completely independent estimation from
glkerns(x, y, deriv=0).  What I was really wondering was whether it might be
possible to have a call such as glkerns(x, y, deriv=c(0, 1, 2), x.out=x.out).
This would avoid some of the overhead associated with setting up repeated
calls in R and Fortran.  While the savings may not be huge for smoothing a
single function, they could be substantial for large numbers of time series.
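
To make the idea concrete, here is a rough sketch of the kind of wrapper I
have in mind.  glkerns.multi() is a hypothetical helper, not part of "lokern";
it simply loops over the derivative orders using the usual glkerns() arguments:

    ## Hypothetical wrapper, *not* part of "lokern": emulate a call like
    ## glkerns(x, y, deriv = c(0, 1, 2)) by looping over derivative orders.
    ## Each fit remains fully independent (own korder, own bandwidth).
    library(lokern)

    glkerns.multi <- function(x, y, derivs = 0:2, x.out = x) {
        fits <- lapply(derivs, function(d)
                       glkerns(x, y, deriv = d, x.out = x.out))
        names(fits) <- paste("deriv", derivs, sep = "")
        fits
    }

    ## toy example
    set.seed(1)
    x <- sort(runif(200))
    y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)
    fits <- glkerns.multi(x, y)
    str(fits$deriv1$est)   # estimated first derivative at x.out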

Best regards,
Ravi.


--------------------------------------------------------------------------------

Ravi Varadhan, Ph.D.

Assistant Professor, The Center on Aging and Health

Division of Geriatric Medicine and Gerontology 

Johns Hopkins University

Ph: (410) 502-2619

Fax: (410) 614-9625

Email: rvaradhan at jhmi.edu

Webpage: http://www.jhsph.edu/agingandhealth/People/Faculty_personal_pages/Varadhan.html



--------------------------------------------------------------------------------


-----Original Message-----
From: Martin Maechler [mailto:maechler at stat.math.ethz.ch] 
Sent: Thursday, July 02, 2009 12:35 PM
To: Ravi Varadhan
Cc: r-help at r-project.org; Martin Maechler
Subject: Re: [R] lokern package

>>>>> "RV" == Ravi Varadhan <RVaradhan at jhmi.edu>
>>>>>     on Thu, 2 Jul 2009 10:51:03 -0400 writes:

    RV> Dear Martin, I have been playing a lot with the
    RV> glkerns() function in the "lokern" package for
    RV> "automatic" smoothing of time-series data.  This kernel
    RV> smoothing approach of Gasser and Mueller seems to
    RV> perform quite well for estimating the function and its
    RV> derivatives (first and second derivatives).  In fact,
    RV> this is one of the best methods based on my simulation
    RV> studies for comparing a number of "automatic" bandwidth
    RV> selection methods.  

That's quite interesting! 
What methods did you use in your comparison?
[I assume you did use  smooth.spline() as well]

    RV> I am interested in applying this to
    RV> automatic smoothing and feature extraction for a "large"
    RV> number (thousands) of time series, with hundreds of
    RV> points per time series.  This is where I am interested
    RV> in seeing if the efficiency of "glkerns" can be
    RV> improved.

    RV> Here follows my specific question:

    RV> You have to call glkerns() separately each time to
    RV> compute the function and its derivatives, ie. if I want
    RV> the function and its first 2 derivatives, I have to make
    RV> 3 calls to glkerns().  

Yes.  Note however that each of these calls chooses a different kernel order
('korder') by default { korder <- deriv + 2 }, and uses *different*
automatic bandwidths depending on both deriv and korder.

Consequently, the result of glkerns(x, y, deriv=1) is *not* the derivative of
the estimate glkerns(x, y, deriv=0), but rather an independent estimate of the
derivative of the unknown g().  The same applies for deriv=2 etc.
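
A tiny illustration of this (a sketch; it only assumes the '$bandwidth'
component that glkerns() returns):

    ## two independent fits; by default korder = deriv + 2, and each
    ## call selects its own global plug-in bandwidth
    library(lokern)
    set.seed(4)
    x <- sort(runif(250))
    y <- cos(3 * x) + rnorm(250, sd = 0.1)

    f0 <- glkerns(x, y, deriv = 0)   # korder = 2 by default
    f1 <- glkerns(x, y, deriv = 1)   # korder = 3 by default
    c(h0 = f0$bandwidth, h1 = f1$bandwidth)   # two different bandwidths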

But the problem is even "worse": Let's assume you get the "optimal" global
bandwidth estimate 'bandwidth' = h  from the deriv=0 case.
Then you could set this bandwidth explicitly for the deriv=1 and
deriv=2 calls {and you'd gain quite a bit of execution time !}.
*HOWEVER*, as the internal codes absolutely require

    k_{ord} - nu  ==  korder - nu  to be an *even* number,

you can *not* keep 'korder' fixed together with 'bandwidth'.
But if you change  'korder', the kernels change and the meaning of the
bandwidth has changed too.
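
In code, that workaround looks roughly like this (a sketch; it assumes the
fixed global bandwidth is passed via the 'bandwidth' and 'inputb' arguments):

    ## fix the deriv = 0 bandwidth for the derivative fits;
    ## korder must keep  korder - deriv  even, so it stays at its default
    library(lokern)
    set.seed(2)
    x <- sort(runif(300))
    y <- exp(-x) * sin(4 * x) + rnorm(300, sd = 0.05)

    h0 <- glkerns(x, y, deriv = 0)$bandwidth                       # optimal for f
    f1 <- glkerns(x, y, deriv = 1, bandwidth = h0, inputb = TRUE)  # korder = 3
    f2 <- glkerns(x, y, deriv = 2, bandwidth = h0, inputb = TRUE)  # korder = 4
    ## korder = 2 together with deriv = 1 would violate the parity rule,
    ## and with korder = 3 the fixed h0 no longer "means" the same thing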

I have just uploaded version 1.0-8 of package 'lokern' to CRAN, and this now
contains a demo("gkl-derivs") which defines a utility function to play a bit
with this (keeping bandwidth fixed).

    RV> This seems to me to be inefficient, especially for large
    RV> time series.  In smooth.spline(), for example, you can
    RV> call it once to get the fit, and then use the fitted
    RV> object to compute the derivatives using predict().  

Note that smooth.spline() is very different:
If you use deriv=1 or deriv=2 in the predict() call, you get exactly the 1st
or 2nd derivative of the same smoothing spline function.
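
E.g., as a minimal illustration:

    ## one spline fit; predict() differentiates that same spline exactly
    set.seed(3)
    x <- sort(runif(200))
    y <- sin(2 * pi * x) + rnorm(200, sd = 0.1)

    fit <- smooth.spline(x, y)           # GCV smoothing parameter by default
    f0  <- predict(fit, x)               # fitted values
    f1  <- predict(fit, x, deriv = 1)    # 1st derivative of the *same* fit
    f2  <- predict(fit, x, deriv = 2)    # 2nd derivative of the *same* fit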

    RV> Is it possible to have a similar feature in glkerns?

As I have explained, this is currently not easily possible, more for
conceptual reasons than anything else.

In principle it should be possible, but that would need both some
"theoretical" work {maybe just collecting the papers and formulae used} and
then some Fortran code digging and shuffling.

Would be a nice "semester work" for a student...
Martin

--
Martin <Maechler at stat.math.ethz.ch>  http://stat.ethz.ch/people/maechler
Seminar für Statistik, ETH Zürich  HG G 16      Rämistrasse 101
CH-8092 Zurich, SWITZERLAND
phone: +41-44-632-3408       fax: ...-1228      <><


    RV> Thanks for any suggestions.

    RV> Ravi.






