[R] Leave One Out Cross Validation

David Winsemius dwinsemius at comcast.net
Sat Jun 20 22:14:24 CEST 2009


You don't seem to be making any corrections or updating your code.  
There remains a syntax error in the last line of cvhfunc because of  
mismatched parens.

On Jun 20, 2009, at 1:04 PM, muddz wrote:

>
> Hi David,
>
> Thanks and I apologize for the lack of clarity.
>
> #n is defined as the length of xdat
> n<-length(xdat)

It would be better to include that at the top of the function body.
>
> #I defined 'k' as the Gaussian kernel function
> k<-function(v) {1/sqrt(2*pi)*exp(-v^2/2)} #GAUSSIAN kernel
>
> #I believe ypred, in my case, was the leave-one-out estimator (I think
> it's the variable x, in other words xdat, the independent variable).

It does not make sense to me that ypred or yhat would be an estimator  
of x.

> I could be wrong, but from the text of Davidson and MacKinnon, that's
> what I got out of it.

I don't have that text. The Google hits suggest it might be
"Econometric Theory and Methods". Amazon has one of those "look inside"
options. Searching on CV(h) suggests you are trying to replicate the
kernel regression on p. 686.
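
If that is the target, the criterion there is presumably the usual
leave-one-out score

   CV(h) = (1/n) * sum_{i=1..n} ( y[i] - yhat[-i](x[i]) )^2

where yhat[-i] is the kernel (local constant / Nadaraya-Watson) fit
computed with observation i left out.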
>
> #I totally hear you on the fact that ypred will just equal y[i], however
> it should not be the case, because for each x[i] I should be running over
> all x[j] except for when x[i]=x[j]. And I should be multiplying in the
> numerator of ypred: y[i]*k((x[j]-x[i])/h). However, it's not doing that.

Not doing what? You need to decide what is wrong.

(y[i]*k((x[j]-x[i])/h))/k((x[j]-x[i])/h)  = y[i] * 1  # by definition,
unless k((x[j]-x[i])/h) underflows to zero, in which case 0/0 gives NaN

... which is what the vector that your current function produces contains
after fixing some syntax. Some debugging shows that the first time through,
with i=1 and j=2, the result of the calculation is NaN, which means that
ypred will always be NaN. Wrapping it in a loop does not fix the error in
numerical logic or the failure to check arguments. So you still need to
decide how to trap situations where x[j] = x[i] or where the kernel weight
underflows. And you need to re-read D&M and implement the weights that
prevent using terms when k((x[j]-x[i])/h) is very small.

(There is only a single summation index in that CV(h) formula. So  
maybe you could just run predict(lm(y ~ . , data=x[-i]))  instead  of  
that god-awful nested loop.)
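
For what it's worth, a minimal sketch of what a working version might look
like (the name cvh.loo and the handling of underflowed weights are my own
invention here, so check the details against D&M):

k <- function(v) exp(-v^2/2)/sqrt(2*pi)        # Gaussian kernel, as above

cvh.loo <- function(y, x, h) {
    n <- length(x)
    yhat <- numeric(n)
    for (i in 1:n) {
        w <- k((x[-i] - x[i])/h)               # weights from all j != i
        if (sum(w) == 0) {                     # trap underflowed weights
            yhat[i] <- NA
        } else {
            yhat[i] <- sum(w * y[-i])/sum(w)   # local constant fit at x[i]
        }
    }
    mean((y - yhat)^2, na.rm = TRUE)           # CV(h): one number, not a vector
}

# and then minimize over h, e.g.
# optimize(function(h) cvh.loo(ydat, xdat, h), interval = c(0.5, 20))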

>
> #I believe CV should be the following: It is used to determine optimal
> bandwidth. Steps:

If CV is a scalar then the last line could not possibly give you what  
you want:

 > (1 - 1:10)   #scalar minus vector gives vector.
  [1]  0 -1 -2 -3 -4 -5 -6 -7 -8 -9

Don't you want a sum?
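
Concretely, assuming ypred held the full vector of leave-one-out fits,
the last lines could be something like

  cvh <- sum((y - ypred)^2)/n   # one number; equivalently mean((y - ypred)^2)
  cvh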

> (1) The process involves running over all x[j] for each x[i], except for
> when x[j]=x[i]. This is similar to drawing a histogram and deciding how
> large/small the bins should be. ypred should be a vector of n x 1.


> (2) The second step is similar to a variance measure, where we take the
> difference of y and (1), square it, and then sum over all n.

You are going to need to define "second step".
>
> # I was not sure about this: "ypred

That line looks useless to me at the moment. ypred is whatever it is  
at that point, and it won't return a value from the function because  
you later evaluate other expressions. Functions only return the last  
evaluated expression or what is wrapped in return().
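
A two-line illustration (the function f here is just for the example):

  f <- function(x) { x^2; x + 1 }
  f(3)    # returns 4, the last evaluated expression; the x^2 is discarded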


> # not sure what that will do inside
> the function. If
> it's there for debugging you may want to print(ypred)".
>
> Thanks.
> -Mudasir
>
>
>
>
> David Winsemius wrote:
>>
>>
>> On Jun 19, 2009, at 7:45 PM, muddz wrote:
>>
>>>
>>> Hi Uwe,
>>>
>>> My apologies.
>>>
>>> Please if I can be guided what I am doing wrong in the code. I
>>> started my
>>> code as such:
>>>
>>> # ypred is my leave one out estimator of x
>>
>> Estimator of x? Really?
>>
>>> cvhfunc<-function(y,x,h)    {
>>>    ypred<-0
>>>    for (i in 1:n){   #how did you define "n"? It's not in your
>>> parameter list
>>>       for (j in 1:n){
>>>       if (j!=i){
>>>            ypred<-ypred +    (y[i]*k( (x[j]-x[i])/h ) )/
>>                                       k( (x[j]-x[i])/h )
>>
>> At this point you are also using "k" as a function. Have you at any
>> point defined "k"?
>>
>> Also multiplication of the ratio of two identical numbers would
>> generally give a result of y[i] for that second term. Unless, of
>> course, any x[j] = x[i] in which case you will throw an error and
>> everything will grind to a halt.
>>
>> It might help if you now explained what you think "CV" should be.
>>
>>>                }
>>>                  }   }
>>>    ypred     # not sure what that will do inside the function. If
>>> it's there for debugging you may want to print(ypred)
>>>
>>> #CVh is a
>>
>> # Yes? CVh is a ... what?
>>
>>>     cvh<-0
>>>     cvh<-cvh+((1/n)*(y-ypred)^2    # n again. R will still not know
>>> what that is.
>>>     cvh
>>
>> # ypred is a scalar, while y is a vector, so cvh will be a vector. Is
>> that what you want?
>>
>>>                             }
>>
>>> test2<-cvhfunc(ydat,xdat,.2);test2
>>>
>>> #I was experimenting with the following data:
>>> library(datasets)
>>> data(faithful)
>>> ydat<-faithful$eruptions;ydat;plot(ydat);par(new=T)
>>> xdat<-faithful$waiting;xdat;plot(xdat,col="blue")
>>>
>>> # I want to minimize the CV function with respect to h. Thanks.
>>>
>>>
>>>
>>>
>>> Uwe Ligges-3 wrote:
>>>>
>>>> See the posting guide:
>>>> If you provide commented, minimal, self-contained, reproducible  
>>>> code
>>>> some people may be willing to help on the list.
>>>>
>>>> Best,
>>>> Uwe Ligges
>>>>
>>>>
>>>> muddz wrote:
>>>>> Hi All,
>>>>>
>>>>> I have been trying to get this LOO-Cross Validation method to work
>>>>> on R
>>>>> for
>>>>> the past 3 weeks but have had no luck. I am hoping someone would
>>>>> kindly
>>>>> help
>>>>> me.
>>>>>
>>>>> Essentially I want to code the LOO-Cross Validation for the 'Local
>>>>> Constant'
>>>>> and 'Local Linear' constant estimators. I want to find optimal h,
>>>>> bandwidth.
>>>>>
>>>>> Thank you very much!
>>>>> -M
>>>>>
>>>>>
>>>>
>>>
>>>
>>
>> David Winsemius, MD
>> Heritage Laboratories
>> West Hartford, CT
>>
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT



