[R] Root mean square on binned GAM results

David Winsemius dwinsemius at comcast.net
Sat Jun 19 16:17:33 CEST 2010

```
I have replied offlist to Mr. Jarvis with my reasons for not
should feel free to step in if they are so inclined.

--
David.

On Jun 19, 2010, at 12:44 AM, David Jarvis wrote:

> Hi, David.
>
> Let me start at the beginning. Between the years (y) 1900 to 2009 I
> have some observed temperature readings (o). For example:
>
> y <- seq(1900, 2009)
> o <- runif(110, 9, 15)
>
> So the ordering is fixed: y and o are a time series (shown in the
> linked image below). I then calculate a naïve, non-parameterised
> model (m) of the data using GAM, as follows:
>
> m <- data.frame( y, fitted( gam( o ~ s(y) ) ) )
>
> The values from m are then actually plotted as the trend line
> depicted at:
>
> http://i.imgur.com/X0gxV.png
>
> What I am trying to do now is to calculate how accurately GAM fits
> the data. The suggestion I was given was to use RMSE on the observed
> data versus the model data. It was also suggested that I use mean
> bins, with each bin containing 5 values, to reduce the amount of
> error in the calculation. Algorithmically, I pictured it as:
> 	1. Let index = 1
> 	2. Let size = 5
> 	3. Let o = vector of observed data
> 	4. Let ob = empty vector
> 	5. Append mean( o[index:(index+size-1)] ) into ob
> 	6. Let index = index + size
> 	7. Repeat from Step 5 until no more elements in o
> At this point, ob would contain the average of: the first five
> values, the second five values, and so on. Thus length( ob ) =
> ceiling( length( o ) / 5 ), with a shorter final bin when length( o )
> is not a multiple of 5.
>
> I would then repeat the same calculation on m to get mb, the model's
> bins.
>
> With those averages, I could use ob and mb to calculate the
> normalised root mean square deviation:
>
> nrmse <- sqrt( mean( ( ob - mb ) ^ 2 ) ) / (max( ob ) - min( ob ))
>
> Then turn that into a percentage:
>
> 100 * ( 1 - nrmse )
>
> At that point I was hoping I could say that, in general, the result
> indicates how closely the model fits the data. The closer to 100%,
> the more accurate the trend line.
>
> As you can tell, I have very little experience in statistics and R
> so any feedback, suggestions, or general guidance would be greatly
> appreciated.
>
> Dave
>
> P.S.
> The years, the type of weather data, and the locations where the
> measurements were taken can all be selected by users when they
> generate the report. So sometimes the data will have 110 years,
> inclusive, other times it could be 37 years (thus 37 data points).
> So choosing to average 5 elements per bin is a bit arbitrary... I am
> looking to get something working first before tweaking the possible
> parameters for the calculation.
>
> Thanks again!

David Winsemius, MD
West Hartford, CT

```
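
The binning and normalised RMSE steps described in the quoted message could be sketched in R roughly as follows. This is only an illustration, not code from the thread: the helper names bin_means() and nrmse() are hypothetical, the message does not say which package supplies gam() (mgcv is assumed here), and the observations are the same random placeholder values used in the message.

```r
# Hypothetical helper: mean of consecutive bins of `size` values.
# The last bin is simply shorter when length(v) is not a multiple of `size`.
bin_means <- function(v, size = 5) {
  bins <- ceiling(seq_along(v) / size)   # 1,1,1,1,1,2,2,2,2,2,...
  as.vector(tapply(v, bins, mean))
}

# Hypothetical helper: RMSE divided by the range of the observed values.
# Note that the squaring happens inside mean(), not after it.
nrmse <- function(obs, fit) {
  sqrt(mean((obs - fit)^2)) / (max(obs) - min(obs))
}

set.seed(1)                    # only to make the placeholder data repeatable
y <- seq(1900, 2009)           # years
o <- runif(110, 9, 15)         # placeholder "observed" temperatures

# Assumption: gam() comes from mgcv; the original message does not say.
if (requireNamespace("mgcv", quietly = TRUE)) {
  fit <- mgcv::gam(o ~ s(y))   # smooth of observations against year
  m   <- as.vector(fitted(fit))

  ob <- bin_means(o)           # binned observations
  mb <- bin_means(m)           # binned model values

  score <- 100 * (1 - nrmse(ob, mb))
  print(score)
}
```

Because the bin size is an argument, the same helpers work for a 37-point series, e.g. bin_means(o[1:37], size = 5) returns eight means, the last one averaging only two values.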