# [R] Root mean square on binned GAM results

David Winsemius dwinsemius at comcast.net
Sat Jun 19 05:38:20 CEST 2010

```On Jun 18, 2010, at 11:08 PM, David Winsemius wrote:

>
> On Jun 18, 2010, at 10:38 PM, David Jarvis wrote:
>
>> Hi, David.
>>
>> accurately reflect how closely the model (GAM) fits the data. I was
>> told
>>
>> This was my presumption; I could be mistaken.
>>
>> that the accuracy of the correlation can be improved using a root
>> mean
>> square deviation (RMSD) calculation on binned data.
>>
>> By whom? ...  and with what theoretical basis?
>>
>> I talked with Christian Schunn. He mentioned that using RMSD would
>> produce a better result for goodness-of-fit (if that term is not
>> synonymous with correlation, I apologise -- I'm still rather new to
>> this level of statistics):
>>
>> http://www.lrdc.pitt.edu/schunn/gof/index.html
>>
>> It was regarding a chart similar to:
>>
>> http://i.imgur.com/X0gxV.png
>>
>> In the chart, the calculation for Pearson's, Spearman's, and
>> Kendall's Tau provide, in my opinion, an incorrect indicator as to
>> the strength of GAM's fit to the data. I could be wrong here, too.
>>
>> His suggestion was to use bin the means (in groups of 5 or so) to
>> reduce the noise.
>>
>> I doubt that your strategy offers any statistical advantage, but if
>> you want to play around with it then consider:
>>
>> binned.x <- round( (x + 2.5)/5)
>>
>> > d <-
>> c
>> (1,3,5,4,3,6,3,1,5,7,8,9,4,3,2,7,3,6,8,9,5,3,1,4,5,8,9,3,3,2,5,7,8,8,5,4,3,2,6,4,3,1,4,5,6,8,9,0,7,7,5,4,3,3,2,1,3,4,5,6,7,9,0,2,4,3,3
>> )
>> > binned.d <- round( (d + 2.5)/5)
>> > print(binned.d)
>> [1] 1 1 2 1 1 2 1 1 2 2 2 2 1 1 1 2 1 2 2 2 2 1 1 1 2 2 2 1 1 1 2 2
>> 2 2 2 1 1 1
>> [39] 2 1 1 1 1 2 2 2 2 0 2 2 2 1 1 1 1 1 1 1 2 2 2 2 0 1 1 1 1
>>
>> That doesn't make sense to me.
>
> Then I blame your powers of exposition. Without some sort of
> explicit example the parsing of English is very prone to error. If
> you want to pick out elements of x in some pre-specified order in
> groups of five then consider:
>
> > x <- 1:100
> >
> > rep(1:20, each=5)
>  [1]  1  1  1  1  1  2  2  2  2  2  3  3  3  3  3  4  4  4  4  4  5
> 5  5
> [24]  5  5  6  6  6  6  6  7  7  7  7  7  8  8  8  8  8  9  9  9  9
> 9 10
> [47] 10 10 10 10 11 11 11 11 11 12 12 12 12 12 13 13 13 13 13 14 14
> 14 14
> [70] 14 15 15 15 15 15 16 16 16 16 16 17 17 17 17 17 18 18 18 18 18
> 19 19
> [93] 19 19 19 20 20 20 20 20
> > tapply(x, rep(1:20, each=5), mean)
> 1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20   # this
> row is just indices
> 3  8 13 18 23 28 33 38 43 48 53 58 63 68 73 78 83 88 93 98  # this
> row is the means
>
> If you wanted them in random groups of roughly 5, then you could use
> sample(x, prob=rep(5/n, n/5))
>
>
>> My impression was that I should try to put every 5 values in a bin,
>> average that bin, then calculate the RMSD between the observed
>> values and the values from GAM. In other words (o is observed and m
>> is model):
>
> Do you intend that m[n] would be the predicted value from a model?
> How are you forming the groups of 5? Are they ordered? If so ordered
> by observed of by predicted? (In R a "model" is a complex list
> structure, but may in some cases have a simple predicted value for
> each case. Again a specific example might work wonders.

Looking a bit more at that web page and its linked article and Excel
spreadsheet, it appears you are hoping to construct calibration plots.
It appears that the observations are sorted and then binned by
observed predictor (rather than predicted) values. You then compare
the summed GOF statistic on averages of the predicted and observed
means for some "model" within those bins ... the nature of which is
not at all clear to my eyes at this point. Sounds like a calibration
analysis. I think R packages can offer more sophisticated methods. But
you can at any rate use the methods I offered to bin your cases sorted
on the predictor values.

>
> --
> David.
>
>>
>>  bins <- 5
>>
>>  while( length(o) %% bins != 0 ) {
>>    o <- o[-length(o)]
>>  }
>>  omean <- apply( matrix(o, bins), 2, mean )
>>
>>  while( length(m) %% bins!= 0 ) {
>>    m <- m[-length(m)]
>>  }
>>  mmean <- apply( matrix(m, bins), 2, mean )
>>
>>  sqrt( mean( omean - mmean ) ^ 2 )
>>
>> But that feels sloppy, error prone, and fragile.
>>
>> Joris mentioned that I could try using tapply with
>> cut(d,round(length(d)/5)). I couldn't figure out how to get the
>> means back from the factors.
>>
>> Dave
>>
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT

```