[R] Diag "Hat" matrix

Thu Jun 7 18:19:51 CEST 2001

Kenneth Cabrera <krcabrer at perseus.unalmed.edu.co> writes:

> What is the difference between in the computation of the diag of the
> "hat" matrix in:
> "lm.influence" and the matrix operations with "solve()" and "t()"?

> I mean, this is my X matrix

>         x1    x2    x3    x4     x5
>  [1,] 0.297 0.310 0.290 0.220 0.1560
>  [2,] 0.360 0.390 0.369 0.297 0.2050
>  [3,] 0.075 0.058 0.047 0.034 0.0230
>  [4,] 0.114 0.100 0.081 0.058 0.0420
>  [5,] 0.229 0.213 0.198 0.142 0.0102
>  [6,] 0.315 0.304 0.267 0.202 0.1470
>  [7,] 0.477 0.518 0.496 0.395 0.2850
>  [8,] 0.072 0.063 0.047 0.036 0.0240
>  [9,] 0.099 0.092 0.074 0.056 0.0038
> [10,] 0.420 0.452 0.425 0.332 0.2350
> [11,] 0.189 0.178 0.153 0.107 0.0760
> [12,] 0.369 0.391 0.364 0.286 0.2000
> [13,] 0.142 0.124 0.105 0.077 0.0560
> [14,] 0.094 0.087 0.072 0.049 0.0320
> [15,] 0.171 0.161 0.145 0.094 0.0680
> [16,] 0.378 0.420 0.380 0.281 0.2000

If your model formula is y~x1+x2+x3+x4+x5 then the model matrix would
have a column of 1's followed by what you wrote.  Try defining X as
 X <- model.matrix(y~x1+x2+x3+x4+x5, data = <whatever>)
and re-doing your calculation.

In general, please note that

> diag(X%*%solve(t(X)%*%X)%*%t(X))

is a *very* bad way of calculating the influences.  Whenever you find
yourself writing something like solve(t(X)%*%X) you're doing the
numerical linear algebra badly.  The rule of thumb in numerical
linear algebra is that if the best way you can think of doing a
calculation involves taking the inverse of a matrix then you need to
think more about the calculation.

The way that lm.influence calculates the influences is through an
orthogonal-triangular decomposition which will be both faster and more
accurate than what you have written.

On today's machines "faster" doesn't matter that much for small or
moderate sized data sets.  On very large data sets faster still does
matter. "Accurate" should always be a concern.

> I obtain:
>  [1] 0.15248181 0.27102872 0.11476375 0.12941386 0.90455886 0.32246292
>  [7] 0.43858581 0.16533854 0.37415984 0.19100227 0.17023090 0.15125134
> [13] 0.17855019 0.06023773 0.52137996 0.85455350
> 
> But when I use the lm.influence() function
> lm.influence(mt)$hat
> I obtain:
>  [1] 0.1735989 0.2999146 0.2334095 0.1455117 0.9216644 0.7553856
> 0.4486403
>  [8] 0.2755802 0.4188349 0.1914242 0.1790093 0.1573939 0.1787553
> 0.1975511
> [15] 0.5664988 0.8568274
> mt is a model of the type y~x1+x2+x3+x4+x5, where y is:
> y
> [1]  17  17  35  69  69 173 173  17  17  73  17  35  69  35  35  52
> 
> As you see the differences are no too small.
> Where is the problem? Is only a numerical stability problem?
-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-help mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-help-request at stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._