[R] lasso based selection for mixed model

dave fournier otter at otter-rsch.com
Fri Nov 6 19:16:34 CET 2009


I have been working on lasso-type model selection
using AD Model Builder's random effects module.
I came across a few questions about this topic on the list,
so I thought this would be of interest.
The advantage of AD Model Builder is that it lets you formulate
general nonlinear random effects models, which I believe is
necessary for the lasso-type approach.  I think the approach would work
for any kind of RE model, such as say a negative binomial mixed model,
but let's keep it simple for now.


Let the fixed effect parameters be a_i.
The usual lasso penalty is then

        p * sum_i abs(a_i)

with a tunable penalty weight p.
Since AD Model Builder's random effects module can deal with arbitrary
nonlinear random effects models, I tried simply adding this penalty to
the negative log-likelihood.  However, by the time the penalty is large
enough to produce the lasso effect, the estimation procedure simply
makes all the a_i too small and uses the RE's instead to fit the data.
I figured what was needed was a penalty term that more strongly
encourages setting most of the terms to 0 while letting a few of them
be large.  A square root has this property, so I tried

             p * sum_i sqrt(abs(a_i))


Of course this is not differentiable at 0, so I smoothed it off a
bit there, but that is the general idea.
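In R notation, a minimal sketch of the two penalties might look like the
following (the smoothing constant eps is just one way to round off
abs() near zero; my actual implementation is in ADMB, not this code):

lasso_pen <- function(a, p, eps = 1e-4) p * sum(sqrt(a^2 + eps^2))       # ~ p * sum_i abs(a_i)
sqrt_pen  <- function(a, p, eps = 1e-4) p * sum(sqrt(sqrt(a^2 + eps^2))) # ~ p * sum_i sqrt(abs(a_i))

## per-coefficient shape: the sqrt(abs()) penalty climbs steeply near 0
## and then flattens, so it rewards setting coefficients to exactly 0
## while charging relatively little for letting a few of them be large
curve(sqrt(sqrt(x^2 + 1e-8)), -2, 2, ylab = "penalty per coefficient")
curve(sqrt(x^2 + 1e-8), -2, 2, add = TRUE, lty = 2)   # lasso, for comparison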
To test it I generated data of the form

       Y_ij = a(1)*X_1i + u_i + e_ij

for 1 <= i <= 500 and 1 <= j <= 5,

with a(1) = 2.0, std dev of u_i = 0.6, and std dev of e_ij = 1.5,

and where the IV X_1i is std normal.

For the other IV's I generated 9 more, X_ki for 2 <= k <= 10,

with   X_ki = X_1i + 0.1*V_ki

where the V_ki's are std normal.
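In R the simulated data might be generated like this (a sketch only;
the seed and object names are arbitrary, and my own data were generated
elsewhere):

set.seed(1)
n_i <- 500; n_j <- 5
X1 <- rnorm(n_i)                                    # std normal IV X_1i
V  <- matrix(rnorm(n_i * 9), n_i, 9)                # V_ki, std normal
X  <- cbind(X1, X1 + 0.1 * V)                       # X_ki = X_1i + 0.1*V_ki, k = 2..10
colnames(X) <- paste0("X", 1:10)
u  <- rnorm(n_i, sd = 0.6)                          # random intercepts u_i
e  <- matrix(rnorm(n_i * n_j, sd = 1.5), n_i, n_j)  # residuals e_ij
Y  <- 2.0 * X[, 1] + u + e                          # rows are groups i, columns are j
round(cor(X), 3)                                    # close to the matrix shown next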


The correlation matrix for them is

 1 0.995 0.995 0.995 0.995 0.995 0.995 0.995 0.995 0.995
 0.995 1 0.99 0.99 0.989 0.988 0.99 0.989 0.99 0.99
 0.995 0.99 1 0.99 0.989 0.99 0.99 0.99 0.989 0.99
 0.995 0.99 0.99 1 0.99 0.989 0.99 0.99 0.99 0.99
 0.995 0.989 0.989 0.99 1 0.989 0.99 0.989 0.99 0.99
 0.995 0.988 0.99 0.989 0.989 1 0.99 0.99 0.989 0.99
 0.995 0.99 0.99 0.99 0.99 0.99 1 0.991 0.99 0.991
 0.995 0.989 0.99 0.99 0.989 0.99 0.991 1 0.988 0.99
 0.995 0.99 0.989 0.99 0.99 0.989 0.99 0.988 1 0.99
 0.995 0.99 0.99 0.99 0.99 0.99 0.991 0.99 0.99 1

So these are highly correlated explanatory variables.
I then fit the model with no lasso penalty.

index   name   value      std dev
    1   a      3.1565e+00 1.2302e+00
    2   a      6.7390e-02 3.8737e-01
    3   a     -1.9154e-01 4.0204e-01
    4   a     -2.7920e-01 3.9847e-01
    5   a      3.0924e-02 3.9842e-01
    6   a      7.6508e-02 3.8698e-01
    7   a     -5.3902e-01 4.1590e-01
    8   a     -2.4941e-01 4.0141e-01
    9   a      2.5324e-01 3.8875e-01
   10   a     -2.9435e-01 4.1345e-01
   11   sig_u  8.2845e-01 3.0015e-02

So a(1) is rather badly estimated and its estimated std dev is large.
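For comparison, roughly the same unpenalized fit can be done in R with
lme4 (this is only a sketch, not the code I used, and it assumes the
simulated Y and X from above):

library(lme4)
dat <- data.frame(y  = as.vector(t(Y)),                        # stack the 5 obs per group
                  id = factor(rep(1:nrow(Y), each = ncol(Y))),
                  X[rep(1:nrow(X), each = ncol(Y)), ])
fit0 <- lmer(y ~ X1 + X2 + X3 + X4 + X5 + X6 + X7 + X8 + X9 + X10
             + (1 | id), data = dat)
summary(fit0)   # the X1 coefficient, i.e. a(1), should again be poorly determined

This reproduces only the unpenalized fit; the penalized fits below were
done in ADMB.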

One way to try to deal with this is ridge regression, which amounts to
adding the following quadratic penalty:

      p * sum_i square(a_i)

Of course we know that this will have exactly the opposite effect to
what we want: it tends to make all the estimated parameters nonzero.
With a penalty weight p = 1.0 I got the following estimates.

 index   name   value      std dev
     1   a      8.7729e-01 5.9032e-01
     2   a      2.9461e-01 3.2352e-01
     3   a      8.3218e-02 3.3546e-01
     4   a      8.3569e-03 3.3438e-01
     5   a      2.7345e-01 3.2984e-01
     6   a      2.5172e-01 3.2822e-01
     7   a     -1.9834e-01 3.4721e-01
     8   a      2.7677e-02 3.3404e-01
     9   a      4.0135e-01 3.2855e-01
    10   a      5.7042e-03 3.4325e-01
    11   sig_u  8.3188e-01 3.0163e-02

So the std devs are smaller, but the estimates are really bad.

Now the fun part. With a penalty weight p=10 and the
sqrt(abs()) penalty I got

index   name   value      std dev
    1   a      2.0191e+00 4.0619e-02
    2   a      6.7300e-06 1.2917e-03
    3   a      3.7718e-06 1.2915e-03
    4   a      3.0421e-06 1.2914e-03
    5   a      6.0954e-06 1.2917e-03
    6   a      5.9105e-06 1.2916e-03
    7   a      4.2511e-07 1.2914e-03
    8   a      2.4084e-06 1.2914e-03
    9   a      8.3686e-06 1.2919e-03
   10   a      2.7657e-06 1.2914e-03
   11   sig_u  8.3226e-01 3.0119e-02
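For anyone who would rather stay in R, the same kind of penalized fit
can be sketched directly: for this Gaussian random-intercept model the
random effects integrate out analytically, so one can add the smoothed
sqrt(abs()) penalty to the marginal negative log-likelihood and hand it
to optim().  This is only a sketch (the eps smoothing and optimizer
settings are arbitrary choices, and my fits were done in ADMB); it
assumes Y and X from the simulation above:

neg_pen_ll <- function(par, Y, X, p, eps = 1e-4) {
  k     <- ncol(X)
  a     <- par[1:k]           # fixed effects a_1..a_k
  sig_u <- exp(par[k + 1])    # random-intercept std dev (log scale keeps it positive)
  sig_e <- exp(par[k + 2])    # residual std dev
  m     <- ncol(Y)            # observations per group
  V     <- diag(sig_e^2, m) + matrix(sig_u^2, m, m)  # marginal cov of one group
  Vi    <- solve(V)
  ldV   <- as.numeric(determinant(V, logarithm = TRUE)$modulus)
  R     <- Y - as.vector(X %*% a)                    # group-wise residuals
  nll   <- 0.5 * (nrow(Y) * (ldV + m * log(2 * pi)) + sum((R %*% Vi) * R))
  nll + p * sum(sqrt(sqrt(a^2 + eps^2)))             # smoothed sqrt(abs()) penalty
}

fit <- optim(c(rep(0, ncol(X)), 0, 0), neg_pen_ll,
             Y = Y, X = X, p = 10, method = "BFGS",
             control = list(maxit = 2000))
round(fit$par[1:ncol(X)], 4)   # a_2..a_10 should be shrunk essentially to 0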


So this looks like a promising approach.
Of course the real test is whether it works for real data.
If anyone wants to cooperate on this, it could be fun.

   Dave



-- 
David A. Fournier
P.O. Box 2040,
Sidney, B.C. V8l 3S3
Canada
Phone/FAX 250-655-3364
http://otter-rsch.com



