[R] Regression with many independent variables

Greg Snow Greg.Snow at imail.org
Tue Mar 1 21:56:27 CET 2011


You can use ^2 to get all main effects plus all 2-way interactions, and ^3 to add all 3-way interactions as well, e.g.:

lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
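
To see what ^2 actually generates, one option is to expand the formula with terms() and look at the term labels. The sketch below uses iris only as a stand-in data set. The names f, labs, inter_only, f_inter and fit_inter are made up for illustration. Filtering on ":" and rebuilding the formula with reformulate() is just one possible way to keep only the pairwise interactions, not necessarily the best one for a very wide data set.

```r
# Expand the formula against iris to see which terms ^2 generates.
f <- Sepal.Width ~ (. - Sepal.Length)^2
labs <- attr(terms(f, data = iris), "term.labels")
print(labs)  # main effects first, then the pairwise interactions

# Keep only the pairwise interaction terms (labels containing ":").
inter_only <- labs[grepl(":", labs, fixed = TRUE)]

# Rebuild a formula from those labels and fit it.
f_inter <- reformulate(inter_only, response = "Sepal.Width")
fit_inter <- lm(f_inter, data = iris)
```

With a weights column in the data frame (such as Poss in the original question), that column would presumably also need to be removed from "." with "-", otherwise it ends up on the right hand side as a predictor.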

The lm.fit function is what actually does the fitting, so you could go directly there, but then you lose the benefits of using . and ^.  The Matrix package has ways of dealing with sparse matrices, but I don't know if that would help here or not.
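
As a rough sketch of going to lm.fit directly: the design matrix can still be built from a formula with model.matrix(), so . and ^ remain usable for that step. The example again uses iris as a stand-in, and the names X, y, fit_direct and fit_lm are made up for illustration.

```r
# Build the design matrix from the formula, then fit without lm()'s overhead.
X <- model.matrix(Sepal.Width ~ (. - Sepal.Length)^2, data = iris)
y <- iris$Sepal.Width

fit_direct <- lm.fit(X, y)   # returns a plain list, not an "lm" object

# The coefficients should agree with the ordinary lm() fit.
fit_lm <- lm(Sepal.Width ~ (. - Sepal.Length)^2, data = iris)
same <- all.equal(unname(fit_direct$coefficients), unname(coef(fit_lm)))
print(same)
```

For a weighted fit, lm.wfit(X, y, w) is the weighted counterpart. Note that the result of lm.fit is a bare list, so summary() and the other lm methods are not available on it.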

You could also just create the x'x and x'y matrices directly, since the variables are 0/1, and then use solve.  A lot depends on what you are doing and what questions you are trying to answer.
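
A minimal sketch of the x'x and x'y approach, including weights, might look like the following. The data are simulated purely for illustration (they are not the adj0708 data), and all names (X, Xi, w, y, XtWX, XtWy, beta, fit) are made up. It assumes the weighted cross-product matrix is nonsingular; with tens of thousands of sparse interaction columns, some columns may be all zero, in which case solve() would fail and those columns would have to be dropped first.

```r
set.seed(1)
n <- 200

# Simulated predictors taking values -1, 0, 1, plus positive weights.
X <- matrix(sample(c(-1, 0, 1), n * 5, replace = TRUE), nrow = n,
            dimnames = list(NULL, paste0("P", 1:5)))
w <- sample(1:30, n, replace = TRUE)
y <- drop(X %*% c(2, -1, 0.5, 0, 3)) + rnorm(n)

# Add an intercept column, then form the weighted normal equations.
Xi <- cbind(Intercept = 1, X)
XtWX <- crossprod(Xi, w * Xi)   # x'Wx: w * Xi scales row i by w[i]
XtWy <- crossprod(Xi, w * y)    # x'Wy
beta <- drop(solve(XtWX, XtWy))

# Compare against the weighted lm() fit on the same data.
fit <- lm(y ~ X, weights = w)
same <- all.equal(unname(beta), unname(coef(fit)))
print(same)
```

Solving the normal equations is generally less numerically stable than the QR decomposition that lm uses, so it seems worth checking the result against lm on a subset of the data before relying on it.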

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
> Sent: Tuesday, March 01, 2011 1:09 PM
> To: Greg Snow
> Cc: r-help at r-project.org
> Subject: Re: [R] Regression with many independent variables
> 
> Hi Greg,
> 
> Thanks for the help, it works perfectly. To answer your question,
> there are 339 independent variables but only 10 will be used at one
> time. So at any given line of the data set there will be 10 nonzero
> entries for the independent variables and the rest will be zeros.
> 
> One more question:
> 
> 1. I still want to find a way to look at the interactions of the
> independent variables.
> 
> the regression would look like this:
> 
> y = b12*X1X2 + b23*X2X3 +...+ bk-1k*Xk-1Xk
> 
> so I think the regression in R would look like this:
> 
> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708),
> 
> my problem is that since I have technically 339 independent variables,
> when I do this regression I would have 339 Choose 2 = approx 57000
> independent variables (a vast majority will be 0s though) so I don't
> want to have to write all of these out. Is there a way to do this
> quickly in R?
> 
> Also just a curious question that I can't seem to find an answer to online:
> is there a more efficient model other than lm() that is better for
> very sparse data sets like mine?
> 
> Thanks,
> Matt
> 
> 
> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> > Don't put the name of the dataset in the formula, use the data
> > argument to lm to provide that.  A single period (".") on the right
> > hand side of the formula will represent all the columns in the data set
> > that are not on the left hand side (you can then use "-" to remove any
> > other columns that you don't want included on the RHS).
> >
> > For example:
> >
> >> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
> >
> > Call:
> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
> >
> > Coefficients:
> >      (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
> >           3.0485             0.1547             0.6234            -1.7641
> >  Speciesvirginica
> >          -2.1964
> >
> >
> > But, are you sure that a regression model with 339 predictors will be
> > meaningful?
> >
> > --
> > Gregory (Greg) L. Snow Ph.D.
> > Statistical Data Center
> > Intermountain Healthcare
> > greg.snow at imail.org
> > 801.408.8111
> >
> >
> >> -----Original Message-----
> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Matthew Douglas
> >> Sent: Monday, February 28, 2011 1:32 PM
> >> To: r-help at r-project.org
> >> Subject: [R] Regression with many independent variables
> >>
> >> Hi,
> >>
> >> I am trying to use lm() on some data; the code works fine but I would
> >> like to use a more efficient way to do this.
> >>
> >> The data looks like this (the data is very sparse with a few 1s, -1s
> >> and the rest 0s):
> >>
> >> > head(adj0708)
> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337....
> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0
> >>
> >> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
> >> an independent variable, the dependent variable is "MARGIN" and it is
> >> weighted by "Poss".
> >>
> >>
> >> The regression is below:
> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
> >> adj0708$P605 + adj0708$P337 + .... +
> >> adj0708$P510,weights=adj0708$Poss)
> >>
> >> I have two questions:
> >>
> >> 1. Is there a way to condense how I write the independent variables
> >> in the lm(), instead of having such a long line of code (I have 339
> >> independent variables to be exact)?
> >> 2. I would like to pair the data to look at a regression of the
> >> interactions between two independent variables. I think it would look
> >> something like this....
> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
> >> but there will be 339 Choose 2 combinations, so a lot of independent
> >> variables! Is there a more efficient way of writing this code? Is
> >> there a way I can do this?
> >>
> >> Thanks,
> >> Matt
> >>
> >> ______________________________________________
> >> R-help at r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-help
> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> >> and provide commented, minimal, self-contained, reproducible code.
> >


