[R] Regression with many independent variables

Thu Mar 3 22:08:46 CET 2011

Thanks Greg,

That formula was exactly what I was looking for. Except now when I
run it on my data I get the following error:

"Error in model.matrix.default(mt, mf, contrasts) : cannot allocate
vector of length 2043479998"

I know there are probably many 2-way interactions that are zero, so I
thought I could save space by removing these. Is there some way that
I can just delete all the two-way interactions that are zero and keep
the columns that have non-zero entries? I think that will
significantly cut down the memory needed. Or is there just another way
to get around this?

thanks,
Matt
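
One possible way to avoid ever creating the all-zero interaction columns is to find which pairs of predictors are non-zero together in at least one row, and build products only for those pairs. The sketch below is an illustration, not something from this thread: the toy matrix X stands in for the predictor columns of adj0708, and the names nz, co, pairs, Xs and Z are made up for the example. It assumes the Matrix package (shipped with R as a recommended package).

```r
library(Matrix)  # recommended package; provides sparse matrix classes

# Hypothetical toy stand-in for the predictor columns of adj0708:
# 6 rows, 5 predictors coded -1/0/1
X <- matrix(c( 1, -1,  0,  0,  0,
               0,  1, -1,  0,  0,
               1,  0, -1,  0,  0,
               0,  0,  0,  1, -1,
               1, -1,  0,  0,  0,
               0,  1,  0, -1,  0),
            nrow = 6, byrow = TRUE,
            dimnames = list(NULL, paste0("P", 1:5)))

# co[i, j] = number of rows in which predictors i and j are both non-zero
nz <- (X != 0) * 1
co <- unname(crossprod(nz))

# Keep only the pairs (i < j) that co-occur in at least one row
pairs <- which(upper.tri(co) & co > 0, arr.ind = TRUE)
i <- unname(pairs[, "row"])
j <- unname(pairs[, "col"])

# Build the interaction columns as a sparse matrix, one column per kept pair
Xs <- Matrix(unname(X), sparse = TRUE)
Z  <- Xs[, i, drop = FALSE] * Xs[, j, drop = FALSE]
colnames(Z) <- paste(colnames(X)[i], colnames(X)[j], sep = ":")

ncol(Z)          # 5 columns kept
choose(5, 2)     # out of 10 possible pairs
```

On the real data the co-occurrence matrix is only 339 x 339, so this step is cheap; whether the resulting Z is small enough to fit depends on how many pairs actually co-occur, which the thread does not say.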

On Tue, Mar 1, 2011 at 3:56 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> You can use ^2 to get all 2 way interactions and ^3 to get all 3 way interactions, e.g.:
>
> lm(Sepal.Width ~ (. - Sepal.Length)^2, data=iris)
>
> The lm.fit function is what actually does the fitting, so you could go directly there, but then you lose the benefits of using . and ^. The Matrix package has ways of dealing with sparse matrices, but I don't know if that would help here or not.
>
> You could also just create x'x and x'y matrices directly since the variables are 0/1, then use solve. A lot depends on what you are doing and what questions you are trying to answer.
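
A minimal sketch of the x'x / x'y idea, extended with weights since the regression in this thread is weighted by Poss. Everything here is an assumption for illustration: the data are simulated, and the names X1, XtWX, XtWy and b are made up. It solves the weighted normal equations (X'WX) b = X'Wy in base R and compares the result with lm().

```r
set.seed(42)
n <- 200

# Hypothetical toy data: three sparse -1/0/1 predictors, a response, weights
X <- matrix(sample(c(-1, 0, 1), n * 3, replace = TRUE, prob = c(0.1, 0.8, 0.1)),
            nrow = n, dimnames = list(NULL, c("P1", "P2", "P3")))
y <- drop(X %*% c(2, -1, 0.5)) + rnorm(n)
w <- sample(1:30, n, replace = TRUE)   # plays the role of Poss

# Add an intercept column, then form X'WX and X'Wy.
# w * X1 recycles w down each column, so row k is scaled by w[k].
X1   <- cbind(`(Intercept)` = 1, X)
XtWX <- crossprod(X1, w * X1)
XtWy <- crossprod(X1, w * y)

# Solve the normal equations for the coefficients
b <- drop(solve(XtWX, XtWy))

# Cross-check against lm() on the same data
fit <- lm(y ~ X, weights = w)
all.equal(unname(b), unname(coef(fit)))
```

The appeal is that X'WX has only as many rows and columns as there are predictors, regardless of the number of observations. The trade-off is that solve() stops with an error if the matrix is singular, whereas lm() drops aliased columns and reports NA coefficients.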
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: Matthew Douglas [mailto:matt.douglas01 at gmail.com]
>> Sent: Tuesday, March 01, 2011 1:09 PM
>> To: Greg Snow
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Regression with many independent variables
>>
>> Hi Greg,
>>
>> Thanks for the help, it works perfectly. To answer your question,
>> there are 339 independent variables but only 10 will be used at one
>> time. So at any given line of the data set there will be 10 non-zero
>> entries for the independent variables and the rest will be zeros.
>>
>> One more question:
>>
>> 1. I still want to find a way to look at the interactions of the
>> independent variables.
>>
>> the regression would look like this:
>>
>> y = b12*X1*X2 + b23*X2*X3 + ... + b(k-1)k*X(k-1)*Xk
>>
>> so I think the regression in R would look like this:
>>
>> lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708)
>>
>> my problem is that since I have technically 339 independent variables,
>> when I do this regression I would have 339 Choose 2 = approx 57000
>> independent variables (a vast majority will be 0s though) so I don't
>> want to have to write all of these out. Is there a way to do this
>> quickly in R?
>>
>> Also, just a curious question that I can't seem to find an answer to online:
>> is there a more efficient model other than lm() that is better for
>> very sparse data sets like mine?
>>
>> Thanks,
>> Matt
>>
>>
>> On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org> wrote:
>> > Don't put the name of the dataset in the formula, use the data
>> > argument to lm to provide that.  A single period (".") on the right
>> > hand side of the formula will represent all the columns in the data set
>> > that are not on the left hand side (you can then use "-" to remove any
>> > other columns that you don't want included on the RHS).
>> >
>> > For example:
>> >
>> >> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
>> >
>> > Call:
>> > lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
>> >
>> > Coefficients:
>> >       (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
>> >            3.0485             0.1547             0.6234            -1.7641
>> >  Speciesvirginica
>> >           -2.1964
>> >
>> >
>> > But, are you sure that a regression model with 339 predictors will be
>> > meaningful?
>> >
>> > --
>> > Gregory (Greg) L. Snow Ph.D.
>> > Statistical Data Center
>> > Intermountain Healthcare
>> > greg.snow at imail.org
>> > 801.408.8111
>> >
>> >
>> >> -----Original Message-----
>> >> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> >> On Behalf Of Matthew Douglas
>> >> Sent: Monday, February 28, 2011 1:32 PM
>> >> To: r-help at r-project.org
>> >> Subject: [R] Regression with many independent variables
>> >>
>> >> Hi,
>> >>
>> >> I am trying to use lm() on some data. The code works fine, but I
>> >> would like a more efficient way to do this.
>> >>
>> >> The data looks like this (the data is very sparse with a few 1s, -1s
>> >> and the rest 0s):
>> >>
>> >> > head(adj0708)
>> >>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
>> >> 1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
>> >> 2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
>> >> 3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
>> >> 4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
>> >> 5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
>> >> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0
>> >>
>> >> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
>> >> an independent variable, the dependent variable is "MARGIN", and it is
>> >> weighted by "Poss".
>> >>
>> >>
>> >> The regression is below:
>> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
>> >> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
>> >> adj0708$P605 + adj0708$P337 + .... +
>> >> adj0708$P510,weights=adj0708$Poss)
>> >>
>> >> I have two questions:
>> >>
>> >> 1. Is there a way to condense how I write the independent variables
>> >> in the lm(), instead of having such a long line of code (I have 339
>> >> independent variables to be exact)?
>> >> 2. I would like to pair the data to look at a regression of the
>> >> interactions between two independent variables. I think it would look
>> >> something like this....
>> >> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
>> >> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
>> >> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
>> >> but there will be 339 Choose 2 combinations, so a lot of independent
>> >> variables! Is there a more efficient way of writing this code? Is
>> >> there a way I can do this?
>> >>
>> >> Thanks,
>> >> Matt
>> >>
>> >> ______________________________________________
>> >> R-help at r-project.org mailing list
>> >> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> >> and provide commented, minimal, self-contained, reproducible code.
>> >
>