[R] Regression with many independent variables

Matthew Douglas matt.douglas01 at gmail.com
Tue Mar 1 21:09:01 CET 2011


Hi Greg,

Thanks for the help, it works perfectly. To answer your question,
there are 339 independent variables, but only 10 are used at any one
time. So on any given row of the data set, 10 of the independent
variables are nonzero and the rest are zeros.

One more question:

1. I still want to find a way to look at the interactions of the
independent variables.

The regression would look like this:

y = b12*X1*X2 + b23*X2*X3 + ... + b(k-1)k*X(k-1)*Xk

So I think the regression in R would look like this:

lm(MARGIN ~ P235:P236 + P236:P237 + ...., weights = Poss, data = adj0708)

My problem is that since I technically have 339 independent variables,
this regression would have 339 choose 2 = 57,291 interaction terms
(the vast majority of them will be 0s, though), so I don't want to
have to write all of these out. Is there a way to do this
quickly in R?
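
One possible way to avoid typing the terms, sketched on a small made-up
data frame: build the "A:B" labels with combn() and turn them into a
formula with reformulate(). The data frame `toy` and its columns P1 to
P4 are hypothetical stand-ins for adj0708, not the real data.

```r
# Toy stand-in for adj0708 (hypothetical data; only the column layout matches).
set.seed(1)
n <- 200
toy <- data.frame(
  MARGIN = rnorm(n),
  Poss   = sample(1:30, n, replace = TRUE),
  P1     = sample(c(-1, 0, 1), n, replace = TRUE),
  P2     = sample(c(-1, 0, 1), n, replace = TRUE),
  P3     = sample(c(-1, 0, 1), n, replace = TRUE),
  P4     = sample(c(-1, 0, 1), n, replace = TRUE)
)

# Every column except the response and the weights is a predictor.
preds <- setdiff(names(toy), c("MARGIN", "Poss"))

# All unordered pairs, written as interaction labels: "P1:P2", "P1:P3", ...
pair_terms <- combn(preds, 2, FUN = paste, collapse = ":")

# MARGIN ~ P1:P2 + P1:P3 + ... (pairwise interactions only, no main effects;
# lm() still adds an intercept by default).
f <- reformulate(pair_terms, response = "MARGIN")

fit <- lm(f, data = toy, weights = Poss)
```

With 339 predictors the same code would produce choose(339, 2) = 57,291
terms, which is more than the 35,657 rows, so lm() could not estimate
all of them, and a dense model matrix of that size would need roughly
16 GB of memory.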

Also, just a curious question that I can't seem to find an answer to
online: is there a model other than lm() that is more efficient for
very sparse data sets like mine?
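
As a rough sketch of one sparse alternative, assuming the Matrix
package (a recommended package that ships with R): store the design
matrix in sparse form and solve the weighted least-squares normal
equations directly. The data frame `toy` and columns P1 to P3 are
hypothetical, and this is a swapped-in technique for illustration, not
what lm() does internally.

```r
library(Matrix)

# Hypothetical toy data: mostly zeros, with a few 1s and -1s.
set.seed(2)
n <- 500
sparse_col <- function() {
  sample(c(-1, 0, 1), n, replace = TRUE, prob = c(0.1, 0.8, 0.1))
}
toy <- data.frame(
  MARGIN = rnorm(n),
  Poss   = sample(1:30, n, replace = TRUE),
  P1     = sparse_col(),
  P2     = sparse_col(),
  P3     = sparse_col()
)

# Sparse design matrix (intercept + predictors); zeros are not stored.
X <- sparse.model.matrix(~ P1 + P2 + P3, data = toy)

# Weighted least squares is ordinary least squares on rows scaled by sqrt(weight).
sw <- sqrt(toy$Poss)
Xw <- Diagonal(x = sw) %*% X
yw <- sw * toy$MARGIN

# Solve the normal equations (Xw'Xw) b = Xw'yw using sparse algebra.
beta <- solve(crossprod(Xw), crossprod(Xw, yw))
beta <- setNames(drop(as.matrix(beta)), colnames(X))
```

On this toy data the result should agree with
lm(MARGIN ~ P1 + P2 + P3, data = toy, weights = Poss). The normal
equations are less numerically stable than the QR decomposition lm()
uses, and the sketch assumes the design matrix has full column rank.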

Thanks,
Matt


On Mon, Feb 28, 2011 at 4:30 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> Don't put the name of the dataset in the formula, use the data argument to lm to provide that.  A single period (".") on the right hand side of the formula will represent all the columns in the data set that are not on the left hand side (you can then use "-" to remove any other columns that you don't want included on the RHS).
>
> For example:
>
>> lm(Sepal.Width ~ . - Sepal.Length, data=iris)
>
> Call:
> lm(formula = Sepal.Width ~ . - Sepal.Length, data = iris)
>
> Coefficients:
>      (Intercept)       Petal.Length        Petal.Width  Speciesversicolor
>           3.0485             0.1547             0.6234            -1.7641
>  Speciesvirginica
>          -2.1964
>
>
> But, are you sure that a regression model with 339 predictors will be meaningful?
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
>> project.org] On Behalf Of Matthew Douglas
>> Sent: Monday, February 28, 2011 1:32 PM
>> To: r-help at r-project.org
>> Subject: [R] Regression with many independent variables
>>
>> Hi,
>>
>> I am trying to use lm() on some data. The code works fine, but I would
>> like to use a more efficient way to do this.
>>
>> The data looks like this (the data is very sparse with a few 1s, -1s
>> and the rest 0s):
>>
>> > head(adj0708)
>>       MARGIN Poss P235 P247 P703 P218 P430 P489 P83 P307 P337 ....
>> 1   64.28571   29    0    0    0    0    0    0   0    0    0    0    0    0    0
>> 2 -100.00000    6    0    0    0    0    0    0   0    1    0    0    0    0    0
>> 3  100.00000    4    0    0    0    0    0    0   0    1    0    0    0    0    0
>> 4  -33.33333    7    0    0    0    0    0    0   0    0    0    0    0    0    0
>> 5  200.00000    2    0    0    0    0    0    0   0    0    0    0   -1    0    0
>> 6  -83.33333   12    0   -1    0    0    0    0   0    0    0    0    0    0    0
>>
>> adj0708 is actually a 35657x341 data set. Each column after "Poss" is
>> an independent variable, the dependent variable is "MARGIN" and it is
>> weighted by "Poss"
>>
>>
>> The regression is below:
>> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235 + adj0708$P247 +
>> adj0708$P703 + adj0708$P430 + adj0708$P489 + adj0708$P218 +
>> adj0708$P605 + adj0708$P337 + .... +
>> adj0708$P510,weights=adj0708$Poss)
>>
>> I have two questions:
>>
>> 1. Is there a way to condense how I write the independent variables
>> in the lm(), instead of having such a long line of code (I have 339
>> independent variables to be exact)?
>> 2. I would like to pair the data to look at a regression of the
>> interactions between two independent variables. I think it would look
>> something like this....
>> fit.adj0708 <- lm( adj0708$MARGIN~adj0708$P235:adj0708$P247 +
>> adj0708$P703:adj0708$P430 + adj0708$P489:adj0708$P218 +
>> adj0708$P605:adj0708$P337 + ....,weights=adj0708$Poss)
>> but there will be 339 choose 2 combinations, so a lot of independent
>> variables! Is there a more efficient way of writing this code?
>>
>> Thanks,
>> Matt
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-
>> guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>


