[R] lm model with many categorical variables
sezenismail at gmail.com
Tue Sep 20 14:24:31 CEST 2016
> On 20 Sep 2016, at 11:34, Michael Haenlein <haenlein at escpeurope.eu> wrote:
> Dear all,
> I am trying to estimate a lm model with one continuous dependent variable
> and 11 independent variables that are all categorical, some of which have
> many categories (several dozens in some cases).
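If I’m not wrong (and assuming the categorical variables are stored as factors), lm does not pick out particular categories. It expands each factor into dummy (indicator) variables, by default using treatment contrasts with the first level as the reference, so a factor with k levels contributes k - 1 coefficients. With several dozen levels per factor, that adds up to a lot of coefficients. (Please correct me if this is wrong.)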
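To see how lm handles factors, here is a minimal sketch on made-up data. The data frame d and the names y, f and g are assumptions for illustration only, not the original poster's data.

```r
# Toy data: two factors with 4 and 3 levels (all names are hypothetical).
set.seed(1)
d <- data.frame(
  f = factor(rep(c("a", "b", "c", "d"), times = 15)),
  g = factor(rep(c("u", "v", "w"), times = 20))
)
d$y <- rnorm(nrow(d))

fit <- lm(y ~ f + g, data = d)

# The design matrix has one intercept column plus (4 - 1) + (3 - 1)
# dummy columns; the first level of each factor is the reference.
X <- model.matrix(fit)
colnames(X)
# "(Intercept)" "fb" "fc" "fd" "gv" "gw"
```

Each coefficient other than the intercept is the difference between one level and the reference level of its factor, which is why many of them can look non-significant on their own.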
> I am not interested in statistical inference to a larger population. The
> objective of my model is to find a way to best predict my continuous
> variable within the sample.
A better pick would be a CART (Classification and Regression Tree, rpart) or CIT (Conditional Inference Tree, ctree) model, which can predict a continuous response variable from categorical predictors. For CIT, see the partykit package (the successor to the older party package).
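A minimal sketch of the rpart route, again on made-up data. The factor f, its 12 levels and the three underlying group means are assumptions chosen only to make the example run.

```r
library(rpart)

# Toy data: one factor with 12 levels whose true means fall into
# three groups (a hypothetical setup, not the poster's data).
set.seed(42)
lev <- paste0("L", 1:12)
d <- data.frame(f = factor(rep(lev, each = 20), levels = lev))
grp_mean <- rep(c(0, 5, 10), each = 4)
d$y <- grp_mean[as.integer(d$f)] + rnorm(nrow(d), sd = 0.5)

# Regression tree (method = "anova") for a continuous response.
tree <- rpart(y ~ f, data = d, method = "anova")

# In-sample predictions and root mean squared error.
pred <- predict(tree, newdata = d)
rmse_tree <- sqrt(mean((d$y - pred)^2))
rmse_tree
```

Printing tree, or calling plot(tree) followed by text(tree), shows which levels were sent down which branch.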
> When I run the lm model I evidently get many regression coefficients that
> are not significant. Is there some way to automatically combine levels of a
> categorical variable together if the regression coefficients for the
> individual levels are not significant?
> My idea is to find some form of grouping of the different categories that
> allows me to work with less levels while keeping or even improving the
> quality of predictions.
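A tree does this kind of grouping for you: each split on a factor sends one subset of its levels down each branch, so the leaves define groups of levels. I also want to mention cforest here, which lets you measure the importance of your predictor variables. I would recommend the partykit package for categorical predictors, but you can also give rpart a try.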
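On combining levels, one possible approach (a sketch, not a definitive recipe) is to use the leaves of a fitted tree as the new, coarser categories. The example uses rpart and its built-in variable.importance as a stand-in for cforest importance, because rpart ships with R. With partykit installed, cforest() and varimp() would be the analogous calls. All data and names below are made up.

```r
library(rpart)

# Toy data: f is informative (12 levels in three true groups),
# g is pure noise. Both are hypothetical.
set.seed(42)
lev <- paste0("L", 1:12)
d <- data.frame(
  f = factor(rep(lev, each = 20), levels = lev),
  g = factor(rep(c("p", "q", "r", "s", "t"), length.out = 240))
)
d$y <- rep(c(0, 5, 10), each = 4)[as.integer(d$f)] +
  rnorm(nrow(d), sd = 0.5)

tree <- rpart(y ~ f + g, data = d, method = "anova")

# Importance of each predictor as recorded by rpart.
imp <- tree$variable.importance

# tree$where gives the leaf each row falls into; the levels of f
# found in each leaf form the data-driven groups.
groups <- lapply(split(as.character(d$f), tree$where), unique)

# Collapsed factor: one level per leaf, usable in a smaller lm.
d$f_grouped <- factor(tree$where)
fit_small <- lm(y ~ f_grouped, data = d)
```

Here groups lists which original levels ended up together, and fit_small has one coefficient per group instead of one per original level. Whether the coarser model predicts as well as the full one on real data would still need checking.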