[R] Importance of levels in a factor variable

Greg Snow Greg.Snow at imail.org
Fri Aug 27 00:47:30 CEST 2010


A factor with 5000 levels looks like it may be a numeric variable that was accidentally coded as a factor (functions like read.table will do this if a non-numeric character is mixed in with the numbers).
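A quick way to check for this is sketched below. It is only an illustration: the data, the stray "n/a" entry, and the new column name var2_num are made up for the example, and the text= argument of read.table assumes a more recent R than 2.10.1 (with an older version, read from a file instead).

```r
# Simulated input: one stray non-numeric entry ("n/a") among the numbers.
txt <- "var2\n10\n25\nn/a\n40"

# stringsAsFactors = TRUE reproduces the historical read.table default,
# so the column comes back as a factor instead of numeric.
x <- read.table(text = txt, header = TRUE, stringsAsFactors = TRUE)
str(x)

# List the values that cannot be parsed as numbers.
vals <- as.character(x$var2)
bad <- unique(vals[is.na(suppressWarnings(as.numeric(vals)))])
print(bad)

# Convert through character: as.numeric() applied directly to a factor
# would return the internal integer codes, not the original values.
x$var2_num <- suppressWarnings(as.numeric(as.character(x$var2)))
print(x$var2_num)
```

If bad contains only a handful of entries (typos, missing-value codes), the variable was most likely meant to be numeric, and fixing those entries or passing them via na.strings when reading the data is probably the better route.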

If you really have a 5000-level factor, which levels can be discarded or combined is a question for the subject-specific scientist, not the statistician.
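Once the subject-matter expert has supplied a grouping, applying it in R is mechanical. A minimal sketch, in which the factor f and the grouping list are hypothetical stand-ins for the real variable and the expert's decisions:

```r
# Toy factor standing in for var2.
f <- factor(c("a", "b", "c", "d", "a", "c"))

# Hypothetical grouping chosen by the domain expert:
# each list name is a new level, each element lists the old levels it absorbs.
grouping <- list(AB = c("a", "b"), CD = c("c", "d"))

g <- f
levels(g) <- grouping   # old levels not named in the list would become NA

print(table(f))   # counts per original level
print(table(g))   # counts per combined level
```

Note that any level left out of the list is turned into NA, so this form should only be used when dropping those observations is intended.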

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-
> project.org] On Behalf Of Saeed Abu Nimeh
> Sent: Thursday, August 26, 2010 1:40 PM
> To: r-help at r-project.org
> Subject: [R] Importance of levels in a factor variable
> 
> I have a dataset of multiple variables and a response. For example,
> > str(x)
> 'data.frame':   3557238 obs. of  44 variables:
>  $ response :  Factor w/ 2 levels
>  $ var2: Factor w/5000 levels
> 
> 
> If var2, for example, is a factor with 5000 levels, what is the best
> approach to determine which of these levels are the most important to
> include in building the model, and which ones to discard? Assuming
> there is a way to do that, is it accurate to include only the
> important levels and discard the rest for that variable when building
> the model?
> Thanks,
> Saeed
> 
> ---
> > sessionInfo()
> R version 2.10.1 (2009-12-14)
> x86_64-pc-linux-gnu
> 32 GB RAM
> 


