[R] Importance of levels in a factor variable

Saeed Abu Nimeh sabunime at gmail.com
Fri Aug 27 20:14:55 CEST 2010


Thanks Greg. Actually, we do have 5000 levels and it is not an import
problem. I looked into combine.levels in the Hmisc package. The
problem with this approach is that it computes the frequency of each
level and then combines the infrequent levels into one level called
"Others". If you apply this to the complete dataset (positive and
negative samples), and the number of negative samples is much greater
than the number of positive ones, then most of the levels that occur in
the positive samples will end up in the "Others" level in the final
result. That's why I was wondering if there is a more accurate way to
remove the unimportant levels.
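
To make the concern concrete, here is a minimal sketch on made-up data.
It is only an illustration: the toy data frame, the helper names
lump_pooled and lump_by_class, and the 2% threshold are all assumptions,
and lump_pooled is a hand-written stand-in for frequency-based lumping,
not the Hmisc function itself. The second helper shows one possible
class-aware variant, which keeps a level if it is frequent enough within
at least one response class.

```r
# Toy data (hypothetical): many negatives, few positives, and the two
# classes use different levels of var2.
set.seed(1)
n_neg <- 10000
n_pos <- 200
x <- data.frame(
  response = factor(c(rep("neg", n_neg), rep("pos", n_pos))),
  var2 = factor(c(
    sample(paste("N", 1:20, sep = ""), n_neg, replace = TRUE),
    sample(paste("P", 1:5,  sep = ""), n_pos, replace = TRUE)
  ))
)

# Pooled lumping: levels rarer than min_frac over the whole dataset
# are merged into a single "Others" level.
lump_pooled <- function(f, min_frac = 0.02, other = "Others") {
  frac <- prop.table(table(f))
  keep <- names(frac)[frac >= min_frac]
  out <- as.character(f)
  out[!(out %in% keep)] <- other
  factor(out)
}

# Class-aware lumping: relative frequencies are computed separately
# within each response class, and a level is kept if it reaches
# min_frac in at least one class.
lump_by_class <- function(f, y, min_frac = 0.02, other = "Others") {
  tab  <- table(f, y)                   # levels x classes counts
  frac <- prop.table(tab, margin = 2)   # proportions within each class
  keep <- rownames(frac)[apply(frac >= min_frac, 1, any)]
  out <- as.character(f)
  out[!(out %in% keep)] <- other
  factor(out)
}

pooled   <- lump_pooled(x$var2)
by_class <- lump_by_class(x$var2, x$response)

# Compare how the positive samples are coded under each scheme.
table(pooled[x$response == "pos"])
table(droplevels(by_class[x$response == "pos"]))
```

On this toy data the pooled version sends every positive sample to
"Others", while the class-aware version keeps the levels seen in the
positives. Whether a per-class threshold is appropriate for the real
data is a separate question, and it does use the response to choose
levels, so it should be done inside any cross-validation.
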

On Thu, Aug 26, 2010 at 3:47 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> A factor with 5000 levels looks like it may be a numeric variable that was accidentally coded as a factor (functions like read.table will do this if there is a non-numeric character in with the numbers).
>
> If you really have a 5000-level factor, which levels can be discarded or combined is a question for the subject-specific scientist, not the statistician.
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Saeed Abu Nimeh
>> Sent: Thursday, August 26, 2010 1:40 PM
>> To: r-help at r-project.org
>> Subject: [R] Importance of levels in a factor variable
>>
>> I have a dataset of multiple variables and a response. For example,
>> > str(x)
>> 'data.frame':   3557238 obs. of  44 variables:
>>  $ response :  Factor w/ 2 levels
>>  $ var2: Factor w/5000 levels
>>
>>
>> If var2, for example, is a factor with 5000 levels, what is the best
>> approach to determine which of these levels are the most important to
>> include in building the model, and which ones to discard? Assuming
>> there is a way to do that, is it accurate to include only the
>> important levels and discard the rest for that variable when building
>> the model?
>> Thanks,
>> Saeed
>>
>> ---
>> > sessionInfo()
>> R version 2.10.1 (2009-12-14)
>> x86_64-pc-linux-gnu
>> 32 GB RAM
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>


