[R] how to collapse categories or re-categorize variables?

Henric Winell nilsson.henric at gmail.com
Mon Jul 19 19:06:51 CEST 2010


On 2010-07-17 23:03, Peter Dalgaard wrote:
> Ista Zahn wrote:
>> Hi,
>> On Fri, Jul 16, 2010 at 5:18 PM, CC <turtysmail at gmail.com> wrote:
>>> I am sure this is a very basic question:
>>>
>>> I have 600,000 categorical variables in a data.frame - each of which is
>>> classified as "0", "1", or "2"
>>>
>>> What I would like to do is collapse "1" and "2" and leave "0" by itself,
>>> such that after re-categorizing "0" = "0"; "1" = "1" and "2" = "1" --- in
>>> the end I only want "0" and "1" as categories for each of the variables.
>> Something like this should work
>>
>> for (i in names(dat)) {
>> dat[, i] <- factor(dat[, i], levels = c("0", "1", "2"), labels =
>> c("0", "1", "1"))
>> }
> 
> Unfortunately, it won't:
> 
>> d <- 0:2
>> factor(d, levels=c(0,1,1))
> [1] 0    1    <NA>
> Levels: 0 1 1
> Warning message:
> In `levels<-`(`*tmp*`, value = c("0", "1", "1")) :
>   duplicated levels will not be allowed in factors anymore
> 
> 
> This effect, I have been told, goes way back to design choices in S
> (that you can have repeated level names) plus compatibility ever since.
> 
> It would make more sense if it behaved like
> 
> d <- factor(d); levels(d) <- c(0,1,1)
> 
> and maybe, some time in the future, it will. Meanwhile, the above is the
> workaround.
> 
> (BTW, if there are 600000 variables, you probably don't want to iterate
> over their names, more likely "for(i in seq_along(dat))...")
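
Spelled out for a whole data frame, Peter's workaround might look
something like the sketch below. The data frame 'dat' and its two
columns are made up for illustration; the levels are fixed explicitly
so that the three-element replacement lines up with "0", "1", "2".

```r
## Hypothetical example data: two variables coded 0/1/2
dat <- data.frame(V1 = c(0, 1, 2, 2, 0),
                  V2 = c(2, 2, 1, 0, 1))

## Loop over column positions rather than column names
for (i in seq_along(dat)) {
    f <- factor(dat[[i]], levels = 0:2)  # fix the level order explicitly
    levels(f) <- c("0", "1", "1")        # duplicated names merge "1" and "2"
    dat[[i]] <- f
}
```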

You could also use 'lapply' with 'levels<-':

 > ### Example data
 > set.seed(1)
 > d <- 0:2
 > DF <- data.frame(X1 = factor(sample(d, size = 10, replace = TRUE)),
+                  X2 = factor(sample(d, size = 10, replace = TRUE)))
 > DF
    X1 X2
1   0  0
2   1  0
3   1  2
4   2  1
5   0  2
6   2  1
7   2  2
8   1  2
9   1  1
10  0  2
 >
 > ### Collapse levels "1" and "2" into "1"
 > DF[] <- lapply(DF, function(x) "levels<-"(x, c("0", "1", "1")))
 > DF
    X1 X2
1   0  0
2   1  0
3   1  1
4   1  1
5   0  1
6   1  1
7   1  1
8   1  1
9   1  1
10  0  1
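
One caveat: assigning a plain character vector relies on every column
having its levels in the order "0", "1", "2". A column that never takes
the value 0, say, has levels "1" and "2" only and would be relabelled
incorrectly. The list form of 'levels<-' (see ?levels) maps old levels
to new ones by name and avoids this. A small sketch with made-up data
('DF2' and 'collapse12' are just names chosen for this example):

```r
## Made-up data: X2 never takes the value 0
DF2 <- data.frame(X1 = factor(c(0, 1, 2, 2)),
                  X2 = factor(c(1, 2, 2, 1)))

## Map old levels to new ones by name; "1" and "2" both become "1"
collapse12 <- function(x) {
    levels(x) <- list("0" = "0", "1" = c("1", "2"))
    x
}
DF2[] <- lapply(DF2, collapse12)
```

Both columns then carry the levels "0" and "1", whether or not a given
value actually occurs in the data.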


HTH,
Henric


