[R] tapply bug? - levels of a factor in a data frame after tapply are intermixed

Fri Feb 13 19:09:23 CET 2009

on 02/13/2009 11:38 AM Dimitri Liakhovitski wrote:
> On Fri, Feb 13, 2009 at 12:24 PM, Marc Schwartz
> <marc_schwartz at comcast.net> wrote:
>> on 02/13/2009 11:09 AM Dimitri Liakhovitski wrote:
>>> Hello! I have encountered a really weird problem. Maybe you've
>>> encountered it before?
>>> I have a large data frame "importances". It has one factor ($A) with 3
>>> levels: 3, 9, and 15. $B is a regular numeric variable.
>>> Below I am picking a really small sub-frame (just 3 rows) based on
>>> "indices". "indices" were chosen so that all 3 levels of A are
>>> present:
>>>
>>> indices=c(14329,14209,14353)
>>> test=data.frame(yy=importances[["B"]][indices],xx=importances[["A"]][indices])
>>> Here is what the new data frame "test" looks like:
>>>
>>>             yy        xx
>>> 1 -0.009984006  9
>>> 2 -2.339904131  3
>>> 3 -0.008427385 15
>>>
>>> Here is the structure of "test":
>>>> str(test)
>>> 'data.frame':   3 obs. of  2 variables:
>>>  $ yy: num  -0.00998 -2.3399 -0.00843
>>>  $ xx: Factor w/ 3 levels "3","9","15": 2 1 3
>>>
>>> Notice - the order of factor levels for xx is not 1 2 3 as it should
>>> be but 2 1 3. How come?
>>>
>>> Or also look at this:
>>>> test$xx
>>> [1] 9  3  15
>>> Levels: 3 9 15
>>>
>>> Same thing.
>>> Do you know what might be the reason?
>>>
>>> Thank you very much!
>> The output of str() is showing you the factor levels of test$xx,
>> followed by the internal integer codes for the three actual values of
>> test$xx, 9, 3, and 15:
>>
>>> str(test$xx)
>>  Factor w/ 3 levels "3","9","15": 2 1 3
>>
>>> levels(test$xx)
>> [1] "3"  "9"  "15"
>>
>>> as.integer(test$xx)
>> [1] 2 1 3
>>
>> 9 is the second level, hence the 2
>> 3 is the first level, hence the 1
>> 15 is the third level, hence the 3.
>>
>> No problems, just clarification needed on what you are seeing.
>>
>> Note that you do not reference anything above regarding tapply() as per
>> your subject line, though I suspect that I have an idea as to why you did...
>>
>> HTH,
>>
>> Marc Schwartz
>>
>>
> 
> Marc (and everyone), I expected it to show:
> $ xx: Factor w/ 3 levels "3","9","15":  1 2 3
> rather than what I am seeing:
> $ xx: Factor w/ 3 levels "3","9","15":  2 1 3
> Because 3 is level 1, 9 is level 2 and 15 is level 3.
> I have several other factors in my original data frame. And I've done
> that tapply for all of them (for the same dependent variable) - and in
> all of them the first level was 1, the second 2, etc.
> Why am I concerned about the problem? Because I am plotting the means
> of the numeric variable against the levels of the factor and it's
> important to me that the factor levels are correct (in the right
> order)...
> Dimitri

Dimitri,

The above examples that you have are the expected output given the data
that you provided, including the ordering of the explicit row indices
that you used.
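
As a minimal sketch of the distinction between a factor's levels and its
integer codes, the toy factor below (a hypothetical stand-in for your three
selected rows, not your actual 'importances' data) is built from the values
9, 3 and 15 in that row order. The codes are positions in levels(), so they
follow the order of the rows, not the order of the levels:

```r
# Toy factor (hypothetical): same three values, in the same row order,
# as in the 'test' data frame quoted above
xx <- factor(c(9, 3, 15))

lev   <- levels(xx)       # the sorted set of distinct values, as strings
codes <- as.integer(xx)   # for each element, its position within 'lev'
vals  <- as.character(xx) # the values as labels, in row order

# Looking the codes up in the levels recovers the original values
recovered <- lev[codes]

print(lev)
print(codes)
print(vals)
print(recovered)
```

The codes would only read 1 2 3 if the rows themselves happened to be
sorted by level.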

If we create some sample data, using something along the lines of your
original description:

set.seed(1)
A <- sample(factor(c(3, 9, 15)), 100, replace = TRUE)

set.seed(2)
B <- rnorm(100)

DF <- data.frame(A = A, B = B)

> head(DF)
   A           B
1  3 -0.89691455
2  9  0.18484918
3  9  1.58784533
4 15 -1.13037567
5  3 -0.08025176
6 15  0.13242028

> str(DF)
'data.frame':	100 obs. of  2 variables:
 $ A: Factor w/ 3 levels "3","9","15": 1 2 2 3 1 3 3 2 2 1 ...
 $ B: num  -0.8969 0.1848 1.5878 -1.1304 -0.0803 ...

I then use tapply() to get the means:

> tapply(DF$B, list(A = DF$A), mean)
A
          3           9          15
 0.10620274  0.08577537 -0.26276438
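
A quick way to check how such a result is ordered, sketched on small
made-up data (the names 'grp' and 'val' are hypothetical and not part of
your data): the names of the vector returned by tapply() should match
levels() of the grouping factor, whatever order the rows come in.

```r
# Made-up data: group labels deliberately not in sorted row order
grp <- factor(c(9, 3, 15, 3, 9, 15))
val <- c(1, 2, 3, 4, 5, 6)

# One mean per level of 'grp'
m <- tapply(val, grp, mean)

# The result is named by, and ordered like, levels(grp)
print(levels(grp))
print(names(m))
print(m)
```

Since the result carries the level names, it can be handed to something
like barplot(m) and the bars should appear in level order.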

The output is in the order one would expect. If you want something else,
then you may have to check the factor levels for 'A' and alter them to
the ordering that you actually want. For example:

DF$A <- factor(DF$A, levels = c("9", "3", "15"))

  or

DF$A <- relevel(DF$A, ref = "9")

> str(DF)
'data.frame':	100 obs. of  2 variables:
 $ A: Factor w/ 3 levels "9","3","15": 2 1 1 3 2 3 3 1 1 2 ...
 $ B: num  -0.8969 0.1848 1.5878 -1.1304 -0.0803 ...

which would then adjust the ordering of the tapply() output to:

> tapply(DF$B, list(A = DF$A), mean)
A
          9           3          15
 0.08577537  0.10620274 -0.26276438
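
One caution when changing level order, shown here on a toy factor (the
names 'f', 'g' and 'h' are hypothetical): factor(x, levels = ...) reorders
the levels and leaves the data values alone, whereas assigning to
levels(x) renames the existing levels in place, so the underlying values
change.

```r
# Toy factor with levels "3", "9", "15"
f <- factor(c(3, 9, 9, 15))

# Reordering: same values, new level order
g <- factor(f, levels = c("9", "3", "15"))

# Relabelling: the codes stay put, so every 3 becomes a 9 and vice versa
h <- f
levels(h) <- c("9", "3", "15")

print(as.character(f))
print(as.character(g))
print(as.character(h))
```

Only the first form is safe if the aim is just to change the order in
which tapply() reports the groups.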

Is that perhaps what you are looking for?

Marc