[R] tapply bug? - levels of a factor in a data frame after tapply are intermixed

Dimitri Liakhovitski ld7631 at gmail.com
Fri Feb 13 19:26:01 CET 2009


Both Greg and Marc - thank you so much!

It helped a lot. What I just discovered also works (similar to Greg's
suggestions): first make it a character vector and THEN do:
as.factor(as.numeric(original character vector)).
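
A minimal sketch of that conversion, in case it helps someone searching the archive. The vector x_chr is made up for illustration (it is not my real data); the point is only that a factor built from strings sorts its levels as strings, while going through as.numeric() first sorts them numerically:

```r
# Hypothetical data: the original values stored as a character vector
x_chr <- c("9", "3", "15", "9", "15", "9", "3")

# Converting straight to a factor sorts the levels as strings
f_chr <- as.factor(x_chr)
levels(f_chr)   # "15" "3" "9"

# Converting to numeric first sorts the levels numerically
f_num <- as.factor(as.numeric(x_chr))
levels(f_num)   # "3" "9" "15"
```

Note that this gives sorted numeric order, not order of appearance; for the latter, the levels= argument shown in Greg's message is the tool to use.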

Wow! R never stops surprising one - and I am just at the beginning of
the journey!
Thank you!
Dimitri



On Fri, Feb 13, 2009 at 1:13 PM, Greg Snow <Greg.Snow at imail.org> wrote:
> It comes down to 2 simple rules:
>
> 1. If you don't care about the order of the factor levels, then it doesn't matter how R codes the relationship.
> 2. If you do care about the order, then tell R what order you want.
>
> Consider the following:
>
>> x <- c(9,3,15,9,15,9,3)
>> factor(x)
> [1] 9  3  15 9  15 9  3
> Levels: 3 9 15
>> factor(as.character(x))
> [1] 9  3  15 9  15 9  3
> Levels: 15 3 9
>> factor(x, levels=unique(x))
> [1] 9  3  15 9  15 9  3
> Levels: 9 3 15
>
> The last looks most like what you want, but for many uses, all 3 will give equivalent results.
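
To connect this with tapply(): as far as I can tell, the result of tapply() is laid out in the order of the factor levels, so choosing the levels also chooses the order of the group means. A small sketch under that assumption; the y values are invented purely for illustration:

```r
# Hypothetical data, made up for illustration
x <- c(9, 3, 15, 9, 15, 9, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)

f_sorted <- factor(x)                      # levels: 3 9 15
f_seen   <- factor(x, levels = unique(x))  # levels: 9 3 15

# Group means of y; the result follows the level order of the factor
m_sorted <- tapply(y, f_sorted, mean)
m_seen   <- tapply(y, f_seen, mean)

names(m_sorted)  # "3" "9" "15"
names(m_seen)    # "9" "3" "15"
```

The means themselves are the same in both cases; only their order (and hence the order on a plot axis) differs.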
>
> Hope this helps,
>
> --
> Gregory (Greg) L. Snow Ph.D.
> Statistical Data Center
> Intermountain Healthcare
> greg.snow at imail.org
> 801.408.8111
>
>
>> -----Original Message-----
>> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org]
>> On Behalf Of Dimitri Liakhovitski
>> Sent: Friday, February 13, 2009 10:54 AM
>> To: marc_schwartz at comcast.net
>> Cc: R-Help List
>> Subject: Re: [R] tapply bug? - levels of a factor in a data frame after
>> tapply are intermixed
>>
>> Sorry - one clarification:
>> When I run:
>> > test$xx - what I am currently seeing is:
>>  [1] 9  3  15
>>  Levels: 3 9 15
>> But what I am expecting to be seeing is:
>>  [1] 9  3  15
>>  Levels: 9 3 15
>> Or maybe: Levels: 2 1 3
>>
>>
>> On Fri, Feb 13, 2009 at 12:38 PM, Dimitri Liakhovitski
>> <ld7631 at gmail.com> wrote:
>> > On Fri, Feb 13, 2009 at 12:24 PM, Marc Schwartz
>> > <marc_schwartz at comcast.net> wrote:
>> >> on 02/13/2009 11:09 AM Dimitri Liakhovitski wrote:
>> >>> Hello! I have encountered a really weird problem. Maybe you've
>> >>> encountered it before?
>> >>> I have a large data frame "importances". It has one factor ($A) with 3
>> >>> levels: 3, 9, and 15. $B is a regular numeric variable.
>> >>> Below I am picking a really small sub-frame (just 3 rows) based on
>> >>> "indices". "indices" were chosen so that all 3 levels of A are
>> >>> present:
>> >>>
>> >>> indices=c(14329,14209,14353)
>> >>>
>> >>> test=data.frame(yy=importances[["B"]][indices],xx=importances[["A"]][indices])
>> >>> Here is what the new data frame "test" looks like:
>> >>>
>> >>>             yy        xx
>> >>> 1 -0.009984006  9
>> >>> 2 -2.339904131  3
>> >>> 3 -0.008427385 15
>> >>>
>> >>> Here is the structure of "test":
>> >>>> str(test)
>> >>> 'data.frame':   3 obs. of  2 variables:
>> >>>  $ yy: num  -0.00998 -2.3399 -0.00843
>> >>>  $ xx: Factor w/ 3 levels "3","9","15": 2 1 3
>> >>>
>> >>> Notice - the order of factor levels for xx is not 1 2 3 as it should
>> >>> be but 2 1 3. How come?
>> >>>
>> >>> Or also look at this:
>> >>>> test$xx
>> >>> [1] 9  3  15
>> >>> Levels: 3 9 15
>> >>>
>> >>> Same thing.
>> >>> Do you know what might be the reason?
>> >>>
>> >>> Thank you very much!
>> >>
>> >> The output of str() is showing you the factor levels of test$xx,
>> >> followed by the internal integer codes for the three actual values of
>> >> test$xx, 9, 3, and 15:
>> >>
>> >>> str(test$xx)
>> >>  Factor w/ 3 levels "3","9","15": 2 1 3
>> >>
>> >>> levels(test$xx)
>> >> [1] "3"  "9"  "15"
>> >>
>> >>> as.integer(test$xx)
>> >> [1] 2 1 3
>> >>
>> >> 9 is the second level, hence the 2
>> >> 3 is the first level, hence the 1
>> >> 15 is the third level, hence the 3.
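
The same mapping can be checked directly. This is only a sketch on a made-up three-element factor (not the real "importances" data), showing how the integer codes index into the levels and how to get the original numbers back:

```r
# Hypothetical stand-in for test$xx
xx <- factor(c(9, 3, 15))        # levels sorted numerically: "3" "9" "15"

codes    <- as.integer(xx)             # 2 1 3: positions in levels(xx)
labels   <- levels(xx)[codes]          # "9" "3" "15": codes mapped back to labels
original <- as.numeric(as.character(xx))  # 9 3 15: the original numbers
```

Calling as.numeric() on the factor itself would return the codes (2 1 3), not the values, which is why the as.character() step is needed.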
>> >>
>> >> No problems, just clarification needed on what you are seeing.
>> >>
>> >> Note that you do not reference anything above regarding tapply() as per
>> >> your subject line, though I suspect that I have an idea as to why you did...
>> >>
>> >> HTH,
>> >>
>> >> Marc Schwartz
>> >>
>> >>
>> >
>> > Marc (and everyone), I expected it to show:
>> > $ xx: Factor w/ 3 levels "3","9","15":  1 2 3
>> > rather than what I am seeing:
>> > $ xx: Factor w/ 3 levels "3","9","15":  2 1 3
>> > Because 3 is level 1, 9 is level 2 and 15 is level 3.
>> > I have several other factors in my original data frame. And I've done
>> > that tapply for all of them (for the same dependent variable) - and in
>> > all of them the first level was 1, the second 2, etc.
>> > Why am I concerned about the problem? Because I am plotting the means
>> > of the numeric variable against the levels of the factor and it's
>> > important to me that the factor levels are correct (in the right
>> > order)...
>> > Dimitri
>> >
>> >
>> > --
>> > Dimitri Liakhovitski
>> > MarketTools, Inc.
>> > Dimitri.Liakhovitski at markettools.com
>> >
>>
>>
>>
>> --
>> Dimitri Liakhovitski
>> MarketTools, Inc.
>> Dimitri.Liakhovitski at markettools.com
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Dimitri Liakhovitski
MarketTools, Inc.
Dimitri.Liakhovitski at markettools.com



