[R] tapply bug? - levels of a factor in a data frame after tapply are intermixed

Greg Snow Greg.Snow at imail.org
Fri Feb 13 19:13:33 CET 2009


It comes down to 2 simple rules:

1. If you don't care about the order of the factor levels, then it doesn't matter how R codes the relationship.
2. If you do care about the order, then tell R what order you want.

Consider the following:

> x <- c(9,3,15,9,15,9,3)
> factor(x)
[1] 9  3  15 9  15 9  3 
Levels: 3 9 15
> factor(as.character(x))
[1] 9  3  15 9  15 9  3 
Levels: 15 3 9
> factor(x, levels=unique(x))
[1] 9  3  15 9  15 9  3 
Levels: 9 3 15

The last looks most like what you want, but for many uses, all 3 will give equivalent results.
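
As a rough sketch of rule 2 in the tapply setting from the original question: the y values below are made up for illustration (they are not the poster's data), and the variable names are my own. It shows that the integer codes are just positions in levels(), and that tapply() returns its results in level order, so choosing the level order chooses the order of the means.

```r
# Hypothetical data; y is invented purely for illustration.
x <- c(9, 3, 15, 9, 15, 9, 3)
y <- c(1, 2, 3, 4, 5, 6, 7)

# Default: levels are the sorted unique values (3, 9, 15).
f_default <- factor(x)

# Explicit: levels in the order we ask for (9, 3, 15).
f_custom <- factor(x, levels = c(9, 3, 15))

# The integer codes are positions in levels(), not the values themselves.
codes_default <- as.integer(f_default)  # 2 1 3 2 3 2 1
codes_custom  <- as.integer(f_custom)   # 1 2 3 1 3 1 2

# tapply() gives one mean per level, ordered by the levels.
means_default <- tapply(y, f_default, mean)  # named "3", "9", "15"
means_custom  <- tapply(y, f_custom, mean)   # named "9", "3", "15"

# To get the original numbers back, go through the labels,
# not through the integer codes.
values_back <- as.numeric(as.character(f_custom))
```

The means themselves are the same either way; only the order in which they are reported (and hence plotted) differs.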

Hope this helps,

-- 
Gregory (Greg) L. Snow Ph.D.
Statistical Data Center
Intermountain Healthcare
greg.snow at imail.org
801.408.8111


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Dimitri Liakhovitski
> Sent: Friday, February 13, 2009 10:54 AM
> To: marc_schwartz at comcast.net
> Cc: R-Help List
> Subject: Re: [R] tapply bug? - levels of a factor in a data frame after
> tapply are intermixed
> 
> Sorry - one clarification:
> When I run:
> > test$xx - what I am currently seeing is:
>  [1] 9  3  15
>  Levels: 3 9 15
> But what I am expecting to be seeing is:
>  [1] 9  3  15
>  Levels: 9 3 15
> Or maybe: Levels: 2 1 3
> 
> 
> On Fri, Feb 13, 2009 at 12:38 PM, Dimitri Liakhovitski
> <ld7631 at gmail.com> wrote:
> > On Fri, Feb 13, 2009 at 12:24 PM, Marc Schwartz
> > <marc_schwartz at comcast.net> wrote:
> >> on 02/13/2009 11:09 AM Dimitri Liakhovitski wrote:
> >>> Hello! I have encountered a really weird problem. Maybe you've
> >>> encountered it before?
> >>> I have a large data frame "importances". It has one factor ($A) with 3
> >>> levels: 3, 9, and 15. $B is a regular numeric variable.
> >>> Below I am picking a really small sub-frame (just 3 rows) based on
> >>> "indices". "indices" were chosen so that all 3 levels of A are
> >>> present:
> >>>
> >>> indices=c(14329,14209,14353)
> >>>
> >>> test=data.frame(yy=importances[["B"]][indices],xx=importances[["A"]][indices])
> >>> Here is what the new data frame "test" looks like:
> >>>
> >>>             yy        xx
> >>> 1 -0.009984006  9
> >>> 2 -2.339904131  3
> >>> 3 -0.008427385 15
> >>>
> >>> Here is the structure of "test":
> >>>> str(test)
> >>> 'data.frame':   3 obs. of  2 variables:
> >>>  $ yy: num  -0.00998 -2.3399 -0.00843
> >>>  $ xx: Factor w/ 3 levels "3","9","15": 2 1 3
> >>>
> >>> Notice - the order of factor levels for xx is not 1 2 3 as it should
> >>> be but 2 1 3. How come?
> >>>
> >>> Or also look at this:
> >>>> test$xx
> >>> [1] 9  3  15
> >>> Levels: 3 9 15
> >>>
> >>> Same thing.
> >>> Do you know what might be the reason?
> >>>
> >>> Thank you very much!
> >>
> >> The output of str() is showing you the factor levels of test$xx,
> >> followed by the internal integer codes for the three actual values of
> >> test$xx, 9, 3, and 15:
> >>
> >>> str(test$xx)
> >>  Factor w/ 3 levels "3","9","15": 2 1 3
> >>
> >>> levels(test$xx)
> >> [1] "3"  "9"  "15"
> >>
> >>> as.integer(test$xx)
> >> [1] 2 1 3
> >>
> >> 9 is the second level, hence the 2
> >> 3 is the first level, hence the 1
> >> 15 is the third level, hence the 3.
> >>
> >> No problems, just clarification needed on what you are seeing.
> >>
> >> Note that you do not reference anything above regarding tapply() as per
> >> your subject line, though I suspect that I have an idea as to why you did...
> >>
> >> HTH,
> >>
> >> Marc Schwartz
> >>
> >>
> >
> > Marc (and everyone), I expected it to show:
> > $ xx: Factor w/ 3 levels "3","9","15":  1 2 3
> > rather than what I am seeing:
> > $ xx: Factor w/ 3 levels "3","9","15":  2 1 3
> > Because 3 is level 1, 9 is level 2 and 15 is level 3.
> > I have several other factors in my original data frame. And I've done
> > that tapply for all of them (for the same dependent variable) - and in
> > all of them the first level was 1, the second 2, etc.
> > Why am I concerned about the problem? Because I am plotting the means
> > of the numeric variable against the levels of the factor and it's
> > important to me that the factor levels are correct (in the right
> > order)...
> > Dimitri
> >
> >
> > --
> > Dimitri Liakhovitski
> > MarketTools, Inc.
> > Dimitri.Liakhovitski at markettools.com
> >
> 
> 
> 
> --
> Dimitri Liakhovitski
> MarketTools, Inc.
> Dimitri.Liakhovitski at markettools.com
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



