[R] drop rare factors

Sarah Goslee sarah.goslee at gmail.com
Wed Jan 18 23:36:16 CET 2012


Here's one way, worked out in lots of steps so you can see
how each works:

> mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times=c(1000, 2000, 30, 4))), something = runif(3034))
> str(mydata)
'data.frame':	3034 obs. of  2 variables:
 $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
 $ something: num  0.725 0.222 0.347 0.614 0.968 ...
>
> table(mydata$MyFactor)

   A    B    C    D
1000 2000   30    4
>
>
> important.levels <- table(mydata$MyFactor) / nrow(mydata)
> important.levels <- names(important.levels)[important.levels > .01]
> important.levels
[1] "A" "B"
>
> newdata <- mydata[mydata$MyFactor %in% important.levels, ]
> table(newdata$MyFactor)

   A    B    C    D
1000 2000    0    0
>
>
> newdata$MyFactor <- factor(newdata$MyFactor, levels=important.levels)
> table(newdata$MyFactor)

   A    B
1000 2000
>
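
Once the individual steps make sense, they can be folded into a small
function. This is only a sketch: the name drop.rare.levels and its
arguments are made up for illustration, and it uses prop.table() and
droplevels() in place of the manual division and the factor() call
above. Note that table() ignores NA values by default, so here the
proportions are taken over the non-missing rows rather than nrow().

```r
# Hypothetical helper (not part of the transcript above): keep only the rows
# whose factor level accounts for more than `min.prop` of the observations.
drop.rare.levels <- function(df, column, min.prop = 0.01) {
  # Proportion of (non-NA) rows at each level of the chosen factor column
  props <- prop.table(table(df[[column]]))
  keep <- names(props)[props > min.prop]
  # Subset the rows, then drop the levels that are no longer used
  out <- df[df[[column]] %in% keep, , drop = FALSE]
  out[[column]] <- droplevels(out[[column]])
  out
}

# Same example data as in the transcript
mydata <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
                     something = runif(3034))
newdata <- drop.rare.levels(mydata, "MyFactor")
table(newdata$MyFactor)
```

With the example data this should leave only levels A and B, with 1000
and 2000 rows, matching the last table above. For several factor columns
you could call it once per column, keeping in mind that each call
removes rows and so changes the proportions seen by the next one.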


On Wed, Jan 18, 2012 at 5:25 PM, Sam Steingold <sds at gnu.org> wrote:
> I have a data frame with some factor columns.
> I want to drop the rows with rare factor values
> (and remove the factor values from the factors).
> E.g.,  frame$MyFactor takes values
> A 1,000 times,
> B 2,000 times,
> C 30 times and
> D 4 times.
> I want to remove all rows which assume rare values (<1%), i.e., C and D.
> i.e.,
> frame <- frame[[! (frame$MyFactor %in% c("A","B"))]]
> except that I probably got the syntax wrong
> and I want c("A","B") to be generated automatically from frame$MyFactor
> and the number 0.01 (1%).
>
> Thanks!

-- 
Sarah Goslee
http://www.functionaldiversity.org


