[R] drop rare factors

Sam Steingold sds at gnu.org
Thu Jan 19 21:43:08 CET 2012


create data:

mydata <- data.frame(
  MyFactor = factor(rep(LETTERS[1:4], times = c(1000, 2000, 30, 4))),
  something = runif(3034))

define function:

drop.levels <- function (df, column, threshold) {
  size <- nrow(df)
  if (threshold < 1) threshold <- threshold * size
  tab <- table(df[column])
  keep <- names(tab)[tab >  threshold]
  drop <- names(tab)[tab <= threshold]
  cat("Keep(",column,")",length(keep),"\n"); print(tab[keep])
  cat("Drop(",column,")",length(drop),"\n"); print(tab[drop])
  str(df)
  df <- df[df[column] %in% keep, ]
  str(df)
  size1 <- nrow(df)
  cat("Rows:",size,"-->",size1,"(dropped",100*(size-size1)/size,"%)\n")
  df[column] <- factor(df[column], levels=keep)
  df
}

call the function on the data:

drop.levels(mydata,"MyFactor",5)
Keep( MyFactor ) 3 

   A    B    C 
1000 2000   30 
Drop( MyFactor ) 1 
D 
4 
'data.frame':	3034 obs. of  2 variables:
 $ MyFactor : Factor w/ 4 levels "A","B","C","D": 1 1 1 1 1 1 1 1 1 1 ...
 $ something: num  0.725 0.741 0.608 0.681 0.993 ...
'data.frame':	0 obs. of  2 variables:
 $ MyFactor : Factor w/ 4 levels "A","B","C","D": 
 $ something: num 
Rows: 3034 --> 0 (dropped 100 %)
Error in `[<-.data.frame`(`*tmp*`, column, value = NA_integer_) : 
  replacement has 1 rows, data has 0

----- why is there a blank line between "Keep( MyFactor ) 3" and "A    B    C"
 but no blank line between "Drop" and "D"?
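
On the blank line, one possible explanation (a sketch, not a definitive answer): subsetting a one-dimensional table seems to keep its dim attribute when more than one element is selected, but drops to a plain named vector when exactly one element is selected. The blank line would then be the empty dimnames header, which is printed only for the table. The vector x below is made up for illustration and is not the data from this post.

```r
# Hypothetical toy data, not the data frame from the post.
x <- rep(c("A", "B", "C", "D"), times = c(5, 6, 3, 1))
tab <- table(x)

several <- tab[c("A", "B", "C")]  # more than one element selected
single  <- tab["D"]               # exactly one element selected

# The multi-element subset still carries one dimension ...
length(dim(several))
# ... while the single-element subset has lost its dim attribute.
is.null(dim(single))
```

If that is right, the difference in the printout comes from the subsetting rule, not from cat() or print().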

----- why does "df[df[column] %in% keep, ]" empty out the data frame?
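
A likely cause, sketched under the assumption that the single brackets are the culprit: df[column] returns a one-column data frame (a list of length 1), so %in% compares the column as a whole and returns a single FALSE, which selects no rows. df[[column]] extracts the factor itself. The small data frame d below is a shrunken, hypothetical stand-in for mydata.

```r
# Hypothetical stand-in for mydata, with smaller group sizes.
d <- data.frame(MyFactor = factor(rep(LETTERS[1:4], times = c(5, 6, 3, 1))),
                something = seq_len(15))
keep <- c("A", "B", "C")

# Single brackets: a one-column data frame, so %in% yields one value only.
one_col <- d["MyFactor"] %in% keep
length(one_col)

# Double brackets: the factor itself, so %in% yields one value per row.
by_row <- d[["MyFactor"]] %in% keep
sum(by_row)

# Subset the rows, then rebuild the factor with only the kept levels.
kept <- d[by_row, ]
kept[["MyFactor"]] <- factor(kept[["MyFactor"]], levels = keep)
nlevels(kept[["MyFactor"]])
```

The same reasoning would apply to factor(df[column], levels=keep) in the function: there too, df[[column]] is probably what was meant, which would also explain the "replacement has 1 rows" error.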

thanks!


> Remind the list what you're trying to do. The list gets lots of traffic;
> if you delete out all the context nobody will remember what you need.

Sorry, I assumed that people can easily access the parent messages.

-- 
Sam Steingold (http://sds.podval.org/) on Ubuntu 11.10 (oneiric) X 11.0.11004000
http://www.PetitionOnline.com/tap12009/ http://pmw.org.il
http://mideasttruth.com http://memri.org http://openvotingconsortium.org
"Syntactic sugar causes cancer of the semicolon."	-Alan Perlis


