[R] subset a data frame by largest frequencies of factors

David L Carlson dcarlson at tamu.edu
Thu Mar 5 21:15:49 CET 2015


These two commands will compute the cell frequencies and then sort them:

e <- as.data.frame(xtabs(~ctry+member, Dataset))
f <- e[order(e$Freq, decreasing=TRUE),]

Then draw your subset

g <- head(f, 10)

or

g <- f[cumsum(f$Freq)/sum(f$Freq) >.8,]

Finally merge the sample with the original data and delete the unused factor levels:

sample <- merge(Dataset, g[,-3])
sample$ctry <- factor(sample$ctry)
sample$member <- factor(sample$member)

-------------------------------------
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352


-----Original Message-----
From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of Michael Friendly
Sent: Thursday, March 5, 2015 12:45 PM
To: R-help
Subject: [R] subset a data frame by largest frequencies of factors

A consulting client has a large data set with a binary response 
(negative) and two factors (ctry and member) which have many levels,
but many occur with very small frequencies.  It is far too sparse with a 
model like glm(negative ~ ctry+member, family=binomial).

 > str(Dataset)
'data.frame':   10672 obs. of  5 variables:
  $ ctry    : Factor w/ 31 levels "Barbados","Belize",..: 21 21 5 22 18 
18 18 18 26 18 ...
  $ member  : Factor w/ 163 levels "","ADHOPIA, PREETI ",..: 150 19 19 
111 120 1 1 4 55 18 ...
  $ negative: int  0 1 0 1 1 1 1 0 0 0 ...
 >

For analysis, we'd like to subset the data to include only those that 
occur with frequency greater than a given
value, or the top 10 (say) in frequency, or the highest frequency 
categories accounting for 80% (say) of the
total.  I'm not sure how to do any of these in R.  Can anyone help?

-- 
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept. & Chair, Quantitative Methods
York University      Voice: 416 736-2100 x66249 Fax: 416 736-5814
4700 Keele Street    Web:http://www.datavis.ca
Toronto, ONT  M3J 1P3 CANADA

______________________________________________
R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.



More information about the R-help mailing list