[R] Hierarchical factors

Ista Zahn istazahn at gmail.com
Mon May 3 19:22:59 CEST 2010


Hi Marshall,
I'm not aware of any packages that implement these features as you
described them. But most of the tasks are already fairly easy in R --
see below.
On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <marsh at uri.edu> wrote:
>
> Thanks for getting back so quickly Ista,
>
> I was actually casting about for any examples of R software that deals with this kind of structure. But your question is a good one. Here are a few things I'd like to be able to do:
>
> Store data in R at the finest level of detail but easily refer to higher levels of aggregation. If the data include such higher levels, this is trivial, but otherwise I'd like to aggregate fairly easily. The following is not functioning code, but it should give you the idea:
>
> start with a data frame (call it d) having row.names = to the 6 digit NAICS code and columns w/ various variables, assume one is named employment.
> d[,"employment"]                       # Would print all employment data
> d["441222","employment"]        # Would print only Boat Dealer employment
> d["44","employment"]                 # Would print total employment for Retail Trade


d[,"employment"] #prints all employment data
d[rownames(d) == "441222","employment"] #prints only boat dealer employment
# total employment for retail trade (all codes starting with 44)
sum(d[grep("^44", rownames(d)),"employment"])
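
If prefix lookups come up often, the pattern above can be wrapped in a small helper. This is only a minimal sketch, assuming the row names are character NAICS codes; the helper name by_prefix and the toy data frame (with invented employment figures) are made up for illustration.

```r
# Toy data: row names are 6-digit NAICS codes (employment figures are invented)
d <- data.frame(
  employment = c(120, 80, 200, 50),
  row.names  = c("441222", "441110", "445110", "423110")
)

# Hypothetical helper: rows of `df` whose row name starts with `prefix`
by_prefix <- function(df, prefix) {
  df[startsWith(rownames(df), prefix), , drop = FALSE]
}

by_prefix(d, "44")                    # all retail trade rows
sum(by_prefix(d, "44")$employment)    # total retail trade employment
```

startsWith() does a literal comparison, so there is no need to anchor or escape a regular expression as with grep().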

>
> Recursive nesting. I'm not sure how to convey this except with examples. Suppose the data frame also has a "wages" column with average weekly wages in the industry, and the industry code is also a factor variable (industry). So a simple analysis of variance might look like:
>
>                     w <- aov(wages ~ industry, d)
>
>         But now what I'd like to do is to break this down within 2-digit sectors. Assuming the data frame has another variable, industry2, this would look like:
>
>                     w <- aov(wages ~ industry2/industry, d)
>
>          But what if we either (a) don't want to bother creating separate variables for each level of aggregation in industry or (b) want to extend the model formula language to include various nesting strategies. This might look like:
>
>                     w <- aov(wages ~ industry//*)                    # Nest all meaningful levels industry/industry2/industry3/industry4/industry5/industry6. If the coding system skips some levels, R is smart enough to omit the skipped levels.
>                     w <- aov(wages ~ industry//levels 2,4,6)     # I'm using "//" as a hypothetical extension to the model language that is followed by a "levels" keyword and then a list of levels within the hierarchy. This example would expand
>                                                                                        # to aov(wages ~ industry2/industry4/industry6)
>
>         One could extend this last example to include a notation allowing the analysis to be repeated at varying levels of depth (e.g., industry||2,6 would repeat the ANOVA for industry2 and industry6).
>

I can see how that might be useful. But it is easy enough to split the
variables out, for example (assuming that each level consists of two
digits):

  d$ind1 <- substr(rownames(d), 1,2)
  d$ind2 <- substr(rownames(d), 3,4)
  d$ind3 <- substr(rownames(d), 5,6)
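
Note that columns split this way hold only the digits belonging to each level, so the same value can appear under different parents. If labels that are unique on their own are preferred, cumulative prefixes are an alternative. A sketch with invented wage data; add_levels and the naics* column names are hypothetical, not part of any package.

```r
# Toy data: row names are 6-digit NAICS codes (wages are invented)
d <- data.frame(
  wages     = c(700, 650, 500, 520, 900, 880),
  row.names = c("441222", "441110", "445110", "445120", "423110", "423120")
)

# Hypothetical helper: add one factor column per requested code length,
# each holding the first n digits of the row name
add_levels <- function(df, digits = c(2, 4, 6)) {
  for (n in digits) {
    df[[paste0("naics", n)]] <- factor(substr(rownames(df), 1, n))
  }
  df
}

d <- add_levels(d)
d[, c("naics2", "naics4", "naics6")]

# A nested formula like the one in the question, built from column names
f <- reformulate(paste(c("naics2", "naics4"), collapse = "/"), response = "wages")
w <- aov(f, data = d)
```

Building the formula from a vector of column names gets close to the "levels 2,4" idea without extending the formula language.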


> Since the factor hierarchy is completely nested (i.e., every 6-digit industry is below a 5 digit industry), a single function can operate on the codes recursively. Three variants come to mind. In the first, we'd use some kind of apply function to drill down to a certain level and return a list of results, one for each level:
>
>                   means <- drill(wages,industry,mean)                        # Would return a list. The first component would be a vector of mean wages for industries at the 2-digit level, the second, a vector for the 3-digit level, etc.
>                   means <- drill(wages,industry,mean,maxlvl=3)         # Would stop at the 3rd level of the hierarchy (4-digit code). One could also imagine a maxdigits option as an alternative (maxdigits = y means stop at the y-digit level)
>

Again, I can see how this would be useful, but it's already pretty
easy (once we have split out the grouping variables) to do something
like

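# ind2 and ind3 hold only their own digits, so group them within their parents
grp.means <- list(
  l1 = aggregate(wages ~ ind1, data = d, FUN = mean),
  l2 = aggregate(wages ~ ind1 + ind2, data = d, FUN = mean),
  l3 = aggregate(wages ~ ind1 + ind2 + ind3, data = d, FUN = mean)
)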
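
Something close to the drill() function from the question can be sketched in a few lines. This is an illustration under assumptions, not an existing function: drill and its digits argument are hypothetical (digits plays the role of the proposed maxdigits option), and the wage values are invented.

```r
# Hypothetical drill(): apply `fun` to `x` within every code prefix,
# once for each prefix length given in `digits`
drill <- function(x, codes, fun, digits = 2:6) {
  codes <- as.character(codes)
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), fun))
  names(out) <- paste0("digits", digits)
  out
}

# Toy data (invented)
wages <- c(700, 650, 500, 520)
codes <- c("441222", "441110", "445110", "423110")

means <- drill(wages, codes, mean, digits = c(2, 3))
means$digits2   # mean wages by 2-digit sector
means$digits3   # mean wages by 3-digit subsector
```

Because the grouping is done on cumulative prefixes of the full code, no extra columns have to be created first.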

I know this wasn't what you were looking for (as I said, I'm not aware
of any package that implements the functionality you describe). But
the existing facilities in R are quite flexible, and handling this
kind of data in R is already fairly straightforward.

Best,
Ista

> Second, suppose we have a data frame like d, only this time it's a time series (each row is a different date). Now we might want to generate vectors of the rate of change in employment at each industry level. It might look like:
>
>     rate <- function(x) { (x - lag(x))/lag(x) }
>     rates <- list()
>     i <- 1
>     for (j in levels(industry)) {                                  # The levels function parses the hierarchical factor into the various levels of its coding system
>         rates[[i]] <- rate(employment[, level(industry) == j])     # The level function sets a particular one of these levels
>         i <- i + 1
>     }
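
A runnable variant of this idea, offered as a sketch only. It assumes a numeric matrix with one row per date and one column per 6-digit code; emp and level_rates are hypothetical names and the numbers are invented. Base lag() adjusts the time attributes of a series rather than shifting the values of a plain vector, so the sketch uses diff() instead.

```r
# Toy employment matrix: rows are dates, columns are 6-digit codes (numbers invented)
emp <- cbind("441222" = c(100, 110, 121),
             "441110" = c(200, 210, 231),
             "423110" = c( 50,  55,  66))

# Period-over-period rate of change of a numeric vector
rate <- function(x) diff(x) / head(x, -1)

# Hypothetical helper: total employment by n-digit prefix, then its rate of change
level_rates <- function(m, n) {
  totals <- t(rowsum(t(m), substr(colnames(m), 1, n)))
  apply(totals, 2, rate)
}

rates <- lapply(c(2, 4, 6), function(n) level_rates(emp, n))
names(rates) <- paste0("digits", c(2, 4, 6))
rates$digits2   # one column per 2-digit sector, one row per period
```

rowsum() does the aggregation to each level, so the same rate() function can be reused unchanged at every depth.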
>
> A third variant would be a genuinely recursive function that keeps on calling itself at each level of the factor until it has either reached a pre-specified depth or exhausted all levels of the factor.
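
The recursive variant could look roughly like the following. Again a sketch under assumptions: drill_rec and its arguments are hypothetical, and the data are invented. It moves one digit deeper on each call and stops at maxdigits or when the codes run out of digits.

```r
# Hypothetical recursive helper: apply `fun` at the current prefix length,
# then call itself one digit deeper until `maxdigits` or the code length is reached
drill_rec <- function(x, codes, fun, digits = 2, maxdigits = 6) {
  codes <- as.character(codes)
  if (digits > maxdigits || digits > max(nchar(codes))) {
    return(list())
  }
  here <- tapply(x, substr(codes, 1, digits), fun)
  c(setNames(list(here), paste0("digits", digits)),
    drill_rec(x, codes, fun, digits + 1, maxdigits))
}

# Toy data (invented)
wages <- c(700, 650, 500, 520)
codes <- c("441222", "441110", "445110", "423110")

res <- drill_rec(wages, codes, mean, maxdigits = 4)
names(res)
```

Each element of the result is the summary at one depth, so the list has one entry per level visited.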
>
> I hope this gives you a good idea of the sorts of things one might do with hierarchical factors.
>
>     Marsh Feldman
>
>
>
> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>
> Hi Marshall,
> What exactly do you mean by "handles this kind of data structure"?
> What do you want R to do?
>
> Best,
> Ista
>
> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman <marsh at uri.edu> wrote:
>
>
> Hello,
>
> Hierarchical factors are a very common data structure. For instance, one
> might have municipalities within states within countries within
> continents. Other examples include occupational codes, biological
> species, software types (R within statistical software within analytical
> software), etc.
>
> Such data structures commonly use hierarchical coding systems. For
> example, the 2007 North American Industry Classification System (NAICS)
> <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007> has twenty
> two-digit codes (e.g., 42 = Wholesale trade), within each of these
> varying numbers of 3-digit codes (e.g., 423 = Merchant wholesalers,
> durable goods), then varying numbers of 4-digit codes (4231 = Motor
> Vehicle and Motor Vehicle Parts and Supplies Merchant Wholesalers), then
> varying numbers of five-digit codes, varying numbers of six-digit codes,
> etc. At the lowest level (longest code) one can readily tell all the
> higher levels. For example, 441222 is "Boat Dealers" who are part of
> 44122, "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is
> part of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
> Vehicle and Parts Dealers), which is part of 44 (Retail Trade). (The US
> Census Bureau has extended the 6-digit NAICS to an even more
> fine-grained 10-digit system.)
>
> I haven't seen any R packages or sample code that handles this kind of
> data, but I don't want to reinvent the wheel and would rather stand on
> the shoulders of you giants. Is there any package or other R-based
> software out there that handles this kind of data structure?
>
>     Thanks,
>     Marsh Feldman
>
> --
> Dr. Marshall Feldman, PhD
> Director of Research and Academic Affairs
> Center for Urban Studies and Research
> The University of Rhode Island
> email: marsh @ uri .edu (remove spaces)
>
> Contact Information:
>
> Kingston:
>
> 202 Hart House
> Charles T. Schmidt Labor Research Center
> The University of Rhode Island
> 36 Upper College Road
> Kingston, RI 02881-0815
> tel. (401) 874-5953:
> fax: (401) 874-5511
>
> Providence:
>
> 206E Shepard Building
> URI Feinstein Providence Campus
> 80 Washington Street
> Providence, RI 02903-1819
> tel. (401) 277-5218
> fax: (401) 277-5464



--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org


