[R] Hierarchical factors
David Winsemius
dwinsemius at comcast.net
Thu May 6 05:29:17 CEST 2010
I think you are perhaps unintentionally obscuring two issues. One is
whether R might have the statistical functions to deal with such an
arrangement, and here "mixed models" would be the phrase you ought to
be watching for, while the other would be whether it would have
pre-written data management functions that would directly support the
particular data layout you might be getting from public-access gov't
files. The second is what I _thought_ you were soliciting in your
original posting. I was a bit surprised that no one mentioned the
survey package, since I have seen it used in such situations, but I
cannot track down the citation at the moment. You might want to look
at Gelman's blogs:
http://www.stat.columbia.edu/~cook/movabletype/archives/2009/07/my_class_on_sur.html
See also work on nested case within cohort designs:
http://aje.oxfordjournals.org/cgi/content/full/kwp055v1
And Damico's article:
"Transitioning to R: Replicating SAS, Stata, and SUDAAN Analysis
Techniques in Health Policy Data"
R Journal, 2009, n 2.
http://journal.r-project.org/archive/2009-2/RJournal_2009-2_Damico.pdf
--
David.
On May 5, 2010, at 10:23 PM, Marshall Feldman wrote:
> Thanks for sharing this, Ista.
>
> I've come to the conclusion that R doesn't have what I'm looking for,
> either in the base or the packages.
>
> Although your examples are insightful, the examples we've been
> discussing are deliberately easier than what one would expect in most
> serious applications. Imagine for instance that we're studying wage
> structures of industries in different geographic labor markets. We
> therefore might have four variables: wages, industries, occupations,
> and places. We might want to see if wage differentials are more or
> less constant or if they are higher in some geographic areas than in
> others. Since industries, occupations, and places are typically coded
> hierarchically as we've been discussing, we might want to figure out
> how to examine different wage levels within industries, etc. Doing this
> manually would require lots of w
> whereas conceptually the
>
> On 5/4/2010 6:00 AM,
>> Message: 49
>> Date: Mon, 3 May 2010 13:22:59 -0400
>> From: Ista Zahn <istazahn at gmail.com>
>> To: Marshall Feldman <marsh at uri.edu>
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Hierarchical factors
>> Message-ID:
>> <x2xf55e7cf51005031022se4c46967s174efeef95331abc at mail.gmail.com>
>> Content-Type: text/plain; charset=ISO-8859-1
>>
>> Hi Marshall,
>>
>> I'm not aware of any packages that implement these features as you
>> described them. But most of the tasks are already fairly easy in R --
>> see below.
>> On Mon, May 3, 2010 at 11:18 AM, Marshall Feldman <marsh at uri.edu>
>> wrote:
>>>>
>>>> Thanks for getting back so quickly Ista,
>>>>
>>>> I was actually casting about for any examples of R software that
>>>> deals with this kind of structure. But your question is a good
>>>> one. Here are a few things I'd like to be able to do:
>>>>
>>>> Store data in R at the finest level of detail but easily refer to
>>>> higher levels of aggregation. If the data include such higher
>>>> levels, this is trivial, but otherwise I'd like to aggregate
>>>> fairly easily. The following is not functioning code, but it
>>>> should give you the idea:
>>>>
>>>> Start with a data frame (call it d) whose row.names are the
>>>> 6-digit NAICS codes and whose columns hold various variables;
>>>> assume one is named employment.
>>>>
>>>> d[,"employment"]          # Would print all employment data
>>>> d["441222","employment"]  # Would print only Boat Dealer employment
>>>> d["44","employment"]      # Would print total employment for Retail Trade
>>>
>> d[,"employment"] # prints all employment data
>> d[rownames(d) == "441222","employment"] # prints only boat dealer employment
>> sum(d[grep("^44", rownames(d)),"employment"]) # prints total employment for retail trade
>>
>>
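A minimal sketch of the prefix lookup discussed above, assuming a data frame whose row names are 6-digit codes. The employment figures are made up for illustration, and `prefix_total()` is a hypothetical helper, not a function from base R or any package.

```r
# Toy data: row names are 6-digit NAICS-style codes; the values are invented.
d <- data.frame(
  employment = c(120, 80, 45, 300),
  row.names  = c("441222", "441110", "442110", "423110")
)

# Hypothetical helper: sum one column over every row whose code
# starts with `prefix` (e.g. "44" for all of Retail Trade).
prefix_total <- function(data, prefix, column) {
  hits <- startsWith(rownames(data), prefix)
  sum(data[hits, column])
}

retail_total <- prefix_total(d, "44", "employment")      # three rows match
boat_total   <- prefix_total(d, "441222", "employment")  # one row matches
```

Matching on a literal prefix with `startsWith()` avoids having to build a regular expression for each code.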
>>>>
>>>> Recursive nesting. I'm not sure how to convey this except with
>>>> examples. Suppose the data frame also has a "wages" column with
>>>> average weekly wages in the industry, and the industry code is
>>>> also a factor variable (industry). So a simple analysis of
>>>> variance might look like:
>>>>
>>>>     w <- aov(wages ~ industry, d)
>>>>
>>>> But now what I'd like to do is to break this down within
>>>> 2-digit sectors. Assuming the data frame has another variable,
>>>> industry2, this would look like:
>>>>
>>>>     w <- aov(wages ~ industry2/industry)
>>>>
>>>> But what if we either (a) don't want to bother creating
>>>> separate variables for each level of aggregation in industry or
>>>> (b) want to extend the model formula language to include
>>>> various nesting strategies? This might look like:
>>>>
>>>>     w <- aov(wages ~ industry//*)
>>>>     # Nest all meaningful levels: industry/industry2/industry3/
>>>>     # industry4/industry5/industry6. If the coding system skips
>>>>     # some levels, R is smart enough to omit the skipped levels.
>>>>
>>>>     w <- aov(wages ~ industry//levels 2,4,6)
>>>>     # I'm using "//" as a hypothetical extension to the model
>>>>     # language that is followed by a "levels" keyword and then a
>>>>     # list of levels within the hierarchy. This example would
>>>>     # expand to aov(wages ~ industry2/industry4/industry6)
>>>>
>>>> One could extend this last example to include a notation
>>>> allowing the analysis to be repeated at varying levels of depth
>>>> (e.g., industry||2,6 would repeat the ANOVA for industry2 and
>>>> industry6).
>>>>
>>>
>> I can see how that might be useful. But it is easy enough to split
>> the variables out, for example (assuming that each level consists of
>> two digits):
>>
>> d$ind1<- substr(rownames(d), 1,2)
>> d$ind2<- substr(rownames(d), 3,4)
>> d$ind3<- substr(rownames(d), 5,6)
>>
>>
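One way the hypothetical "//levels 2,4,6" notation could be approximated with existing tools is sketched below. It assumes 6-digit codes in the row names; `add_levels()` and the `industryN` column names are illustrative choices, not an existing API, and the wage values are made up. Unlike the two-digit slices above, each column here holds the cumulative prefix, so a label such as "4412" stays unique across sectors.

```r
# Toy data: invented weekly wages keyed by 6-digit codes.
d <- data.frame(
  wages     = c(900, 950, 700, 720, 1100, 1150),
  row.names = c("441222", "441228", "442110", "442210", "423110", "423120")
)

# Hypothetical helper: add one factor column per requested code length,
# named <stem><n>, holding the first n digits of each row name.
add_levels <- function(data, digits, stem = "industry") {
  for (n in digits) {
    data[[paste0(stem, n)]] <- factor(substr(rownames(data), 1, n))
  }
  data
}

d <- add_levels(d, c(2, 4, 6))

# Build the nested formula wages ~ industry2/industry4/industry6 from the
# same vector of digits, instead of typing it out by hand.
nested <- as.formula(
  paste("wages ~", paste0("industry", c(2, 4, 6), collapse = "/"))
)
```

The resulting formula can then be passed to a model function such as `aov(nested, data = d)`, given data with enough observations per cell to estimate the nested terms.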
>>
>>>> Since the factor hierarchy is completely nested (i.e., every
>>>> 6-digit industry is below a 5-digit industry), a single function
>>>> can operate on the codes recursively. Three variants come to
>>>> mind. In the first, we'd use some kind of apply function to drill
>>>> down to a certain level and return a list of results, one for
>>>> each level:
>>>>
>>>>     means <- drill(wages,industry,mean)
>>>>     # Would return a list. The first component would be a vector
>>>>     # of mean wages for industries at the 2-digit level, the
>>>>     # second, a vector for the 3-digit level, etc.
>>>>
>>>>     means <- drill(wages,industry,mean,maxlvl=3)
>>>>     # Would stop at the 3rd level of the hierarchy (4-digit code).
>>>>     # One could also imagine a maxdigits option as an alternative
>>>>     # (maxdigits = y means stop at the y-digit level).
>>>>
>>>
>> Again, I can see how this would be useful, but it's already pretty
>> easy (once we have split out the grouping variables) to do something
>> like
>>
>> grp.means<- list(
>> l1 = aggregate(d$wages, list(d$ind1), mean),
>> l2 = aggregate(d$wages, list(d$ind2), mean),
>> l3 = aggregate(d$wages, list(d$ind3), mean)
>> )
>>
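The drill() idea can be sketched in a few lines of base R. This is only an illustration under stated assumptions: `drill()` does not exist in R, the `digits` argument stands in for the proposed maxlvl/maxdigits options, and the wage figures are invented.

```r
# Toy data: invented weekly wages by 6-digit industry code.
d <- data.frame(
  industry = c("441222", "441228", "442110", "423110"),
  wages    = c(900, 1000, 700, 1100),
  stringsAsFactors = FALSE
)

# Hypothetical drill(): apply `fun` to `x` within each code prefix,
# once per requested prefix length, and return a named list.
drill <- function(x, codes, fun, digits = 2:6) {
  codes <- as.character(codes)
  out <- lapply(digits, function(n) tapply(x, substr(codes, 1, n), fun))
  names(out) <- paste0("digits", digits)
  out
}

# Stop after the 3-digit level.
means <- drill(d$wages, d$industry, mean, digits = 2:3)
```

Each list component is a named vector, so `means$digits3["441"]` picks out the mean wage for codes beginning with 441.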
>> I know this wasn't what you were looking for (as I said, I'm not
>> aware of any package that implements the functionality you
>> describe). But the existing facilities in R are quite flexible, and
>> handling this kind of data in R is already fairly straightforward.
>>
>> Best,
>> Ista
>>
>>
>>>> Second, suppose we have a data frame like d, only this time it's
>>>> a time series (each row is a different date). Now we might want
>>>> to generate vectors of the rate of change in employment at each
>>>> industry level. It might look like:
>>>>
>>>>     rate <- function(x) { (x - lag(x))/lag(x) }
>>>>     rates <- list()
>>>>     i <- 1
>>>>     for (j in levels(industry)) {
>>>>         # The levels function parses the hierarchical factor into
>>>>         # the various levels of its coding system
>>>>         rates[[i]] <- rate(employment[,level(industry) == j])
>>>>         # The level function sets a particular one of these levels
>>>>         i <- i + 1
>>>>     }
>>>>
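A rough sketch of the rate-of-change idea using only base R. It assumes employment is stored as a matrix with one row per date and one column per 6-digit code; that layout, the numbers, and the `level_rates()` helper are all assumptions made for illustration.

```r
# Toy data: rows are years, columns are 6-digit codes; values are invented.
emp <- matrix(
  c(100, 110, 121,
    200, 220, 242,
     50,  50,  50),
  nrow = 3,
  dimnames = list(c("2008", "2009", "2010"),
                  c("441222", "441110", "423110"))
)

# Period-over-period rate of change of a numeric vector.
rate <- function(x) diff(x) / head(x, -1)

# Hypothetical helper: total the columns that share the first n digits,
# then compute the rate of change of each total over time.
level_rates <- function(m, n) {
  totals <- t(rowsum(t(m), group = substr(colnames(m), 1, n)))
  apply(totals, 2, rate)
}

rates2 <- level_rates(emp, 2)  # one column per 2-digit sector
rates6 <- level_rates(emp, 6)  # one column per 6-digit industry
```

Calling `lapply(2:6, function(n) level_rates(emp, n))` would give one matrix of rates per level of the hierarchy, which is close to the list the loop above is meant to build.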
>>>> A third variant would be a genuinely recursive function that
>>>> keeps on calling itself at each level of the factor until it has
>>>> either reached a pre-specified depth or exhausted all levels of
>>>> the factor.
>>>>
>>>> I hope this gives you a good idea of the sorts of things one
>>>> might do with hierarchical factors.
>>>>
>>>> Marsh Feldman
>>>>
>>>>
>>>>
>>>> On 5/3/2010 9:57 AM, Ista Zahn wrote:
>>>>
>>>> Hi Marshall,
>>>> What exactly do you mean by "handles this kind of data structure"?
>>>> What do you want R to do?
>>>>
>>>> Best,
>>>> Ista
>>>>
>>>> On Mon, May 3, 2010 at 9:44 AM, Marshall Feldman<marsh at uri.edu>
>>>> wrote:
>>>>
>>>>
>>>> Hello,
>>>>
>>>> Hierarchical factors are a very common data structure. For
>>>> instance, one might have municipalities within states within
>>>> countries within continents. Other examples include occupational
>>>> codes, biological species, software types (R within statistical
>>>> software within analytical software), etc.
>>>>
>>>> Such data structures commonly use hierarchical coding systems. For
>>>> example, the 2007 North American Industry Classification System
>>>> (NAICS)
>>>> <http://www.census.gov/cgi-bin/sssd/naics/naicsrch?chart=2007>
>>>> has twenty two-digit codes (e.g., 42 = Wholesale trade), within
>>>> each of these varying numbers of 3-digit codes (e.g., 423 =
>>>> Merchant wholesalers, durable goods), then varying numbers of
>>>> 4-digit codes (4231 = Motor Vehicle and Motor Vehicle Parts and
>>>> Supplies Merchant Wholesalers), then varying numbers of five-digit
>>>> codes, varying numbers of six-digit codes, etc. At the lowest
>>>> level (longest code) one can readily tell all the higher levels.
>>>> For example, 441222 is "Boat Dealers" who are part of 44122,
>>>> "Motorcycle, Boat, and Other Motor Vehicle Dealers," which is part
>>>> of 4412 (Other Motor Vehicle Dealers), which is part of 441 (Motor
>>>> Vehicle and Parts Dealers), which is part of 44 (Retail Trade).
>>>> (The US Census Bureau has extended the 6-digit NAICS to an even
>>>> more fine-grained 10-digit system.)
>>>>
>>>> I haven't seen any R packages or sample code that handles this
>>>> kind of data, but I don't want to reinvent the wheel and would
>>>> rather stand on the shoulders of you giants. Is there any package
>>>> or other R-based software out there that handles this kind of data
>>>> structure?
>>>>
>>>> Thanks,
>>>> Marsh Feldman
>>>>
>>>> [[alternative HTML version deleted]]
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Dr. Marshall Feldman, PhD
>>>> Director of Research and Academic Affairs
>>>> Center for Urban Studies and Research
>>>> The University of Rhode Island
>>>> email: marsh @ uri .edu (remove spaces)
>>>>
>>>> Contact Information:
>>>>
>>>> Kingston:
>>>>
>>>> 202 Hart House
>>>> Charles T. Schmidt Labor Research Center
>>>> The University of Rhode Island
>>>> 36 Upper College Road
>>>> Kingston, RI 02881-0815
>>>> tel. (401) 874-5953:
>>>> fax: (401) 874-5511
>>>>
>>>> Providence:
>>>>
>>>> 206E Shepard Building
>>>> URI Feinstein Providence Campus
>>>> 80 Washington Street
>>>> Providence, RI 02903-1819
>>>> tel. (401) 277-5218
>>>> fax: (401) 277-5464
>>>
>>
>> --
>> Ista Zahn
>> Graduate student
>> University of Rochester
>> Department of Clinical and Social Psychology
>> http://yourpsyche.org
>>
>
David Winsemius, MD
West Hartford, CT