[R] Equivalent to Stata egen

Thu Apr 16 22:21:33 CEST 2009

On Apr 16, 2009, at 3:58 PM, Stas Kolenikov wrote:

> See, we just have different expectations of what is to be seen in the
> help system, and are used to different formats. Yes, Stata thinks of
> data as a rectangular array (although it stores it in memory, unlike
> SAS). The inputs to -egen-, as well as the values produced, depend on
> the particular function -fcn- and are described in subsections on
> those individual functions. That is mentioned at the top of the page.
> There is a pretty much standard syntax of most Stata commands (command
> name followed by variables it is applied to or expression to be
> computed followed by if conditions on observations followed by comma
> options), and -egen- more or less satisfies that syntax. A Stata user
> equipped with the basic concepts of the assignment command -generate-
> (which -egen- is said to extend) and variable lists (-varlist- here
> and there in the help file) would be able to make sense of this all.
>
> I would rather translate R's ave() to Stata's -by- expression. Not all
> of the -egen- functionality can be implemented via ave().

R has a by function which is a convenience wrapper for tapply. It will  
not necessarily produce an object with the same number of rows as the  
input, which is what I thought that egen was doing.
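
A minimal sketch of that difference, using made-up data (the data frame d
and its columns g and x are hypothetical, not taken from this thread).
tapply() gives one result per group, while ave() gives one value per input
row, which is the behavior I was guessing egen has:

```r
# Hypothetical toy data: a grouping variable g and a numeric variable x
d <- data.frame(g = c("a", "a", "b", "b", "b"),
                x = c(1, 3, 2, 4, 6))

# tapply() collapses to one value per group (by() behaves similarly)
group_means <- tapply(d$x, d$g, mean)

# ave() returns a vector as long as x, aligned row by row with d
d$x_mean <- ave(d$x, d$g)             # FUN defaults to mean
d$x_max  <- ave(d$x, d$g, FUN = max)  # any function can be supplied

print(group_means)
print(d)
```

Here group_means has two entries (one per group), while x_mean and x_max
each have five (one per row of d), so they can be stored as new columns.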

>
> Looks like terseness is a prerequisite to doing anything in R though.
> If I am telling you I am a newbie, the book abbreviations although
> standard to everybody on this list may not mean much to me. I could
> figure out "Regression Modeling Strategies" (although I was not
> thinking about it as a book on R -- I probably did not read it far
> enough :) ), and V&R is Venables & Ripley. Right?

Yes, and Chambers and Hastie wrote "Statistical Models in S".

The VR bundle is the way to get the MASS package (and IIRC three  
others).

The documentation and contributed pages are here:
http://cran.r-project.org/manuals.html
http://cran.r-project.org/other-docs.html

Harrell probably does not think of RMS as an R book either.

-- 
David Winsemius

>
> On 4/16/09, David Winsemius <dwinsemius at comcast.net> wrote:
>> Terse is OK by me as long as I get told what goes in (allowable data
>> types, argument names and effects) and what comes out. What seemed to
>> be lacking in that Stata doc for egen was a description of the purpose
>> or behavior, and then I could find no description of the values
>> produced. Perhaps it is because Stata has an approach that everything
>> is a rectangular array? Is everything assumed to create a new column
>> of data as in SAS?
>>
>> At any rate it looked to this casual non-user, reading that document,
>> that egen creates a new variable aligned with its argument variables
>> by applying various functions within groupings. That is pretty much
>> what ave does. "ave" is not restricted to mean as a functional
>> argument. As I said it was a guess.
>>
>> The texts I used to get up to speed in R are several downloaded from
>> the Contributed documents (including anything written by Venables),
>> V&R MASS v 2, Harrell's RMS, Sarkar's Lattice, Chambers&Hastie SMiS
>> and reading a lot of Q&A on this list.
>>
>> --
>> David Winsemius
>>
>> On Apr 16, 2009, at 11:57 AM, Stas Kolenikov wrote:
>>
>>
>>> http://www.stata.com/help.cgi?egen -- it creates new variables dealing
>>> with some special relatively non-standard tasks that don't boil down
>>> to one-line arithmetic expressions. For that reason, there will be
>>> no equivalent to -egen- in general, as it has so many functions that
>>> are so different. -rowtotal- is of course just a shorthand for sum(),
>>> except for treatment of missing values (ifelse(is.na(x),0,x)). But
>>> -anycount- is a moderately complicated double cycle over variables
>>> and list of values (40 lines of underlying Stata code, including
>>> parsing and labeling the resulting variables)... which will probably
>>> become a triple R cycle including the cycle over observations,
>>> although the latter can probably be avoided.
>>>
>>> Yes, R documentation looks extremely terse to me as a regular Stata
>>> user. I am used to seeing the concepts explained well, even in the
>>> help files, and certainly more so in the shelved books. As every
>>> option and every part of the syntax is given at least three to five
>>> sentences, and the most common uses are exemplified, I can usually
>>> figure out how to run a particular task relatively quickly. (The data
>>> management tricks, which is what Peter was asking about above, are
>>> probably an exception: you either know them, or you don't. In this
>>> example, I don't know the corresponding R tricks, although I can
>>> probably brute force the solution if I needed to.) The fraction of
>>> commands in R that I personally have been coming across that are
>>> comparably well documented is about a quarter. For the others, it is
>>> either guesswork+CRANning+googling around or "Forget it, I'll just go
>>> back to Stata to do it" after a few futile attempts. Maybe I just
>>> don't know where to look for the good stuff, but it is certainly
>>> outside R as a package+its documentation.
>>>
>
> -- 
> Stas Kolenikov, also found at http://stas.kolenikov.name
> Small print: I use this email account for mailing lists only.
>

David Winsemius, MD
Heritage Laboratories
West Hartford, CT