[R] Casting lists to data.frames, analog to SAS

Matthew Pettis matthew.pettis at gmail.com
Wed Jan 14 22:46:30 CET 2009


Thank you very much -- this was very helpful for differentiating among
the aggregating methods!

Matt

On Wed, Jan 14, 2009 at 3:42 PM, Marc Schwartz
<marc_schwartz at comcast.net> wrote:
> on 01/14/2009 02:51 PM Matthew Pettis wrote:
>> I have a specific question and a general question.
>>
>> Specific Question: I want to do an analysis on a data frame by 2 or more
>> class variables (i.e., use 2 or more columns in a dataframe to do
>> statistical classing).  Coming from SAS, I'm used to being able to take a
>> data set and have the output of the analysis in a dataset for further
>> manipulation.  I have a data set with vote totals, with one column being the
>> office name being voted on, and the other being the party of the candidate.
>> My votes are in the column "vc.n".  I did the analysis I want with:
>>
>> work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
>>
>> the str() output of work looks like:
>>
>>> str(work)
>>  'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
>>  - attr(*, "dimnames")=List of 2
>>   ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
>> "SECRETARY OF STATE" "STATE AUDITOR" ...
>>   ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
>>  - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
>> sd62[, c("office.nm",      "party.abbr")], FUN = sum)
>>
>>
>>
>>
>> work is now a list.  I'd really like to have work be a data frame with 3
>> columns: The rows of the first two columns show the office and party levels
>> being considered, and the third being the sum of the votes for that level
>> combination.  How do I cast this list/output into a data frame?  using
>> 'as.data.frame' doesn't work.
>>
>> General Question: I assume the answer to the specific question is dependent
>> on my understanding list objects and accessing their attributes.  Can anyone
>> point me to a good, throrough treatment of these R topics?  Specifically how
>> to read and interpret the output of the str(), and attributes() function,
>> how to extract the values of the 'by' output object into a data frame, etc.?
>>
>> Thanks,
>> Matt
>
> Matt,
>
> Welcome to R.
>
> The help pages for each function, while they can be intentionally terse,
> are a good first place to look. Many will include links/references to
> related sources.
>
> "An Introduction to R" is a good general place to start. A more thorough
> treatment is in the "R Language Definition" manual. There are also a
> plethora of contributed documents:
>
>  http://cran.r-project.org/other-docs.html
>
> and books on R and using R within specific domains:
>
>  http://www.r-project.org/doc/bib/R-books.html
>
>
> There are (at least) three ways to generate summary statistics based
> upon multi-level groupings. These include by(), tapply() and aggregate().
>
> The key difference between the three is the class/structure of the
> results object and the print (output) method. In the specific case of
> aggregate(), it must also return a scalar. Thus for example, unlike with
> by() and tapply(), you cannot use summary(), which returns multiple values.
>
> Thus the choice for which approach to take, to an extent, is founded on
> what you may subsequently do with the data.
>
> As an example, using the same set of data (warpbreaks):
>
>> str(warpbreaks)
> 'data.frame':   54 obs. of  3 variables:
>  $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
>  $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
>  $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...
>
>
> # Use by()
>
>> by(warpbreaks[, 1],
>     list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
> wool: A
> tension: L
> [1] 401
> ------------------------------------------------------
> wool: B
> tension: L
> [1] 254
> ------------------------------------------------------
> wool: A
> tension: M
> [1] 216
> ------------------------------------------------------
> wool: B
> tension: M
> [1] 259
> ------------------------------------------------------
> wool: A
> tension: H
> [1] 221
> ------------------------------------------------------
> wool: B
> tension: H
> [1] 169
>
>
>
> Note, because the result of using by() is at its core, a matrix/table,
> you can also do the following, explicitly using the print method for a
> table:
>
>> print.table(by(warpbreaks[, 1],
>              list(wool = warpbreaks$wool,
>                   tension = warpbreaks$tension), sum))
>    tension
> wool   L   M   H
>   A 401 216 221
>   B 254 259 169
>
>
> which gives you printed output in the same format as tapply() below,
> without altering the structure of the result itself.
>
>
> # tapply() directly gives you a tabular output
>
>> tapply(warpbreaks[, 1],
>         list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>         sum)
>    tension
> wool   L   M   H
>   A 401 216 221
>   B 254 259 169
>
>
>
> Note that the structure of the result from by() and the result from
> tapply() are quite similar:
>
>> str(by(warpbreaks[, 1],
>      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>      sum))
>  by [1:2, 1:3] 401 254 216 259 221 169
>  - attr(*, "dimnames")=List of 2
>  ..$ wool   : chr [1:2] "A" "B"
>  ..$ tension: chr [1:3] "L" "M" "H"
>  - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
> = list(wool = warpbreaks$wool,      tension = warpbreaks$tension), FUN =
> sum)
>
>
>> str(tapply(warpbreaks[, 1],
>      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>      sum))
>  num [1:2, 1:3] 401 254 216 259 221 169
>  - attr(*, "dimnames")=List of 2
>  ..$ wool   : chr [1:2] "A" "B"
>  ..$ tension: chr [1:3] "L" "M" "H"
>
>
> Both are at their core, a 2 x 3 matrix.
>
> The key difference is in the 'class' of the result, which affects
> subsequent operations, such as the print method used.
>
>
>
> # aggregate() gives you a data frame, with the summary statistic as the
> # 'x' column
>
>> aggregate(warpbreaks[, 1],
>           list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>           sum)
>  wool tension   x
> 1    A       L 401
> 2    B       L 254
> 3    A       M 216
> 4    B       M 259
> 5    A       H 221
> 6    B       H 169
>
>
>> str(aggregate(warpbreaks[, 1],
>      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
>      sum))
> 'data.frame':   6 obs. of  3 variables:
>  $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
>  $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
>  $ x      : num  401 254 216 259 221 169
>
>
> Thus, bottom line, given your intended application, I would suggest
> using aggregate() rather than by().
>
> HTH,
>
> Marc Schwartz
>



-- 
One of the penalties for refusing to participate in politics is that
you end up being governed by your inferiors.
-- Plato

It is from the wellspring of our despair and the places that we are
broken that we come to repair the world.
-- Murray Waas




More information about the R-help mailing list