[R] Casting lists to data.frames, analog to SAS

Marc Schwartz marc_schwartz at comcast.net
Wed Jan 14 22:42:10 CET 2009


on 01/14/2009 02:51 PM Matthew Pettis wrote:
> I have a specific question and a general question.
> 
> Specific Question: I want to do an analysis on a data frame by 2 or more
> class variables (i.e., use 2 or more columns in a dataframe to do
> statistical classing).  Coming from SAS, I'm used to being able to take a
> data set and have the output of the analysis in a dataset for further
> manipulation.  I have a data set with vote totals, with one column being the
> office name being voted on, and the other being the party of the candidate.
> My votes are in the column "vc.n".  I did the analysis I want with:
> 
> work <- by(sd62[,"vc.n"], sd62[,c("office.nm","party.abbr")], sum)
> 
> the str() output of work looks like:
> 
>> str(work)
>  'by' int [1:9, 1:11] NA 30 NA NA 0 0 0 NA 33 25678 ...
>  - attr(*, "dimnames")=List of 2
>   ..$ office.nm : chr [1:9] "ATTORNEY GENERAL" "GOVERNOR & LT GOVERNOR"
> "SECRETARY OF STATE" "STATE AUDITOR" ...
>   ..$ party.abbr: chr [1:11] "CP" "DFL" "DFL2" "GP" ...
>  - attr(*, "call")= language by.default(data = sd62[, "vc.n"], INDICES =
> sd62[, c("office.nm",      "party.abbr")], FUN = sum)
> 
> 
> 
> 
> work is now a list.  I'd really like to have work be a data frame with 3
> columns: The rows of the first two columns show the office and party levels
> being considered, and the third being the sum of the votes for that level
> combination.  How do I cast this list/output into a data frame?  using
> 'as.data.frame' doesn't work.
> 
> General Question: I assume the answer to the specific question is dependent
> on my understanding list objects and accessing their attributes.  Can anyone
> point me to a good, throrough treatment of these R topics?  Specifically how
> to read and interpret the output of the str(), and attributes() function,
> how to extract the values of the 'by' output object into a data frame, etc.?
> 
> Thanks,
> Matt

Matt,

Welcome to R.

The help pages for each function, while they can be intentionally terse,
are a good first place to look. Many will include links/references to
related sources.

"An Introduction to R" is a good general place to start. A more thorough
treatment is in the "R Language Definition" manual. There are also a
plethora of contributed documents:

  http://cran.r-project.org/other-docs.html

and books on R and using R within specific domains:

  http://www.r-project.org/doc/bib/R-books.html


There are (at least) three ways to generate summary statistics based
upon multi-level groupings. These include by(), tapply() and aggregate().

The key difference between the three is the class/structure of the
results object and the print (output) method. In the specific case of
aggregate(), it must also return a scalar. Thus for example, unlike with
by() and tapply(), you cannot use summary(), which returns multiple values.

Thus the choice for which approach to take, to an extent, is founded on
what you may subsequently do with the data.

As an example, using the same set of data (warpbreaks):

> str(warpbreaks)
'data.frame':	54 obs. of  3 variables:
 $ breaks : num  26 30 54 25 70 52 51 26 67 18 ...
 $ wool   : Factor w/ 2 levels "A","B": 1 1 1 1 1 1 1 1 1 1 ...
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 1 1 1 1 1 1 1 2 ...


# Use by()

> by(warpbreaks[, 1],
     list(wool = warpbreaks$wool, tension = warpbreaks$tension), sum)
wool: A
tension: L
[1] 401
------------------------------------------------------
wool: B
tension: L
[1] 254
------------------------------------------------------
wool: A
tension: M
[1] 216
------------------------------------------------------
wool: B
tension: M
[1] 259
------------------------------------------------------
wool: A
tension: H
[1] 221
------------------------------------------------------
wool: B
tension: H
[1] 169



Note, because the result of using by() is at its core, a matrix/table,
you can also do the following, explicitly using the print method for a
table:

> print.table(by(warpbreaks[, 1],
              list(wool = warpbreaks$wool,
                   tension = warpbreaks$tension), sum))
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169


which gives you printed output in the same format as tapply() below,
without altering the structure of the result itself.


# tapply() directly gives you a tabular output

> tapply(warpbreaks[, 1],
         list(wool = warpbreaks$wool, tension = warpbreaks$tension),
         sum)
    tension
wool   L   M   H
   A 401 216 221
   B 254 259 169



Note that the structure of the result from by() and the result from
tapply() are quite similar:

> str(by(warpbreaks[, 1],
      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 by [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"
 - attr(*, "call")= language by.default(data = warpbreaks[, 1], INDICES
= list(wool = warpbreaks$wool,      tension = warpbreaks$tension), FUN =
sum)


> str(tapply(warpbreaks[, 1],
      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
 num [1:2, 1:3] 401 254 216 259 221 169
 - attr(*, "dimnames")=List of 2
  ..$ wool   : chr [1:2] "A" "B"
  ..$ tension: chr [1:3] "L" "M" "H"


Both are at their core, a 2 x 3 matrix.

The key difference is in the 'class' of the result, which affects
subsequent operations, such as the print method used.



# aggregate() gives you a data frame, with the summary statistic as the
# 'x' column

> aggregate(warpbreaks[, 1],
           list(wool = warpbreaks$wool, tension = warpbreaks$tension),
           sum)
  wool tension   x
1    A       L 401
2    B       L 254
3    A       M 216
4    B       M 259
5    A       H 221
6    B       H 169


> str(aggregate(warpbreaks[, 1],
      list(wool = warpbreaks$wool, tension = warpbreaks$tension),
      sum))
'data.frame':	6 obs. of  3 variables:
 $ wool   : Factor w/ 2 levels "A","B": 1 2 1 2 1 2
 $ tension: Factor w/ 3 levels "L","M","H": 1 1 2 2 3 3
 $ x      : num  401 254 216 259 221 169


Thus, bottom line, given your intended application, I would suggest
using aggregate() rather than by().

HTH,

Marc Schwartz




More information about the R-help mailing list