# [R] sas to r

Frank E Harrell Jr f.harrell at vanderbilt.edu
Sat Jul 17 14:36:20 CEST 2004

```Greg Adkison wrote:
> I would be incredibly grateful to anyone who'll help me translate some
> SAS code into R code.
>
> Say for example that I have a dataset named "dat1" that includes five
> variables:  wshed, site, species, bda, and sla.  I can calculate with the
> following SAS code the mean, CV, se, and number of observations of
> "bda" and "sla" for each combination of "wshed," "species," and "site,"
> restricting the species considered to only three of several species in
> dat1 (b, c, and p).  Moreover, I can output these calculations and
> grouping variables to a dataset named "dat2" that will reside in RAM
> and include the variables  wshed, site, species, mBdA, msla, cBda,
> sBdA, ssla, nBda, and nsla.
>
> proc sort data=dat1;
>   by wshed site species;
> proc means data=dat1 noprint mean cv stderr n;
>   by wshed site species;
>   where species in ('b', 'c', 'p');
>   var BdA sla;
>   output out=dat2
>     mean=mBdA msla
>     cv=cBdA csla
>     stderr=sBdA ssla
>     n=nBdA nsla;
>
> Thanks,
> Greg

The following handles any number of analysis variables, with automatic
naming of all statistics computed from them.  It requires the Hmisc package.

# Generate some data.  Put one NA in sla.
set.seed(1)
dat1 <- expand.grid(wshed=1:2, site=c('A','B'),
species=c('a','b','c','p'),
reps=1:10)
n <- nrow(dat1)
dat1 <- transform(dat1,
BdA = rnorm(n, 100, 20),
sla = c(rnorm(n-1, 200, 30), NA))
# Can use upData function in Hmisc in place of transform

# Summarization function, per stratum, for a matrix of analysis
# variables
g <- function(y) {
n <- apply(y, 2, function(z) sum(!is.na(z)))
m <- apply(y, 2, mean, na.rm=TRUE)
s <- apply(y, 2, sd,   na.rm=TRUE)
cv <- s/m
se <- s/sqrt(n)
w <- c(m, cv, se, n)
names(w) <- t(outer(c('m','c','s','n'), colnames(y), paste, sep=''))
w
}
library(Hmisc)
dat2 <-  with(dat1,
summarize(cbind(BdA, sla),
llist(wshed, site, species),
g,
subset=species %in% c('b','c','p'),
stat.name='mBdA')
)

options(digits=3)
dat2  # is a data frame

wshed site species  mBdA msla  cBdA   csla sBdA  ssla nBdA nsla
1      1    A       b 100.5  195 0.133 0.1813 4.23 11.20   10   10
2      1    A       c  99.7  206 0.101 0.1024 3.17  6.68   10   10
3      1    A       p 101.4  188 0.239 0.1580 7.65  9.39   10   10
4      1    B       b 109.9  203 0.118 0.1433 4.09  9.21   10   10
5      1    B       c  98.4  221 0.193 0.1250 6.01  8.72   10   10
6      1    B       p 102.9  203 0.216 0.1446 7.03  9.29   10   10
7      2    A       b  95.8  195 0.241 0.2011 7.31 12.40   10   10
8      2    A       c  98.7  207 0.194 0.1274 6.04  8.33   10   10
9      2    A       p 102.2  191 0.217 0.1709 7.01 10.31   10   10
10     2    B       b  97.8  191 0.235 0.2079 7.27 12.58   10   10
11     2    B       c 100.9  194 0.164 0.0987 5.24  6.07   10   10
12     2    B       p 103.0  209 0.144 0.0769 4.69  5.35   10    9

--
Frank E Harrell Jr   Professor and Chair           School of Medicine
Department of Biostatistics   Vanderbilt University

```