[R] SAS transport files and the foreign package
Frank E Harrell Jr
fharrell at virginia.edu
Sat Jan 18 16:07:03 CET 2003
Even though the FDA has no policies at all that limit our choices of statistical software, there is one de facto standard in place: reliance on the SAS transport file format for data submission (even though this format is deficient for this purpose, e.g., it does not even document value labels or units of measurement in a self-contained way). Because of the widespread use of SAS transport files in the pharmaceutical industry, clinical trial data analyses done by statistical centers like ours that receive data from companies often begin with SAS transport files. I have not had SAS on my machines in about 12 years, so it would be nice to be able to read binary transport files directly instead of having to run the slower sas.get function in the Hmisc library, which has to launch SAS to do its work.
The foreign package implements a quick way to read such files in its read.xport function. This function has some significant problems, which I reported to the developers some time ago, but fixes do not seem to be forthcoming, nor has the bug report been acknowledged. The developers have done great work in writing the foreign package (and many other awesome contributions to the community), so I don't fault them at all for being creative, busy people. I am writing this note to see if any C language-savvy R users have done their own fixes or would be willing to help the developers with these particular fixes.
The specific problems I have found are (1) a worrisome one in which plausible-looking but invalid data result from importing SAS numeric variables of length 3 bytes; and (2) corrupted results when the SAS transport file contains multiple SAS datasets. In addition, it would be great to have lookup.xport retrieve all SAS variable attributes, including PROC FORMAT VALUE names, so that factor variables could be created as is done automatically with read.spss in foreign.
Note there is also a problem with lookup.xport when the transport file contains multiple datasets. The documentation states that a list with a major element for each dataset will be created, but the second dataset is ignored. read.xport is likewise supposed to create a list of data frames for this case.
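To make problem (1) concrete, here is a minimal Python sketch of how a truncated numeric could be decoded. It assumes the transport format stores numerics as IBM hexadecimal floats (sign bit, 7-bit excess-64 base-16 exponent, 56-bit fraction) and that a variable of length 3 simply keeps the first 3 of the 8 bytes. The function name ibm_to_float and the "borrowed byte" misread are illustrative assumptions, not taken from the foreign source code.

```python
def ibm_to_float(raw: bytes) -> float:
    """Decode an IBM hexadecimal float stored in 2 to 8 bytes.

    Sketch only: SAS missing-value codes are not handled.
    """
    if not 2 <= len(raw) <= 8:
        raise ValueError("expected 2 to 8 bytes")
    padded = raw.ljust(8, b"\x00")        # restore dropped low-order bytes as zeros
    sign = -1.0 if padded[0] & 0x80 else 1.0
    exponent = (padded[0] & 0x7F) - 64    # base-16 exponent, excess-64
    fraction = int.from_bytes(padded[1:], "big") / 2.0 ** 56
    return sign * fraction * 16.0 ** exponent


race = bytes([0x41, 0x20, 0x00])          # RACE=2 stored in 3 bytes
age = bytes([0x42, 0x1E, 0x00, 0x00])     # AGE=30 stored in 4 bytes

print(ibm_to_float(race))                 # zero-padded decode: 2.0
print(ibm_to_float(age))                  # 30.0

# Hypothetical misread: the 4th byte is borrowed from the start of AGE
print(round(ibm_to_float(race + age[:1]), 6))   # 2.000063

# Eight ASCII blanks decoded as if they were a numeric value
print(ibm_to_float(b" " * 8))
```

Under these assumptions the zero-padded decode returns exactly 2, while reading one byte too many reproduces the 2.000063 reported for RACE in the read.xport output further down. That is only one reading consistent with the numbers, not a confirmed diagnosis. The last line gives roughly 3.687825e-40, the value that fills the late rows of the test2.xpt listing, which would fit blank record padding being read as data.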
Here is SAS code I used to create test files, followed by R output.
libname x SASV5XPT "test.xpt";
libname y SASV5XPT "test2.xpt";
PROC FORMAT; VALUE race 1=green 2=blue 3=purple; RUN;
PROC FORMAT CNTLOUT=format;RUN;
LENGTH race 3 age 4;
age=30; label age="Age at Beginning of Study";
format d1 mmddyy10. dt1 datetime. t1 time. race race.;
/* PROC CPORT LIB=work FILE='test.xpt';run; * no; */
PROC COPY IN=work OUT=x;SELECT test;RUN;
PROC COPY IN=work OUT=y;SELECT test format;RUN;
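lookup.xport output for test.xpt (components type, width, index, position, name, sexptype):
 type      "numeric" "numeric" "numeric" "numeric" "numeric"
 width     3 4 8 8 8
 index     1 2 3 4 5
 position  0 3 7 15 23
 name      "RACE" "AGE" "D1" "DT1" "T1"
 sexptype  14 14 14 14 14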
For test2.xpt the output is the same except that tailpad=76 and length=124; the second dataset is ignored.
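One way to check independently how many datasets a transport file holds is to scan it for member header records. The Python sketch below assumes the layout described in the SAS technical note on the transport format: 80-byte records, with each dataset introduced by a record beginning "HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!". The names count_members and record are hypothetical helpers, and the demonstration uses a synthetic byte string rather than the real test files.

```python
RECORD = 80  # assumed record size of the transport format
MEMBER_PREFIX = b"HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!"


def count_members(data: bytes) -> int:
    """Count member (dataset) headers that start on 80-byte boundaries."""
    return sum(
        data[i:i + RECORD].startswith(MEMBER_PREFIX)
        for i in range(0, len(data), RECORD)
    )


def record(text: bytes) -> bytes:
    """Pad text with blanks to one full record (for the synthetic demo only)."""
    return text.ljust(RECORD, b" ")


# Synthetic stand-in for a two-dataset file; not the contents of test2.xpt
library = record(b"HEADER RECORD*******LIBRARY HEADER RECORD!!!!!!!")
member = record(MEMBER_PREFIX)
fake_file = library + member + record(b"") + member + record(b"")
print(count_members(fake_file))            # 2

# Arithmetic on the values lookup.xport reported for test2.xpt
row_bytes = 3 + 4 + 8 + 8 + 8               # widths of RACE, AGE, D1, DT1, T1
print((124 * row_bytes + 76) / RECORD)      # 49.0
```

On a real file the call would be count_members(open("test2.xpt", "rb").read()). The second calculation shows that 124 rows of 31 bytes plus a tailpad of 76 come to exactly 49 full records, which would fit everything after the first dataset's descriptors being counted as observations. That is an inference from the numbers, not something confirmed against the foreign source.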
RACE AGE D1 DT1 T1
1 2.000063 30.00000 15402 1330767062 40425
2 4.000063 31.00000 15494 1338716527 40453
RACE AGE D1 DT1 T1
1 2.000063e+00 3.000000e+01 1.540200e+04 1.330767e+09 4.042500e+04
2 4.000063e+00 3.100000e+01 1.549400e+04 1.338717e+09 4.045300e+04
. . . .
122 3.687825e-40 3.687825e-40 3.687825e-40 5.868918e-40 3.687825e-40
123 5.904941e-40 2.942346e+63 9.068390e+43 NA -5.524256e-48
124 3.835229e-93 6.434447e-86 NA 3.687825e-40 3.687825e-40
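Because lookup.xport does not return the SAS formats, D1, DT1 and T1 arrive as bare numbers. SAS stores dates as days since 1 January 1960, datetimes as seconds since that instant, and times as seconds since midnight, so the first two rows can be checked by hand. This is a small Python sketch with hypothetical helper names (sas_date, sas_datetime, sas_time); it ignores time zones.

```python
from datetime import date, datetime, timedelta

SAS_EPOCH = datetime(1960, 1, 1)  # origin of SAS date and datetime values


def sas_date(days: float) -> date:
    """Days since 1960-01-01 -> calendar date."""
    return (SAS_EPOCH + timedelta(days=days)).date()


def sas_datetime(seconds: float) -> datetime:
    """Seconds since 1960-01-01 00:00:00 -> datetime."""
    return SAS_EPOCH + timedelta(seconds=seconds)


def sas_time(seconds: float) -> timedelta:
    """Seconds since midnight -> time of day as a timedelta."""
    return timedelta(seconds=seconds)


# D1, DT1, T1 from rows 1 and 2 of the read.xport output
rows = [(15402, 1330767062, 40425), (15494, 1338716527, 40453)]
for d1, dt1, t1 in rows:
    print(sas_date(d1), sas_datetime(dt1), sas_time(t1))
# 2002-03-03 2002-03-03 09:31:02 11:13:45
# 2002-06-03 2002-06-03 09:42:07 11:14:13
```

The date and datetime columns agree with each other in both rows, which suggests the 8-byte variables are read correctly and that the trouble in the first two rows is confined to the 3-byte RACE variable.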
test.xpt and test2.xpt may be retrieved from http://hesweb1.med.virginia.edu/biostat/tmp
They were created on an IBM AIX machine running SAS 8.
Thanks very much for any assistance. -Frank
Frank E Harrell Jr Prof. of Biostatistics & Statistics
Div. of Biostatistics & Epidem. Dept. of Health Evaluation Sciences
U. Virginia School of Medicine http://hesweb1.med.virginia.edu/biostat