[R] SAS XPORT data format [was Re: SAS or R software]

Douglas Bates bates at stat.wisc.edu
Sat Dec 18 17:00:31 CET 2004


Frank E Harrell Jr wrote:

... much discussion deleted ...

> Regarding CDISC, the SAS transport format that is now accepted by FDA is 
> deficient because there is no place for certain metadata (e.g., units of 
> measurement, value labels are remote from the datasets, variable names 
> are truncated to 8 characters).  The preferred format for CDISC will 
> become XML.

Since you brought up the SAS XPORT data format I have to respond with my 
usual rant about it.

<rant>
When it comes to the SAS XPORT data format those are at best third or 
fourth order deficiencies in the metadata.  The first order deficiency 
in the metadata is that it does not contain the number of records in a 
data set.  In this format a file can contain more than one data set and 
a data set consists of an unknown number of fixed-length records.
Because of the potential of more than one data set you can't just read 
to the end of the file or use the file size and the record size to 
calculate the number of records.  You must read through the file 
examining each group of 80 characters (Why 80 characters?  Those of us 
who remember punched cards can tell you why.) and for each such group 
try to determine if this is the beginning of another record in the 
current data set or the beginning of a new data set.  How is the
beginning of a new data set indicated?  By a magic string of characters.
What if, either perversely or accidentally, this magic string of
characters were included as a text field at the beginning of a record?
You wouldn't be able to tell if you have a new record or a new data set.
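
To make the collision concrete, here is a minimal Python sketch of the
scanning approach described above.  It is not a real XPORT reader: the
marker prefix in MAGIC, the bare one-card header, and the function names
are assumptions made for illustration only.

```python
CARD = 80  # the file is processed in groups of 80 characters

# Assumed marker prefix, used here only for illustration.
MAGIC = b"HEADER RECORD*******MEMBER"


def header_card() -> bytes:
    """Build a simplified one-card data set header (hypothetical layout)."""
    return MAGIC.ljust(CARD)


def find_data_set_starts(stream: bytes) -> list[int]:
    """Return the offset of every 80-character group that starts with MAGIC."""
    return [
        off
        for off in range(0, len(stream), CARD)
        if stream[off:off + CARD].startswith(MAGIC)
    ]


# One data set holding three 80-character records.  The text field at the
# start of the second record happens to begin with the magic string.
records = [
    b"alpha".ljust(CARD),
    (MAGIC + b" is just text").ljust(CARD),
    b"gamma".ljust(CARD),
]
stream = header_card() + b"".join(records)

starts = find_data_set_starts(stream)
print(starts)  # [0, 160]: the real header plus one spurious "data set"
```

The scanner sees only bytes, so the record at offset 160 looks exactly
like the start of a second data set.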

Even better than that, there are situations in which the number of 
records in a data set is not well-defined due to the requirement of 
padding the last 80-character group with blanks.  (After all, when you
create a punch card deck from your data set you want to get an integer
number of punched cards.)  For example, if you are writing an odd number
of 40-character records then you must pad the last 80-character group
with blanks.  When reading this data set how can you distinguish the odd 
number of records padded with blanks from an even number of records in 
which the last record happened to be all blanks?  You can't.
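
The ambiguity can be reproduced in a few lines of Python.  This is a
sketch under stated assumptions: write_records is a hypothetical helper
that lays out blank-padded fixed-length records and then pads to a
multiple of 80 characters; it is not the SAS writer and it omits all
headers.

```python
CARD = 80  # output must be a whole number of 80-character groups


def write_records(records: list[bytes], reclen: int) -> bytes:
    """Concatenate blank-padded fixed-length records, then pad to a full card.

    Hypothetical helper for illustration; real files also carry headers.
    """
    body = b"".join(r.ljust(reclen) for r in records)
    pad = -len(body) % CARD  # blanks needed to reach the next multiple of 80
    return body + b" " * pad


# Three 40-character records: 120 characters plus 40 blanks of padding.
three = write_records([b"a", b"b", b"c"], 40)

# Four 40-character records where the last one is entirely blank.
four = write_records([b"a", b"b", b"c", b""], 40)

print(len(three), len(four), three == four)  # 160 160 True
```

Both calls produce the same 160 characters, so no reader can recover
whether three or four records were written.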

When I first encountered this, I thought that I must not understand the 
format properly.  I thought that SAS (and, through SAS, the FDA) 
couldn't really be using a format in which the number of records in a 
data set can be ambiguous.  This would mean that the operations of 
writing the XPORT data set and reading it are not guaranteed to be 
inverses.  I started reading material on the SAS web site and discovered 
that SAS indeed was aware of this problem and had a solution - users 
should not create data sets that exhibit this ambiguity.  That's it.
Their solution is "don't do that".
</rant>

I think that replacing the SAS XPORT data format with XML will be a step 
forward.



