Are blank fields allowed in SAS data sets?

Douglas Bates bates@stat.wisc.edu
06 Jan 2000 17:25:02 -0600


Saikat DebRoy and I have been working on an R package that, among
other things, will read SAS data libraries in the XPORT format.  Even
though a SAS data set has a fairly simple structure, it is a
challenge to write code to read the libraries "properly".  In fact, I
am beginning to think that the file format is not well-defined.

Here is the situation:  

  - a SAS data set is a table.  The columns can be numeric (in the
    XPORT format the only numeric format allowed is the IBM mainframe
    double precision format) or character strings.  In a character
    column, all the strings are blank-padded to the same length and
    that length is included in the header information.  There is no
    terminator character for strings.  There is no record terminator
    character.
  - a SAS library file can contain more than one table.
  - the header information for each table includes the number of
    columns and the format of each column but does _not_ include the
    number of rows.  (Why not? Remember that SAS was developed at a
    time when any large amount of data was stored on punched cards or
    magnetic tape.  SAS functions as a data filter, for the most
    part.  You don't know how many rows you have until you get to the
    end and then you can't go back and change the beginning because of
    the way magnetic tape drives work.  Of course, the relevance of
    these considerations to computing resources in the year 2000 is
    questionable.) 
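
To make the layout concrete, here is a minimal Python sketch of how one
record with no terminators could be split using nothing but the column
widths from the header.  The column specification and the split_record
helper are hypothetical, made up for illustration; they are not taken
from SAS or from any existing package, and numeric fields are simply
left as raw bytes.

```python
# Hypothetical column spec for illustration: (name, kind, width in bytes).
COLUMNS = [("ID", "char", 4), ("NAME", "char", 8), ("DOSE", "num", 8)]

def split_record(record, columns):
    """Slice one fixed-width record into fields using header widths only."""
    expected = sum(width for _, _, width in columns)
    if len(record) != expected:
        raise ValueError(f"record is {len(record)} bytes, expected {expected}")
    fields, offset = {}, 0
    for name, kind, width in columns:
        raw = record[offset:offset + width]
        # Character fields are blank-padded; numeric fields stay as raw bytes.
        fields[name] = raw.decode("ascii").rstrip(" ") if kind == "char" else raw
        offset += width
    return fields

# One 20-byte record: two blank-padded strings and 8 bytes of numeric data.
record = b"A1  " + b"SMITH   " + bytes.fromhex("4110000000000000")
fields = split_record(record, COLUMNS)
print(fields)
```

Note that stripping the padding makes a field that was all blanks
indistinguishable from an empty string.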

So how do you know when you have reached the end of one data set and
started another?  SAS always works in blocks of 80 bytes.  If the last
record in a table does not completely fill an 80 byte block, the
remainder of the block is blank-padded.  The next table will begin
with 80 bytes that must be exactly
"HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!000000000000000001600000000140  "
except on certain VAX/VMS computers where the 140 at the end is 136.
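
The boundary test described above can be sketched in Python as follows.
This assumes that every member header starts on an 80-byte boundary
counted from the start of the data being scanned; the function names
are hypothetical and only the two header variants mentioned above are
accepted.

```python
BLOCK = 80

# The 80-byte member header quoted above.
MEMBER_HEADER = (b"HEADER RECORD*******MEMBER  HEADER RECORD!!!!!!!"
                 b"000000000000000001600000000140  ")
# Variant for certain VAX/VMS systems: 136 in place of 140.
MEMBER_HEADER_VMS = MEMBER_HEADER[:-5] + b"136  "

def is_member_header(block):
    """True if an 80-byte block is exactly one of the two header variants."""
    return block in (MEMBER_HEADER, MEMBER_HEADER_VMS)

def member_offsets(data):
    """Offsets of the 80-byte blocks that match a member header."""
    return [i for i in range(0, len(data) - BLOCK + 1, BLOCK)
            if is_member_header(data[i:i + BLOCK])]

# Two headers separated by two blocks of blanks.
data = MEMBER_HEADER + b" " * 160 + MEMBER_HEADER_VMS
print(member_offsets(data))
```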

I ran into a situation where the data were 484 rows of 8 numeric
columns.  The total length of each record is 64 bytes so the data
proper occupy 30976 bytes, not counting the headers.  That is 387
complete 80 byte blocks with 16 bytes left over.  Those 16 bytes start
a final block, which is padded with 64 blanks.
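
The arithmetic can be checked with a few lines of Python (the variable
names are only illustrative):

```python
BLOCK = 80
rows, record_len = 484, 8 * 8      # 8 numeric columns of 8 bytes each

data_bytes = rows * record_len
full_blocks, leftover = divmod(data_bytes, BLOCK)
padding = (BLOCK - leftover) % BLOCK

print(data_bytes, full_blocks, leftover, padding)   # 30976 387 16 64
# The padding is exactly as long as one more record would be.
print(padding // record_len)                         # 1
```

The trouble arises only because the padding is at least one record
long; padding shorter than a record could never be mistaken for data.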

So how do we know that these 64 blanks are not another data record?
In the case of numeric data, each group of 8 blanks, read in the IBM
mainframe floating point format (not the IEEE format), decodes to the
same small number, so such a record shows up as

> Pheno[745,]
           INDIV         TIME         DOSE       WEIGHT         CONC
745 3.687825e-40 3.687825e-40 3.687825e-40 3.687825e-40 3.687825e-40
          NEWSUB     APGARLOW         TLAG
745 3.687825e-40 3.687825e-40 3.687825e-40

We can probably employ a heuristic here: this value is unlikely to
occur in real data, so we guess that the 64 blanks are padding and not
another data record.
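
For reference, the value shown above can be reproduced with a short
Python sketch of the conversion.  It assumes the usual layout of the
IBM mainframe double precision format (one sign bit, a 7-bit excess-64
exponent that is a power of 16, and a 56-bit fraction); the function
name is hypothetical.

```python
def ibm_double_to_float(b):
    """Convert 8 bytes in IBM mainframe double precision to a Python float."""
    if len(b) != 8:
        raise ValueError("need exactly 8 bytes")
    sign = -1.0 if b[0] & 0x80 else 1.0
    exponent = (b[0] & 0x7F) - 64                     # power of 16, excess-64
    fraction = int.from_bytes(b[1:], "big") / 2**56   # 56-bit fraction in [0, 1)
    return sign * fraction * 16.0**exponent

blank = ibm_double_to_float(b" " * 8)   # eight ASCII blanks, 0x20 each
print(f"{blank:.6e}")                   # 3.687825e-40
```

A reader could flag a trailing record as likely padding when every
numeric field in it decodes to exactly this value, but that remains a
guess rather than a rule of the format.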

But what if they had been 8 character fields, each of width 8 bytes?
Is it possible to distinguish between a blank record and blank
padding?  I can't see that it is possible.  Is there some restriction
in SAS that says you can't have a record in which all the fields are
blank?
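
The ambiguity can be stated precisely with a small Python sketch.
Given the bytes between two headers and the record length, it lists
every row count that is consistent with blank padding of less than one
80 byte block.  The function name and the sample data are hypothetical.

```python
BLOCK = 80

def candidate_row_counts(region, record_len):
    """Row counts consistent with a region blank-padded to whole blocks."""
    if len(region) % BLOCK:
        raise ValueError("region must be a whole number of 80-byte blocks")
    counts = []
    for rows in range(len(region) // record_len, -1, -1):
        used = rows * record_len
        if len(region) - used >= BLOCK:        # padding never fills a block
            break
        if region[used:].strip(b" ") == b"":   # the rest is blank
            counts.append(rows)
    return sorted(counts)

# 484 non-blank rows of 8 character fields, each 8 bytes, plus 64 blanks.
region = b"ABCDEFGH" * 8 * 484 + b" " * 64
print(candidate_row_counts(region, 64))        # [484, 485]
```

Both answers fit the bytes equally well, so nothing in the file itself
says whether row 485 exists.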

One could pass this off as of little interest to the R community
except that this format is now a "standard" that has been adopted by
the United States Food and Drug Administration (FDA).  See
       http://www.sas.com/software/industry/pht/fda/index.html

-- 
Douglas Bates                            bates@stat.wisc.edu
Statistics Department                    608/262-2598
University of Wisconsin - Madison        http://www.stat.wisc.edu/~bates/