[R] Reading sas7bdat files directly

Nordlund, Dan (DSHS/RDA) NordlDJ at dshs.wa.gov
Tue Mar 2 02:45:59 CET 2010


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of Roger DeAngelis(xlr82sas)
> Sent: Monday, March 01, 2010 4:38 PM
> To: r-help at r-project.org
> Subject: Re: [R] Reading sas7bdat files directly
> 
> 
> Hi All,
> 
>  The hack below might help R users get going with Chris's DSREAD. I have not
> had a chance to look at  Monday's version of DSREAD, can't wait.
> 
> Note Duncan Murdoch was most gracious to supply me with a R function to
> translate floats in 16 char hex to R floats.
> 
> Your utility solves the 200 byte, 8 char name and potential precision
> errors with other methods of transfering SAS datasets to perl and R.
> Thanks.
> 
> Importing SAS datasets(sas7bdat) into R
> (32 bit windows 2000, 32 bit SAS 9.2 and
> 32 bit R version 2.9.0 (2009-04-17)
> 
>  Here is what I want to accomplish, the double floats below show data
> from SAS to R.
>  They are exactly the same in R and SAS memory, bit for bit.
> 
> 
>   R Internal         SAS Internal
>   16 Byte Float      16 byte Float
> 
> 
> 3FFAAAAAAAAAAAAB  3FFAAAAAAAAAAAAB
> 4002AAAAAAAAAAAB  4002AAAAAAAAAAAB
> 400D555555555555  400D555555555555
> 3FF6666666666666  3FF6666666666666
> 3FFCCCCCCCCCCCCD  3FFCCCCCCCCCCCCD
> 400199999999999A  400199999999999A
> 4004CCCCCCCCCCCD  4004CCCCCCCCCCCD
> 3FF4924924924925  3FF4924924924925
> 3FF9249249249249  3FF9249249249249
> 3FFDB6DB6DB6DB6E  3FFDB6DB6DB6DB6E
> 4001249249249249  4001249249249249
> 3FF2E8BA2E8BA2E9  3FF2E8BA2E8BA2E9
> 3FF5D1745D1745D1  3FF5D1745D1745D1
> 3FF8BA2E8BA2E8BA  3FF8BA2E8BA2E8BA
> 3FFBA2E8BA2E8BA3  3FFBA2E8BA2E8BA3
> 3FF2762762762762  3FF2762762762762
> 3FF4EC4EC4EC4EC5  3FF4EC4EC4EC4EC5
> 3FF7627627627627  3FF7627627627627
> 3FF9D89D89D89D8A  3FF9D89D89D89D8A
> 1.7976931348623E  1.7976931348623E
> 0010000000000000  0010000000000000
> 
> 
> I don't believe this high accuracy transfer is possible  with any
> other method except ODBC,
> but SAS ODBC is unsatisfactory for me. If you use CSV with the maximum
> assured decimal
> precision(15 significant digits?). The CSV decimal numbers will only
> approximate the double floats.
> 
> I consider the Csv to be corrupt if the relative of absolute
> difference using the decimal
> Csv numbers and the memory floats is greater than 10^-12.  There are
> two sources of error first
> the SAS floats are decimally rounded and converted to decimal then the
> rounded decimal
> approximations are  converted into R floats.
> 
> 
> Status of     R Internal            CSV
> Csv           16 Byte Float
> 
> 
> Csv corrupt 3FFAAAAAAAAAAAAB   1.66666666666667    >10^-12 different
> Csv corrupt 4002AAAAAAAAAAAB   2.33333333333333
> Csv corrupt 400D555555555555   3.66666666666667
> Csv OK      3FF6666666666666   1.4
> Csv OK      3FFCCCCCCCCCCCCD   1.8
> Csv OK      400199999999999A   2.2
> Csv OK      4004CCCCCCCCCCCD   2.6
> Csv corrupt 3FF4924924924925   1.28571428571429
> Csv corrupt 3FF9249249249249   1.57142857142857
> Csv corrupt 3FFDB6DB6DB6DB6E   1.85714285714286
> Csv corrupt 4001249249249249   2.14285714285714
> Csv corrupt 3FF2E8BA2E8BA2E9   1.18181818181818
> Csv corrupt 3FF5D1745D1745D1   1.36363636363636
> Csv corrupt 3FF8BA2E8BA2E8BA   1.54545454545455
> Csv corrupt 3FFBA2E8BA2E8BA3   1.72727272727273
> Csv corrupt 3FF2762762762762   1.15384615384615
> Csv corrupt 3FF4EC4EC4EC4EC5   1.30769230769231
> Csv corrupt 3FF7627627627627   1.46153846153846
> Csv corrupt 3FF9D89D89D89D8A   1.61538461538462
> Csv corrupt 1.7976931348623E   1.7976931348623E+308
> Csv corrupt 0010000000000000   2.2250738585072E-308
> 
> 
> Bacground
> 
> 
>   1. Provide absolutely loss less transfer
>      of character(max 32756 bytes per character variable)  and numeric
> data from SAS to R
>      Since SAS has only two datatypes so this code should be
> exhaustive.
> 
> 
>   2. This code is useful because:
>      a. The SAS ODBC driver requires the user to not only have
>         SAS but the user must bring up a SAS session and
>         the session has to be closed manually. (SAS issue not a
> foreign issue)
>      b. The foreign package also requires interaction with SAS. (SAS
> issue)
>      c. SASxport only supports 8 character SAS names and a max of
>         200 byte character values. (This is a SAS issue not a SASxport
> issue)
>      d. SASxport creates floating point doubles that have an 8 bit
> exponent
>         and 56 bit mantissa while IEEE is 11 bit exponent and 53 bit
> mantissa
>         (sometimes defined slightly differently depending of where you
> consider
>         the sign bits). This results is the loss of some very small
> and
>         very large numbers. ( SAS issue not a SASxport issue)
> 
> 
>   3. How this code overcomes the issues above for import only.
> 
> 
>      You need the dsread exec in the previous mesage. Also the input
> SAS dataset must have
> 16 byte character representations for the floats. I am working with
> the developer to see what we
> can do about this..
> He will make it an option on the invocation to do the hex conversion
> for numerics.
> 
> 
> Here is the R code run inside a SAS datastep. Actually I can interact
> with the output of the R code
> in the same dataqstep. It is also possible to run perl, SAS procs and
> other SAS languages in the same datastep.
> Note the input pipe, no physical CSV file is produced).
> 
> 
> If there is interest I can provide the code that executes R.
> 
> 
> data _null_;
>   length pgm $1250;
>   pgm=compbl("
>   library (SASxport);
>   library (foreign);
>   hexdigits <- function(s) {;
>       digits <- 0:15;
>       names(digits) <- c(0:9, LETTERS[1:6]);
>       digits[strsplit(s, '')[[1]]];
>   };
>   bytes <- function(s) {;
>       digits <- matrix(hexdigits(s), ncol=2, byrow=TRUE);
>       digits;
>       as.raw(digits %*% c(16,1));
>   };
>   todouble <- function(bytes) {;
>       con <- rawConnection(bytes);
>       val <- readBin(con, 'double', endian='big');
>       close(con);
>       val;
>   };
>   x <-c(1:21);
>   rc<-c(1:21);
>   ln<-c(1:21);
>   z<-read.table(pipe('C:\\tip\\dsread.exe -v C:\\tip\
> \fix.sas7bdat'),header=TRUE,sep=',',colClasses='character');
>   st<-z$STR;
>   lin<-z$LIN;
>   d<-as.numeric(z$DECIMAL_REPRESENTATION);
>   h<-as.character(z$HEXIDECIMAL_REPRESENTATION);
>   for ( i in 1:21 ) {;
>     x[i]  <- todouble(bytes(h[i]));
>     rc[i] <- if (((abs( x[i] - d[i] )       > 1E-12 )) || ;
>              (abs((x[i] - d[i])/x[i] ) > 1E-12 )) 0 else 1;
>     ln[i] <- nchar(st[i], type = 'bytes');
>   };
>   R_ntrnl    <-h ;
>   SASntrnl   <-h ;
>   R_deciml   <-sprintf('%.14e',x);
>   SAS_deciml <-sprintf('%.14e',x);
>   Csv_stmat  <-z$DECIMAL_UNTOUCHED;
>   Corrupt    <-rc;
>   datfrm     <-
> data.frame(R_ntrnl ,SASntrnl ,R_deciml ,SAS_deciml ,Csv_stmat
> ,Corrupt,ln,lin);
>   write.xport(datfrm,file='C:\\utl\
> \datfrm.xpt',autogen.formats=FALSE);
>   ");
>   call rxeq(pgm);
>   call getxpt('datfrm');
> run;
> 
> 
> SAS code to create fix.sas7bdat
> 
> 
> options xsync xwait;run;
> %let fac=1000;
> data "c:\tip\fix.sas7bdat"(drop=prime nonprime byt);
>   retain byt 0  str;
>   length str $%eval(&fac * 32);
>   do prime=3,5,7,11,13;
>     do nonprime=2,4,6,8;
>       byt+&fac;
>       str=repeat(byte(64+byt/&fac),byt);
>       decimal_representation    =nonprime/prime+1;
>       hexidecimal_representation=put(decimal_representation,hex16.);
>       decimal_untouched         =cats(put(round(decimal_representation,
> 1e-14),best32.));
>       lin=length(str);
>       if decimal_representation ne 3 then output;
>     end;
>   end;
>   decimal_representation    =constant('big');
>   hexidecimal_representation=put(constant('big'),e20.);
>   decimal_untouched         =cats(put(decimal_representation,e20.));
>   str=repeat('@',%eval(&fac * 30));
>   lin=length(str);
>   output;
>   decimal_representation    =constant('small');
>   hexidecimal_representation=put(constant('small'),hex16.);
>   decimal_untouched         =cats(put(decimal_representation,e20.));
>   str=repeat('@',%eval(&fac * 32));
>   lin=length(str);
>   output;
>   format _numeric_ e20.;
> run;
> 
> 

The announcement that Chris Long made about a HEX output method for the dsread utility was to output the hex representation in "little endian" byte order, so the above routines will need to take that into account.

Dan

Daniel J. Nordlund
Washington State Department of Social and Health Services
Planning, Performance, and Accountability
Research and Data Analysis Division
Olympia, WA  98504-5204



More information about the R-help mailing list