[R] A file with extension .sdb in a codebook section of a large database from a survey?

Douglas Bates bates at stat.wisc.edu
Thu Mar 25 22:58:13 CET 2010


Thanks very much, Marc.

So did you really read 110 pages of the User Guide to find this out or
did you do something more clever?

On Thu, Mar 25, 2010 at 4:38 PM, Marc Schwartz <marc_schwartz at me.com> wrote:
> On Mar 25, 2010, at 3:54 PM, Douglas Bates wrote:
>
>> The TIMSS2007 database http://timss.bc.edu/TIMSS2007/idb_ug.html seems
>> to provide "both kinds" of universal data formats - either SPSS saved
>> data sets or SAS saved data sets.  (Yes, I am being sarcastic.)
>> These, of course, are accompanied by massive codebooks explaining the
>> nature of each of the fields in the data sets.  The T07_Codebooks.zip
>> file available at that site contains .pdf files and .sdb files, which
>> seem to contain the information from the codebooks in some kind of
>> binary format.  Does anyone know where that format is defined?  I
>> imagine I could reverse-engineer it but would prefer not to do so.
>>
>> I would like to use part of this dataset as an example of a very large
>> hierarchically structured data set for analysis in lme4.
>
>
> Doug,
>
> According to the User Guide, bottom of page 110, they are "standard Dbase" files.  I tried reading one of them with read.dbf() in 'foreign'; however, that did not work. It would seem that if you rename the extensions from .sdb to .dbf, then they can be read with read.dbf():
>
> # rename ACGTMSM4.sdb to ACGTMSM4.dbf
>
>> str(read.dbf("ACGTMSM4.dbf"))
> 'data.frame':   116 obs. of  28 variables:
>  $ FIELD_NAME: Factor w/ 116 levels "AC4GAPAD","AC4GAPCH",..: 104 108 73 26 23 51 50 44 25 41 ...
>  $ FIELD_TYPE: Factor w/ 2 levels "C","N": 2 2 2 2 1 1 1 1 2 2 ...
>  $ FIELD_LEN : int  5 4 5 5 1 1 1 1 3 2 ...
>  $ FIELD_DEC : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ FIELD_LABL: Factor w/ 116 levels "COUNTRY ID","EXPLICIT STRATUM CODE",..: 1 98 74 75 76 71 70 48 47 72 ...
>  $ QUEST_LOC : Factor w/ 106 levels "COUNTRY","DATE",..: 1 8 9 10 11 12 13 14 15 16 ...
>  $ MISSING   : Factor w/ 6 levels "9","99","999",..: NA NA 4 4 1 1 1 1 3 2 ...
>  $ NOTAPPL   : Factor w/ 6 levels "8","98","998",..: NA NA 4 4 1 1 1 1 3 2 ...
>  $ DEFAULT   : Factor w/ 4 levels "7","97","997",..: NA NA 4 4 1 1 1 1 3 2 ...
>  $ FIELD_VALI: Factor w/ 105 levels ".T.","(AC4GAPAD>=0.AND.AC4GAPAD<=97).OR.AC4GAPAD=999.OR.AC4GAPAD=998",..: 1 14 13 10 30 54 53 47 9 11 ...
>  $ FIELD_CODE: Factor w/ 41 levels "0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;omitted:9;not admin.:8;",..: 21 22 14 11 28 1 1 29 10 12 ...
>  $ FIELD_EDIT: logi  TRUE TRUE TRUE TRUE TRUE TRUE ...
>  $ FIELD_CARR: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
>  $ ORDER_SCRN: int  1 2 3 4 5 6 7 8 9 10 ...
>  $ ORDER_FILE: int  1 2 3 4 5 6 7 8 9 10 ...
>  $ COMMENT1  : Factor w/ 4 levels "Released in TIMSS 2003 as acdgpsc",..: NA NA NA NA NA NA NA NA NA NA ...
>  $ MEAS_CLASS: Factor w/ 6 levels "B","BD","D","DERI",..: 6 6 1 1 1 1 1 1 1 1 ...
>  $ IDBOOK    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  $ FMT       : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  $ DUMMY     : int  0 0 0 0 0 0 0 0 0 0 ...
>  $ VALID_VAL : Factor w/ 7 levels ".T.","1;2;","1;2;3;",..: 1 NA NA NA 6 4 4 4 NA NA ...
>  $ MIN_MAX   : Factor w/ 8 levels "0;11000","0;1200",..: NA 6 1 2 NA NA NA NA 7 8 ...
>  $ FILTER_VAR: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  $ FILTER_CND: Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  $ CONFIRMED : logi  NA NA NA NA NA NA ...
>  $ SASPG1    : Factor w/ 6 levels "B","BD","DPC",..: 5 5 1 1 1 1 1 1 1 1 ...
>  $ SASPG2    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  $ SASPG3    : Factor w/ 0 levels: NA NA NA NA NA NA NA NA NA NA ...
>  - attr(*, "data_types")= chr  "C" "C" "N" "N" ...
>
>
> However, there were warnings:
>
>> warnings()
> Warning messages:
> 1: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
> 2: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
> 3: In read.dbf("ACGTMSM4.dbf") : value |0| found in logical field
> ...
>
>
> The content of the above data frame does seem to correspond to the PDF file content.
>
> HTH,
>
> Marc Schwartz
>
>
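
The rename-then-read step Marc describes can be wrapped in a small function so the original .sdb file is left untouched. This is only a sketch: read_sdb() is a hypothetical helper name, not part of 'foreign', and it assumes the .sdb file really is a plain dBase file, as the User Guide states. It copies the file to a temporary .dbf and calls read.dbf() on the copy.

```r
library(foreign)  # recommended package shipped with R; provides read.dbf()

# Hypothetical helper (not part of 'foreign'): copy a .sdb codebook to a
# temporary file with a .dbf extension and read that copy.
read_sdb <- function(path, as.is = TRUE) {
  stopifnot(file.exists(path))
  tmp <- tempfile(fileext = ".dbf")
  on.exit(unlink(tmp), add = TRUE)   # remove the temporary copy afterwards
  if (!file.copy(path, tmp)) stop("could not copy ", path)
  # as.is = TRUE keeps character fields as character instead of factor
  read.dbf(tmp, as.is = as.is)
}

# Usage, assuming ACGTMSM4.sdb has been extracted from T07_Codebooks.zip:
# cb <- read_sdb("ACGTMSM4.sdb")
# str(cb)
```

The warnings about values found in logical fields would presumably still appear, since they come from the file contents rather than from the file name.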


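The FIELD_CODE column in the str() output above appears to store value labels as "label:code;" pairs in a single string. A rough sketch of splitting one such string into a lookup table follows. parse_field_code() is a hypothetical name, and the "label:code;" layout is inferred only from the one level shown in the output, so it may not hold for every field. The input must be a character string, so convert a factor with as.character() first.

```r
# Hypothetical helper: split one FIELD_CODE string of the assumed form
# "label:code;label:code;" into a data frame of codes and labels.
parse_field_code <- function(x) {
  items <- strsplit(x, ";", fixed = TRUE)[[1]]
  items <- items[nzchar(trimws(items))]   # drop empty pieces
  # Split on the last colon so a label that itself contains ":" stays whole.
  data.frame(
    code  = trimws(sub("^.*:", "", items)),
    label = trimws(sub(":[^:]*$", "", items)),
    stringsAsFactors = FALSE
  )
}

# The first FIELD_CODE level shown in the str() output:
parse_field_code("0 TO 10 PERCENT:1;11 TO 25 PERCENT:2;26 TO 50 PERCENT:3;MORE THAN 50 PERCENT:4;omitted:9;not admin.:8;")
```

The codes are returned as character strings; whether to convert them to integers depends on the field in question.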
