[Rd] foreign::read.dbf fails to parse dbf properly

Roger Bivand Roger.Bivand using nhh.no
Sat Jul 30 13:27:27 CEST 2022


On Sat, 30 Jul 2022, r-devel-request using r-project.org wrote:

>
> Dear R developers,
>
> tl;dr I've been trying to read foxpro dbf files with
> foreign::read.dbf(), they weren't being read properly, I patched the
> foreign package to make it work, now what?
>
> Long version:
> I recently encountered unexpected behavior attempting to read dbf files
> using foreign::read.dbf() from here:
>
> https://forms.ferc.gov/f1allyears/f1_2020.zip

As you may have seen in the code or the help page, the first port from the 
then-current version of shapelib (https://github.com/OSGeo/shapelib) was 
made by Nicholas Lewin-Koh 20 years ago. The code is largely unchanged 
since then.

What is your OS and R version?

Reading F1_15.DBF (Fedora 36, locally built R 4.2.1), I see:

> summary(F1_15.DBF)
  RESPONDENT      REPORT_YEA     SPPLMNT_NU     ROW_NUMBER       ROW_SEQ
  \xc2   :  350   \xe4\a:25774   NA's:25774   U      :  874   U      :  874
  \001   :  248                               C      :  864   C      :  864
  \002   :  208                               T      :  846   T      :  846
  x      :  206                               \004   :  845   \004   :  845
  \x85   :  197                               \002   :  840   \002   :  840
  (Other):24363                               (Other):20813   (Other):20813
  NA's   :  202                               NA's   :  692   NA's   :  692
...

which suggests that encoding may be another problem, given the switch to 
UTF-8 as the native encoding in R 4.2 on Windows (UCRT).

The help page does say:

"The DBF format is documented but not much adhered to.  There is
no guarantee this will read all DBF files."

and:

      'read.dbf' is based on C code from <http://shapelib.maptools.org/>
      which implements the 'XBASE' specification.  It can convert fields
      of type '"L"' (logical), '"N"' and '"F"' (numeric and float) and
      '"D"' (dates): all other field types are read as-is as character
      vectors.  A numeric field is read as an R integer vector if it is
      encoded to have no decimals, otherwise as a numeric vector.
      However, if the numbers are too large to fit into an integer
      vector, it is changed to numeric.  Note that it is possible to read
      integers that cannot be represented exactly even as doubles: this
      sometimes occurs if IDs are incorrectly coded as numeric.
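
To check which field types a particular file declares, the field 
descriptors in the DBF header can be listed directly. The following is 
only a minimal sketch, assuming the usual dBase layout (a 32-byte file 
header followed by 32-byte field descriptors, ending with a 0x0D byte); 
dbf_fields() is a hypothetical helper and not part of foreign.

```r
# Sketch: list name, type, length and decimals for each field in a DBF header.
# dbf_fields() is a hypothetical helper, not a function in the foreign package.
dbf_fields <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  readBin(con, "raw", n = 32L)            # skip the 32-byte file header
  fields <- list()
  repeat {
    d <- readBin(con, "raw", n = 32L)     # one field descriptor
    # Stop at the 0x0D terminator or at a short read (end of file).
    if (length(d) < 32L || d[1] == as.raw(0x0d)) break
    # Bytes 1-11: field name, nul-terminated; keep bytes before the first nul.
    name_bytes <- d[1:11]
    nul <- which(name_bytes == as.raw(0))
    if (length(nul) > 0L) name_bytes <- name_bytes[seq_len(nul[1] - 1L)]
    fields[[length(fields) + 1L]] <- data.frame(
      name     = rawToChar(name_bytes),
      type     = rawToChar(d[12]),        # byte 12: one-letter field type
      length   = as.integer(d[17]),       # byte 17: field length in bytes
      decimals = as.integer(d[18]),       # byte 18: decimal count
      stringsAsFactors = FALSE
    )
  }
  do.call(rbind, fields)
}

# Usage (file name as used in this thread):
# dbf_fields("F1_15.DBF")
```

Any field whose type is not one of "L", "N", "F" or "D" would, going by 
the help text above, be returned as character.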

So pre-converting the files seems easier than retro-fitting read.dbf(), 
given how long ago the function was first published. LibreOffice appears 
to read the value as 40, and writes it out as 40 in a more accessible 
form that read.dbf() can read.

Using ogr2ogr (https://gdal.org/programs/ogr2ogr.html) from GDAL (locally 
built 3.5.1, with its bundled shapelib, on Fedora 36 in a UTF-8 locale), 
I get:

ogr2ogr -f CSV F1_15.csv F1_15.DBF
Warning 1: One or several characters couldn't be converted correctly from 
CP1252 to UTF-8.  This warning will not be emitted anymore

and the output contains "(", not 40. So it doesn't seem that updating the 
shapelib files in foreign would help.
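
The "(" versus 40 difference, and the "\xe4\a" shown for REPORT_YEA in 
the summary above, are both consistent with 4-byte integers being 
returned as text. A small sketch, assuming the values are stored least 
significant byte first (the byte vectors are constructed here for 
illustration, not read from the file):

```r
# Four bytes holding the integer 40, least significant byte first.
bytes_40 <- as.raw(c(0x28, 0x00, 0x00, 0x00))

# Treated as text (nul bytes dropped), the bytes give "(".
as_text <- rawToChar(bytes_40[bytes_40 != as.raw(0)])

# Treated as a 4-byte little-endian signed integer, they give 40.
as_int <- readBin(bytes_40, what = "integer", n = 1L, size = 4L,
                  endian = "little")

# The bytes behind "\xe4\a" (0xE4 0x07), padded to four bytes, decode
# to 2020, which matches the year of the f1_2020.zip archive.
bytes_year <- as.raw(c(0xe4, 0x07, 0x00, 0x00))
year <- readBin(bytes_year, what = "integer", n = 1L, size = 4L,
                endian = "little")
```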


In addition, there is an error under options("warn"=2L) in:

F1_EMAIL.DBF
Error in read.dbf(i) :
   (converted from warning) value |0| found in logical field

and possibly others which do not seem to relate to the field definition 
problem you identified.
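
To see which files raise such warnings without stopping at the first 
one, the warnings can be collected per file. This is only a sketch: 
collect_warnings() is a hypothetical helper, and it is assumed that the 
warn option is left at its default rather than set to 2.

```r
# collect_warnings() is a hypothetical helper: it calls reader(path) and
# returns the result together with the messages of any warnings raised.
collect_warnings <- function(path, reader = foreign::read.dbf) {
  msgs <- character()
  value <- withCallingHandlers(
    reader(path),
    warning = function(w) {
      # Record the message, then continue without printing the warning.
      msgs <<- c(msgs, conditionMessage(w))
      invokeRestart("muffleWarning")
    }
  )
  list(value = value, warnings = msgs)
}

# Usage over all DBF files in the working directory:
# res <- lapply(list.files(pattern = "\\.DBF$"), collect_warnings)
```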

Roger

>
> unzipped, in UPLOADERS/FORM1/working/F1_15.DBF - and as a note, this is
> a foxpro database. I would expect the first row of the first column to
> be 40, instead I am getting "(" (realizing that "(" has a decimal ascii
> value of 40). The xbase docs indicate that this is a field of type "I"
> which is a 4-byte integer unique to foxpro, and it doesn't look like
> this case is contemplated by read.dbf()
>
> I made some modifications to Rdbfread.c and dbfopen.c in the foreign
> package (version 0.8-82) to add specific handling for field type "I".
>
> I'm not currently set up to contribute directly, I don't have SVN access.
>
> 1. Is this patch of general interest? I'm weighing in the development
> guidelines:
>  - DO NOT fix exotic bugs that haven't bugged anyone
>  - DO make small enhancements if they are badly needed
> and I feel like this is maybe a bit of an exotic lack-of-feature
> (wouldn't call it a bug), and I have no idea if this is badly needed
> (by anybody, other than myself)
>
> 2. if of general interest, how can I get set up with SVN credentials
> for R-packages?
>
> Thanks!
> -Ezra

-- 
Roger Bivand
Emeritus Professor
Department of Economics, Norwegian School of Economics,
Postboks 3490 Ytre Sandviken, 5045 Bergen, Norway.
e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


