[R] issue with "strange" characters (locale settings)

R.T.A.J.Leenders r.t.a.j.leenders at rug.nl
Wed May 4 11:57:46 CEST 2011


   WinXP-x32, R-2.13.0
   Dear list,
   I have a problem that (I think) relates to the interaction between Windows
   and R.
   I am trying to scrape a table with data on the Hawaiian Islands. This is my
   code:
   library(XML)
   u <- "http://en.wikipedia.org/wiki/Hawaii"
   tables <- readHTMLTable(u)
   Islands <- tables[[5]]
   The output is (first set of columns):
> Islands
          Island            Nickname                                                  Location
1    Hawaiʻi[7]      The Big Island     19°34′N 155°30′W / 19.567°N 155.5°W / 19.567; -155.5
2        Maui[8]     The Valley Isle     20°48′N 156°20′W / 20.8°N 156.333°W / 20.8; -156.333
3 Kahoʻolawe[9]     The Target Isle       20°33′N 156°36′W / 20.55°N 156.6°W / 20.55; -156.6
4   LÄnaÊ»i[10]  The Pineapple Isle 20°50′N 156°56′W / 20.833°N 156.933°W / 20.833; -156.933
5  Molokaʻi[11]   The Friendly Isle 21°08′N 157°02′W / 21.133°N 157.033°W / 21.133; -157.033
6     Oʻahu[12] The Gathering Place 21°28′N 157°59′W / 21.467°N 157.983°W / 21.467; -157.983
7    Kauaʻi[13]     The Garden Isle     22°05′N 159°30′W / 22.083°N 159.5°W / 22.083; -159.5
8   Niʻihau[14]  The Forbidden Isle     21°54′N 160°10′W / 21.9°N 160.167°W / 21.9; -160.167

   As you can see, there are "weird" characters in there. I have also tried
   readHTMLTable(u, encoding = "UTF-16") and readHTMLTable(u, encoding =
   "UTF-8"), but that didn't help.
   It seems to me that there may be an issue with how R interacts with the
   Windows character set settings.
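
   To illustrate the kind of mismatch suspected here, the following is a
   minimal sketch, not a confirmed diagnosis. It assumes the scraped strings
   hold UTF-8 bytes that R then displays as if they were in the native
   single-byte code page. The helper mark_utf8() is a hypothetical name
   introduced only for this example; it is not part of the XML package.

```r
# UTF-8 bytes for "é" (U+00E9): 0xC3 0xA9
bytes <- as.raw(c(0xC3, 0xA9))

# The same two bytes under two different encoding declarations
as_utf8 <- rawToChar(bytes)
Encoding(as_utf8) <- "UTF-8"

as_latin1 <- rawToChar(bytes)
Encoding(as_latin1) <- "latin1"

# Code points R sees under each declaration
cp_utf8   <- utf8ToInt(as_utf8)               # a single code point
cp_latin1 <- utf8ToInt(enc2utf8(as_latin1))   # two code points, one per byte

# Hypothetical helper: declare every character or factor column of a
# data frame as UTF-8 without changing the underlying bytes.
mark_utf8 <- function(df) {
  df[] <- lapply(df, function(col) {
    if (is.factor(col)) {
      lev <- levels(col)
      Encoding(lev) <- "UTF-8"
      levels(col) <- lev
    } else if (is.character(col)) {
      Encoding(col) <- "UTF-8"
    }
    col
  })
  df
}

# Example input: the UTF-8 bytes of the island name in row 4 of the table
raw_name <- rawToChar(as.raw(c(0x4C, 0xC4, 0x81, 0x6E, 0x61, 0xCA, 0xBB, 0x69)))
demo  <- data.frame(Island = raw_name, stringsAsFactors = TRUE)
fixed <- mark_utf8(demo)
```

   If the assumption holds, mark_utf8(Islands) would only change how the
   strings are declared; whether the console can then display them still
   depends on the fonts and the code page in use.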
   sessionInfo() gives
   > sessionInfo()
   R version 2.13.0 (2011-04-13)
   Platform: i386-pc-mingw32/i386 (32-bit)
   locale:
   [1] LC_COLLATE=Dutch_Netherlands.1252  LC_CTYPE=Dutch_Netherlands.1252
   LC_MONETARY=Dutch_Netherlands.1252
   [4] LC_NUMERIC=C                       LC_TIME=Dutch_Netherlands.1252
   attached base packages:
   [1] stats     graphics  grDevices utils     datasets  methods   base
   other attached packages:
   [1] XML_3.2-0.2
   >
   I have also attempted to let R use another setting by entering:
   Sys.setlocale("LC_ALL", "en_US.UTF-8"), but this yields the response:
   > Sys.setlocale("LC_ALL", "en_US.UTF-8")
   [1] ""
   Warning message:
   In Sys.setlocale("LC_ALL", "en_US.UTF-8") :
     OS reports request to set locale to "en_US.UTF-8" cannot be honored
   >
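
   The locale names in the sessionInfo() output above have the Windows form
   Language_Country.codepage (for example Dutch_Netherlands.1252), which
   differs from the POSIX-style name "en_US.UTF-8" used in the call. As a
   hedged sketch, the current settings can be inspected with base R alone.
   The variable names are illustrative only.

```r
# Current character-type locale as reported by the OS
ctype <- Sys.getlocale("LC_CTYPE")

# Localization details known to R: multibyte, UTF-8, and Latin-1 flags
info <- l10n_info()

# TRUE only when the session itself runs in a UTF-8 locale
is_utf8 <- isTRUE(info[["UTF-8"]])

cat("LC_CTYPE:     ", ctype, "\n")
cat("UTF-8 locale: ", is_utf8, "\n")
```

   This only reports the settings and does not change them.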
   In addition, I have attempted to make the change directly from the Windows
   command prompt, using "chcp 65001" and variations of that, but that didn't
   change anything.
   I have searched the list and the web and have found others raising
   similar issues, but have not been able to find a solution. It looks like
   this is an issue of how Windows and R interact. Unfortunately, all three
   computers at my disposal have this problem. It occurs both under WinXP-x32
   and under Win7-x86.
   Is there a way to make R override the windows settings or can the issue be
   solved otherwise?
   I have also tried other websites, and the issue occurs every time there
   is an é, Ì, À, î, et cetera in the text-to-be-scraped.
   Thank you,
   Roger

