[R] cannot read iso639 table

William Dunlap wdunlap at tibco.com
Thu Sep 13 23:15:12 CEST 2012


> Pragmatically, one can zap the BOM from the output with
> 
> language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)

On Windows with locale "English_United States.1252" my R-2.15.1 could not
get that far:
  >  socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
  +                open="r",encoding="utf-8");
  > read.table(socket, quote="", sep="|")
    V1
  1  ?
  Warning messages:
  1: In read.table(socket, quote = "", sep = "|") :
    invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  2: In read.table(socket, quote = "", sep = "|") :
    incomplete final line found by readTableHeader on 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
  > str(.Last.value)
  'data.frame':   1 obs. of  1 variable:
   $ V1: Factor w/ 1 level "?": 1
An initial readChar was the only way I could get it to work there.
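
For anyone who wants to try the "throw away the first three bytes" idea without
going to the network, here is a minimal sketch that works at the byte level on a
local file. The temporary file and its two rows (shaped like the head(d) output
quoted below) are made up for illustration, and it uses readBin on the whole
file rather than readChar on a url() connection, so it is a variation on what I
did, not the same thing.

```r
# Hypothetical test file: a UTF-8 BOM (EF BB BF) followed by two "|"-separated rows.
tmp <- tempfile(fileext = ".txt")
bom <- as.raw(c(0xEF, 0xBB, 0xBF))
writeBin(c(bom, charToRaw("aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n")), tmp)

# Read the file as raw bytes and drop the leading BOM if it is there.
bytes <- readBin(tmp, what = "raw", n = file.info(tmp)$size)
has_bom <- length(bytes) >= 3L && identical(bytes[1:3], bom)
if (has_bom) bytes <- bytes[-(1:3)]

# Turn the remaining bytes back into text, mark it as UTF-8, and parse it.
txt <- rawToChar(bytes)
Encoding(txt) <- "UTF-8"
d <- read.table(text = txt, sep = "|", quote = "", stringsAsFactors = FALSE)
unlink(tmp)
```

Checking for the BOM before dropping it means the same code should also be safe
on files that do not start with one.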

Since Windows software seems to put a BOM at the top of a file to indicate that
it is using UTF-<something>, it would be nice if the connection code
at least had an option to deal with it.
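
A note for later readers: R 3.0.0 and newer accept the encoding name
"UTF-8-BOM" when reading, which drops a leading byte-order mark if one is
present. That is not available in the R-2.15.1 used above. A minimal sketch,
assuming a recent R and a made-up local file in place of the loc.gov URL:

```r
# Hypothetical test file: a UTF-8 BOM (EF BB BF) followed by two "|"-separated rows.
tmp <- tempfile(fileext = ".txt")
writeBin(c(as.raw(c(0xEF, 0xBB, 0xBF)),
           charToRaw("aar||aa|Afar|afar\nabk||ab|Abkhazian|abkhaze\n")), tmp)

# fileEncoding = "UTF-8-BOM" asks the connection to skip the BOM (needs R >= 3.0.0).
d <- read.table(tmp, sep = "|", quote = "", stringsAsFactors = FALSE,
                fileEncoding = "UTF-8-BOM")
unlink(tmp)
```

With this, the first field should come back as plain "aar" with nothing
prepended, so no substring() fix-up is needed afterwards.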

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: peter dalgaard [mailto:pdalgd at gmail.com]
> Sent: Thursday, September 13, 2012 1:43 PM
> To: William Dunlap
> Cc: sds at gnu.org; r-help at r-project.org
> Subject: Re: [R] cannot read iso639 table
> 
> Pragmatically, one can zap the BOM from the output with
> 
> language.ISO.table[1,1] <- substring(language.ISO.table[1,1],2)
> 
> and be gone with it.
> 
> It would be nicer to zap the BOM before read.table, though. It does work for me with the
> below (notice that the BOM is a single character if you don't use useBytes=).
> 
> > get.language.ISO.table
> function () {
>  socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",
>                open="r",encoding="utf-8");
>  readChar(socket, nchars=1)
>  data <- read.table(socket, as.is = TRUE, sep = "|", header = FALSE,
>                     col.names = c("a3bibliographic","a3terminologic",
>                       "a2","english","french"), quote="");
>  close(socket);
>  data
> }
> 
> 
> On Sep 13, 2012, at 22:26 , William Dunlap wrote:
> 
> > It would be helpful if you showed your commands and printed
> > outputs, copied directly from your R session, from the beginning
> > to the end.  I put the call to sessionInfo() in my message because
> > it is probably relevant.  It is nice to completely include the original
> > email when responding to it so others can see the whole story in
> > one place.
> >
> > Bill Dunlap
> > Spotfire, TIBCO Software
> > wdunlap tibco.com
> >
> >
> >> -----Original Message-----
> >> From: Sam Steingold [mailto:sam.steingold at gmail.com] On Behalf Of Sam Steingold
> >> Sent: Thursday, September 13, 2012 1:18 PM
> >> To: William Dunlap
> >> Cc: peter dalgaard; r-help at r-project.org
> >> Subject: Re: [R] cannot read iso639 table
> >>
> >>> * William Dunlap <jqhaync at gvopb.pbz> [2012-09-13 19:50:21 +0000]:
> >>>
> >>> On Windows with R-2.15.1 in a 1252 locale, I had to read (and toss) out
> >>> the initial 3 bytes (the byte-order mark?) to make things work:
> >>>
> >>>> socket <- url("http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt",open="r",encoding="utf-8")
> >>>> readChar(socket, nchars=3, useBytes=TRUE)
> >>>  [1] ""
> >>
> >> confirmed - first 3 bytes are "\357\273\277"
> >>
> >>>> d <- read.table(socket, quote="", sep="|", stringsAsFactors=FALSE)
> >>>> dim(d)
> >>>  [1] 485   5
> >>>> head(d)
> >>>     V1 V2 V3             V4      V5
> >>>  1 aar    aa           Afar    afar
> >>>  2 abk    ab      Abkhazian abkhaze
> >>>  3 ace             Achinese    aceh
> >>>  4 ach                Acoli   acoli
> >>>  5 ada              Adangme adangme
> >>>  6 ady       Adyghe; Adygei  adyghé
> >>
> >> alas, this is all I get:
> >>
> >> Warning message:
> >> In scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings,  :
> >>  invalid input found on input connection 'http://www.loc.gov/standards/iso639-2/ISO-639-2_utf-8.txt'
> >>
> >>  a3bibliographic a3terminologic a2        english  french
> >> 1             aar             NA aa           Afar    afar
> >> 2             abk             NA ab      Abkhazian abkhaze
> >> 3             ace             NA          Achinese    aceh
> >> 4             ach             NA             Acoli   acoli
> >> 5             ada             NA           Adangme adangme
> >> 6             ady             NA    Adyghe; Adygei   adygh
> >>
> >> note that the first non-ASCII character terminates the input.
> >>
> >> so, I still cannot read the data from the URL.
> >>
> >> I can read the file though - with quote="" (thanks Peter!) -
> >> except that the first record is "\357\273\277aar".
> >>
> >>
> >> --
> >> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X 11.0.11103000
> >> http://www.childpsy.net/ http://thereligionofpeace.com
> >> http://mideasttruth.com http://iris.org.il http://jihadwatch.org
> >> The only thing worse than X Windows: (X Windows) - X
> 
> --
> Peter Dalgaard, Professor,
> Center for Statistics, Copenhagen Business School
> Solbjerg Plads 3, 2000 Frederiksberg, Denmark
> Phone: (+45)38153501
> Email: pd.mes at cbs.dk  Priv: PDalgd at gmail.com