[R] Weird 'xmlEventParse' encoding issue

Duncan Temple Lang dtemplelang at ucdavis.edu
Tue Jul 16 15:46:25 CEST 2013


Hi Sascha

 Your code gives the correct results on my machine (OS X),
either reading from the file directly or via readLines() and passing
the text to xmlEventParse().
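
 The two routes mentioned above can be sketched roughly as follows. This is
only a minimal sketch under stated assumptions: the temporary test file, its
content, and the helper collectText() are made up for illustration and are
not part of the original code. The handlers keep their state in a closure
instead of global variables, and the pieces are pasted together at the end
because a SAX text handler may be called more than once for one text node.

```r
library(XML)

# Write a small ISO-8859-1 test file as raw bytes, so the sketch does not
# depend on the locale of the session. The content is illustrative only.
xml <- paste0('<?xml version="1.0" encoding="iso-8859-1"?>\n',
              '<s type="manual">daf\u00fcr zur Verf\u00fcgung</s>\n')
path <- tempfile(fileext = ".xml")
writeBin(iconv(xml, from = "UTF-8", to = "latin1", toRaw = TRUE)[[1]], path)

# Hypothetical helper: collect the text found inside <s> elements.
collectText <- function(input, asText = FALSE) {
  inside <- FALSE          # are we currently inside an <s> element?
  chunks <- character()    # text pieces, in the order they arrive
  handlers <- list(
    startElement = function(name, attrs, ...) {
      if (name == "s") inside <<- TRUE
    },
    endElement = function(name, ...) {
      if (name == "s") inside <<- FALSE
    },
    text = function(content, ...) {
      if (inside && nchar(content) > 0) chunks <<- c(chunks, content)
    }
  )
  # trim = FALSE keeps whitespace at the edges of each piece intact
  xmlEventParse(input, handlers = handlers, asText = asText,
                trim = FALSE, saxVersion = 1)
  paste(chunks, collapse = "")
}

# Route 1: let the parser read the file itself
fromFile <- collectText(path)

# Route 2: read the lines in R and pass the text to the parser
txt <- paste(readLines(path), collapse = "\n")
fromText <- collectText(txt, asText = TRUE)

print(fromFile)
print(identical(fromFile, fromText))
```

 If the two results differ, or the non-ASCII characters come back garbled,
that would suggest looking at the libxml2 build or the locale before
suspecting the handlers.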

 The problem might be the version of the XML package or your environment
settings (for example, the locale), so it is important to report the
session information. Please provide the output from

   sessionInfo()
   Sys.getenv()
   libxmlVersion()
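
 Since Sys.getenv() prints a lot, the settings most likely to matter for an
encoding problem can also be pulled out separately. This is only a sketch:
which variables are actually relevant on your system is an assumption, and
the variable names below are just the usual POSIX locale ones.

```r
# Locale of the running R session (LC_CTYPE governs character handling)
loc <- Sys.getlocale()
print(loc)

# Whether R considers the session to be multibyte / UTF-8 / Latin-1
info <- l10n_info()
print(info)

# Locale-related environment variables; unset ones come back as ""
vars <- Sys.getenv(c("LANG", "LC_ALL", "LC_CTYPE"))
print(vars)

# Versions of the XML package and of libxml2, if the package is installed
if (requireNamespace("XML", quietly = TRUE)) {
  print(packageVersion("XML"))
  print(XML::libxmlVersion())
}
```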


 D

On 7/15/13 4:41 AM, Sascha Wolfer wrote:
> Dear list,
> 
> I have got a weird encoding problem with the xmlEventParse() function from the 'XML' package.
> 
> I searched the web for an answer for several hours, and a Stack Exchange question did not lead to a solution either :(
> 
> So here's the problem. I created a small XML test file, which looks like this:
> 
> <?xml version="1.0" encoding="iso-8859-1"?>
> <!DOCTYPE testFile>
> <s type="manual">auch der Schulleiter steht dafür zur Verfügung. Das ist seßhaft mit ä und ö...</s>
> 
> This file is encoded as ISO-8859-1, which is also declared in its XML header.
> 
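> I have three handler functions, defined as follows:
> 
> # Start of an element: begin collecting text inside <s>
> sE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- TRUE
>   }
> }
> 
> # End of an element: stop collecting once </s> is reached
> eE2 <- function (name, attrs) {
>   if (name == "s") {
>     get.text <<- FALSE
>   }
> }
> 
> # Text content: append non-empty text while inside <s>
> tS2 <- function (content, ...) {
>   if (get.text && nchar(content) > 0) {
>     collected.text <<- c(collected.text, content)
>   }
> }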
> 
> I have one wrapper function around xmlEventParse(), defined as follows:
> 
> get.all.text <- function (file) {
>   t1 <- Sys.time()
>   read.file <- paste(readLines(file, encoding = ""), collapse = " ")
>   print(read.file)
>   # Reset the global state used by the handlers
>   assign("collected.text", c(), envir = .GlobalEnv)
>   assign("get.text", FALSE, envir = .GlobalEnv)
>   xmlEventParse(read.file, asText = TRUE,
>                 handlers = list(startElement = sE2,
>                                 endElement = eE2,
>                                 text = tS2),
>                 error = function (...) { },
>                 saxVersion = 1)
>   t2 <- Sys.time()
>   cat("That took", round(difftime(t2, t1, units = "secs"), 1), "seconds.\n")
>   cat("Result of reading is in variable 'collected.text'.\n")
>   collected.text
> }
> 
> The output of calling get.all.text(<test file>) is as follows:
> [1] "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?> <!DOCTYPE testFile> <s type=\"manual\">auch der Schulleiter steht dafür zur Verfügung. Das ist seßhaft mit ä und ö...</s> "
> That took 0 seconds.
> Result of reading is in variable 'collected.text'.
> [1] "auch der Schulleiter steht daf"                        "ür zur Verfügung. Das ist seßhaft mit ä und ö..."
> 
> Now the REALLY weird thing (for me) is that R obviously reads in the file correctly with readLines() (first output).
> That text is then passed to xmlEventParse(). Afterwards the output is broken, and it sometimes also contains weird breaks
> where special characters occur.
> 
> Do you have any ideas how to solve this problem?
> 
> I cannot use the xmlParse() function because I need the SAX functionality of xmlEventParse(). I also tried reading the
> file with xmlEventParse() directly (with asText = FALSE), but that made no difference.
> 
> Thanks a lot,
> Sascha W.
> 