[R] Weird 'xmlEventParse' encoding issue

Sascha Wolfer sascha at cognition.uni-freiburg.de
Mon Jul 15 13:41:04 CEST 2013


Dear list,

I have got a weird encoding problem with the xmlEventParse() function 
from the 'XML' package.

I spent several hours searching the web for an answer, and a Stack 
Exchange question did not produce a solution either :(

So here's the problem. I created a small XML test file, which looks like 
this:

<?xml version="1.0" encoding="iso-8859-1"?>
<!DOCTYPE testFile>
<s type="manual">auch der Schulleiter steht dafür zur Verfügung. Das ist 
seßhaft mit ä und ö...</s>

This file is encoded in ISO-8859-1, which is also declared in its XML 
header.
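
In case anyone wants to reproduce this: below is a minimal sketch of how 
an equivalent file could be written from R. It is not necessarily how my 
file was created. The names xml.lines, test.file and raw.bytes are only 
illustrative, and the sketch assumes that iconv() can convert to 
"latin1" on your platform.

```r
# Test document as UTF-8 strings; \u escapes keep this script ASCII-only.
xml.lines <- c(
  '<?xml version="1.0" encoding="iso-8859-1"?>',
  '<!DOCTYPE testFile>',
  paste0('<s type="manual">auch der Schulleiter steht daf\u00fcr zur ',
         'Verf\u00fcgung. Das ist se\u00dfhaft mit \u00e4 und \u00f6...</s>')
)

# Convert to Latin-1 and write the bytes unchanged, so the file on disk
# matches the encoding declared in its XML header.
test.file <- tempfile(fileext = ".xml")
writeLines(iconv(xml.lines, from = "UTF-8", to = "latin1"),
           test.file, useBytes = TRUE)

# Read the raw bytes back to check how the umlauts were stored.
raw.bytes <- readBin(test.file, what = "raw", n = file.size(test.file))
print(as.raw(0xfc) %in% raw.bytes)  # 0xfc is the single Latin-1 byte for "ü"
```
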

I have 3 handler functions, definitions as follows:

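# Start-element handler: begin collecting text when an <s> element opens.
sE2 <- function (name, attrs) {
   if (name == "s") {
     get.text <<- TRUE
   }
}

# End-element handler: stop collecting when the <s> element closes.
eE2 <- function (name, attrs) {
   if (name == "s") {
     get.text <<- FALSE
   }
}

# Text handler: append each non-empty chunk of text to the global vector.
tS2 <- function (content, ...) {
   if (get.text && nchar(content) > 0) {
     collected.text <<- c(collected.text, content)
   }
}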

I have one wrapper function around xmlEventParse(), definition as follows:

get.all.text <- function (file) {
   t1 <- Sys.time()
   # Read the whole file and collapse it into a single string.
   read.file <- paste(readLines(file, encoding = ""), collapse = " ")
   print(read.file)
   # Reset the global state used by the handlers.
   assign("collected.text", c(), envir = .GlobalEnv)
   assign("get.text", FALSE, envir = .GlobalEnv)
   xmlEventParse(read.file, asText = TRUE,
                 handlers = list(startElement = sE2,
                                 endElement = eE2,
                                 text = tS2),
                 error = function (...) { },
                 saxVersion = 1)
   t2 <- Sys.time()
   cat("That took", round(difftime(t2, t1, units = "secs"), 1), "seconds.\n")
   cat("Result of reading is in variable 'collected.text'.\n")
   collected.text
}

The output of calling get.all.text(<test file>) is as follows:
[1] "<?xml version=\"1.0\" encoding=\"iso-8859-1\"?> <!DOCTYPE testFile> <s type=\"manual\">auch der Schulleiter steht dafür zur Verfügung. Das ist seßhaft mit ä und ö...</s> "
That took 0 seconds.
Result of reading is in variable 'collected.text'.
[1] "auch der Schulleiter steht daf"                        "ür zur Verfügung. Das ist seßhaft mit ä und ö..."

Now the REALLY weird thing (for me) is that R obviously reads the file 
correctly with readLines() (first output). That string is then passed 
to xmlEventParse(). What comes back is broken, and sometimes weird 
breaks are also inserted where special characters occur.

Do you have any ideas how to solve this problem?

I cannot use the xmlParse() function because I need the SAX 
functionality of xmlEventParse(). I also tried reading the file with 
xmlEventParse() directly (with asText = F). No changes...
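
One thing I could still try is sketched below. It rests on an assumption 
I have not confirmed for the 'XML' package: that the text handler may be 
called several times for a single element, which SAX-style parsers are 
generally allowed to do. The sketch buffers the chunks of one element 
and joins them when the element ends. make.text.collector() is a 
hypothetical helper of my own, not part of the package, and the handler 
calls are simulated by hand so the snippet runs without 'XML' installed. 
It would only address the splitting, not any garbled characters.

```r
# Hypothetical helper (not from the 'XML' package): builds handlers that
# buffer text chunks per element in a closure instead of global variables.
make.text.collector <- function (element = "s") {
  chunks <- character(0)     # chunks of the element currently being read
  collected <- character(0)  # one joined string per finished element
  inside <- FALSE

  list(
    startElement = function (name, attrs, ...) {
      if (name == element) {
        inside <<- TRUE
        chunks <<- character(0)
      }
    },
    endElement = function (name, ...) {
      if (name == element) {
        inside <<- FALSE
        # Join all chunks that arrived for this element into one string.
        collected <<- c(collected, paste(chunks, collapse = ""))
      }
    },
    text = function (content, ...) {
      if (inside && nchar(content) > 0) {
        chunks <<- c(chunks, content)
      }
    },
    result = function () collected
  )
}

# Simulate the callbacks in the order suggested by the output above:
# the text of one <s> element arrives in two pieces.
h <- make.text.collector("s")
h$startElement("s", c(type = "manual"))
h$text("auch der Schulleiter steht daf")
h$text("\u00fcr zur Verf\u00fcgung.")
h$endElement("s")
joined <- h$result()
print(joined)
```

If the assumption holds, h[c("startElement", "endElement", "text")] 
could presumably be passed as the handlers list to xmlEventParse(), with 
h$result() called afterwards to fetch the joined text.
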

Thanks a lot,
Sascha W.


