[R] SOLVED: importing huge XML-Files -- new problem: special characters

Duncan Temple Lang duncan at wald.ucdavis.edu
Tue Sep 4 19:15:35 CEST 2007


Alexander Heidrich wrote:
> Hi all,
> 
> thanks to the people who replied to my question! I finally solved the
> issue by writing my own handlers and using xmlEventParse - which leads
> to the following problem, which is so odd that it's probably a bug.
> 
> I use several special characters in my XML file, e.g. umlauts or ° or
> µ - but no matter how I encode my XML (UTF or ISO) or whether I escape
> these characters, xmlEventParse always stops parsing after the first
> umlaut and pretends to have more than one node even if there is really
> just one!
> 
> Example:
> 
> <locations>abc	aböcd	abdec</locations>
> 
> causes two events for locations and produces output in the form of:
> 
> 	[,1]	[,2]	[,3]
> [1,]	abc
> [2,]	aböcd	abdec
> 

Well, your output is specific to your text event handlers, so
what you show us does not tell us what the inputs to those handlers were.
If you got two text events, with "abc     "
and "aböcd	abdec" (or with the trailing whitespace from the first
chunk arriving at the start of the second instead), that would not
surprise me.

The underlying XML parser is extracting content from a stream
of bytes. It makes no guarantee that contiguous text
content is delivered in a single event to the handlers.
Instead, it consumes as much of the stream as it wants
and delivers that and then continues from where it left off
in the stream. If it encounters a text node with a large amount
of text, it will deliver that in smaller chunks. 

This undoubtedly makes processing the stream slightly harder
for the handler, as it has to remember where it "was", but this is true
of all handlers, so it is not a significant burden.
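
To make the "remember where it was" point concrete, here is a minimal
sketch of handlers that collect the text chunks and only assemble them
once the element closes. The names makeLocationHandlers, buffer, results
and getResults are made up for this illustration. It assumes the XML
package is installed, and it passes trim = FALSE on the assumption that
you want to keep any whitespace that happens to sit at a chunk boundary.

```r
library(XML)

# Hypothetical helper: builds handlers that share state through a closure.
makeLocationHandlers <- function() {
  inside  <- FALSE        # are we currently within <locations>?
  buffer  <- character()  # text chunks seen so far for the current element
  results <- character()  # one assembled string per <locations> element

  list(
    startElement = function(name, attrs, ...) {
      if (name == "locations") {
        inside <<- TRUE
        buffer <<- character()
      }
    },
    text = function(content, ...) {
      # May be called several times for one text node; just collect.
      if (inside) buffer <<- c(buffer, content)
    },
    endElement = function(name, ...) {
      if (name == "locations") {
        results <<- c(results, paste(buffer, collapse = ""))
        inside  <<- FALSE
      }
    },
    getResults = function() results
  )
}

doc <- "<locations>abc\tab\u00f6cd\tabdec</locations>"
h <- makeLocationHandlers()
xmlEventParse(doc, handlers = h, asText = TRUE, trim = FALSE)

# Split on tabs only after the whole text has been reassembled.
fields <- strsplit(h$getResults(), "\t", fixed = TRUE)[[1]]
```

Splitting inside the text handler is what makes one element look like
several rows; splitting in endElement (or afterwards, as here) does not
depend on how the parser happened to chunk the text.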

The branches parameter of the xmlEventParse() function does provide
a way to mix SAX/event parsing with the easier DOM/node style parsing.
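
A rough sketch of the branches approach, assuming the XML package is
installed: the function registered under the element name is called with
the complete node once it has been built, so xmlValue() returns the whole
text regardless of how it was chunked. The wrapper element <doc> and the
variable collected are invented for this example.

```r
library(XML)

# Illustrative input; the <doc> wrapper is an assumption for this sketch.
doc <- "<doc><locations>abc\tab\u00f6cd\tabdec</locations></doc>"

collected <- list()  # filled in by the branch function below

branches <- list(
  # Called once per complete <locations> element, with the node already built.
  locations = function(node, ...) {
    txt <- xmlValue(node)
    collected[[length(collected) + 1]] <<-
      strsplit(txt, "\t", fixed = TRUE)[[1]]
  }
)

# No SAX handlers are needed here; the branch function does the work.
xmlEventParse(doc, handlers = list(), branches = branches, asText = TRUE)
```

After the parse, collected holds one character vector per <locations>
element, each with the tab-separated fields.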

 D.

> 
> Should it be like that? If I remove the umlauts, then everything is
> fine!
> 
> If I do the following:
> 
> <locations>öabc	aböcd	abdec</locations>
> 
> the output is
> 
> 	[,1]	[,2]	[,3]
> [1,]	öabc	aböcd	abdec
> 
> Any suggestions?
> 
> Thanks in advance and many greetings!
> 
> Alex
> 

-- 
Duncan Temple Lang                duncan at wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA


