[R] SAX Parser best practise

Wed Sep 21 17:43:57 CEST 2005

Jan Hummel wrote:
> Dear All,
> 
> I have a question regarding best practice in setting up an XML parser
> within R. 
> Because I have files with more than 100 MB and I'm only interested in
> some values I think a SAX-like parser using xmlEventParse() will be the
> best solution.
> Unfortunately the values I'm looking for, to construct some higher "mass
> spectrum", are distributed over different lines: as <spectrum id="2">,
> <mzArrayBinary>, <intenArrayBinary> <... name="MassToChargeRatio"
> value="445.598999"/> (as one can see in the XML snippet)
> 
> I know the mechanism of using Event Handlers, as shown in the examples.
> But what I'm looking for is: how can I use some "path information" as
> mentioned for the "addContext" parameter of xmlEventParse()? Could somebody
> share an example using "addContext = TRUE" and point me to the
> variables I may use if I implement the "..." parameter within my
> handlers?
> 

The addContext argument was an attempt to provide contextual information,
but it is not obvious how to do this efficiently. And of course
efficiency is the name of the game with the SAX model.
To give you path information for a node, the parser would have
to build it for every event, and that would slow things down. There are
no nodes in the SAX world, since we don't build the tree in any way.
So addContext currently doesn't do much. It is there
as a hook that we can use in the future if we want to.
In the meantime, you can track whatever context you need in your R
handler code.

> Do I have to implement a "state machine" using some variables within my
> handlers, or would one prefer to use the "state" parameter of
> xmlEventParse()?
> 

As Seth mentioned in his reply, you can keep state in your R handler
functions to determine where you are.  You can maintain a "stack" of
element names: push the name in the startElement() handler and pop it
in the endElement() handler.  The contents of the stack then give the
exact path of the current "node".
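
A minimal sketch of that stack idea, using a closure and <<- to keep
the path between calls. The XML snippet, the makeHandlers() helper and
the cvParam/ionSelection element names are assumptions made up for the
illustration; only <spectrum> and the MassToChargeRatio attribute come
from your mail, so adapt the tests to your real file layout.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical miniature document (the real layout is an assumption).
doc <- '<spectrumList>
  <spectrum id="2">
    <ionSelection>
      <cvParam name="MassToChargeRatio" value="445.598999"/>
    </ionSelection>
  </spectrum>
</spectrumList>'

# Hypothetical helper: returns the SAX handlers plus an accessor
# for whatever the handlers collected.
makeHandlers <- function() {
  path  <- character()  # stack of the currently open element names
  found <- list()       # values collected so far

  startElement <- function(name, attrs, ...) {
    path <<- c(path, name)  # push
    # attrs is a named character vector, or NULL when there are none
    if (name == "cvParam" && "spectrum" %in% path &&
        identical(unname(attrs["name"]), "MassToChargeRatio")) {
      found[[length(found) + 1L]] <<- list(
        path  = paste(path, collapse = "/"),
        value = as.numeric(attrs["value"]))
    }
  }

  endElement <- function(name, ...) {
    path <<- path[-length(path)]  # pop
  }

  list(handlers = list(startElement = startElement,
                       endElement   = endElement),
       result   = function() found)
}

h <- makeHandlers()
xmlEventParse(doc, handlers = h$handlers, asText = TRUE)
res <- h$result()
```

Each entry of res holds the path at which the value was seen and the
value itself, so the handlers never need the parser to supply context.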

The difference between maintaining state via environments/local
persistent scope (using <<- as in Seth's mail) and using the state
argument is more a matter of personal preference in R.  The state
argument was added for S-Plus since it does not support environments.
Using the state argument might save an epsilon amount of time, but the
difference is hopefully negligible.
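
For comparison, a sketch of the same bookkeeping done with the state
argument instead of <<-. Each handler takes the state as an argument
named .state and returns the updated object, and xmlEventParse() returns
the final state. The document and the cvParam element name are again
made up for the illustration.

```r
library(XML)  # assumes the XML package is installed

# Hypothetical miniature document (the real layout is an assumption).
doc <- paste0(
  '<spectrumList>',
  '<spectrum id="2"><cvParam name="MassToChargeRatio" value="445.598999"/></spectrum>',
  '<spectrum id="3"><cvParam name="MassToChargeRatio" value="512.25"/></spectrum>',
  '</spectrumList>')

handlers <- list(
  startElement = function(name, attrs, .state) {
    .state$path <- c(.state$path, name)  # push
    if (name == "cvParam" &&
        identical(unname(attrs["name"]), "MassToChargeRatio")) {
      .state$mz <- c(.state$mz, as.numeric(attrs["value"]))
    }
    .state  # the returned value becomes the new state
  },
  endElement = function(name, .state) {
    .state$path <- .state$path[-length(.state$path)]  # pop
    .state
  }
)

final <- xmlEventParse(doc, handlers = handlers, asText = TRUE,
                       state = list(path = character(), mz = numeric()))
```

Here final$mz holds the collected values and final$path is empty again
once every element has been closed.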

BTW, do you have a schema for the XML document you are working on?

> I would appreciate any assistance very much!
> 	Jan
> 