[R] Need help extracting info from XML file using XML package

Duncan Temple Lang duncan at wald.ucdavis.edu
Mon Mar 2 14:14:05 CET 2009



Wacek Kusnierczyk wrote:
> Don MacQueen wrote:
>> I have an XML file that has within it the coordinates of some polygons
>> that I would like to extract and use in R. The polygons are nested
>> rather deeply. For example, I found by trial and error that I can
>> extract the coordinates of one of them using functions from the XML
>> package:
>>
>>   doc <- xmlInternalTreeParse('doc.kml')
>>   docroot <- xmlRoot(doc)
>>   pgon <-    
> 
> try
> 
>     lapply(
>        xpathSApply(doc, '//Polygon',
>           xpathSApply, '//coordinates', function(node)
>               strsplit(xmlValue(node), split=',|\\s+')),
>        as.numeric)


Just for the record, the XPath expression in the
second xpathSApply would need to be
    ".//coordinates"
to start searching from the previously matched Polygon node.
Otherwise, the search starts from the top of the document again.
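
To illustrate the difference between the relative and absolute forms, here is
a minimal sketch on a made-up two-polygon document. The data are hypothetical
and carry no XML namespace; a real KML file usually declares one, in which
case the paths would need a namespace prefix.

```r
library(XML)

# Hypothetical two-polygon document, invented for illustration only.
txt <- '<doc>
  <Polygon id="1"><outerBoundaryIs><LinearRing>
    <coordinates>1,2,3 4,5,6</coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
  <Polygon id="2"><outerBoundaryIs><LinearRing>
    <coordinates>7,8,9 10,11,12</coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
</doc>'
doc <- xmlInternalTreeParse(txt, asText = TRUE)

polygons <- getNodeSet(doc, "//Polygon")

# Relative path: only the coordinates nodes beneath the first Polygon.
n_relative <- length(getNodeSet(polygons[[1]], ".//coordinates"))

# Absolute path: every coordinates node in the document, no matter
# which Polygon node the search was called on.
n_absolute <- length(getNodeSet(polygons[[1]], "//coordinates"))
```

On this dummy data the relative search should find one node and the absolute
search should find both.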

However, it would seem that

   xpathSApply(doc, "//Polygon//coordinates",
                 function(node) strsplit(.....))

would be more direct, i.e. fetch the coordinates nodes in a single
XPath expression.
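
As a rough sketch of that single-expression approach, again on hypothetical
dummy data without a namespace: the handler function below is only one
possible way to do the conversion, and it assumes every coordinates node
holds whitespace-separated "x,y,z" triples as in Don's example. It uses
xpathApply rather than xpathSApply so that the result is always a list.

```r
library(XML)

# Hypothetical two-polygon document, invented for illustration only.
txt <- '<doc>
  <Polygon id="1"><outerBoundaryIs><LinearRing>
    <coordinates>
      1,2,3
      4,5,6 </coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
  <Polygon id="2"><outerBoundaryIs><LinearRing>
    <coordinates>7,8,9 10,11,12</coordinates>
  </LinearRing></outerBoundaryIs></Polygon>
</doc>'
doc <- xmlInternalTreeParse(txt, asText = TRUE)

# One XPath expression fetches the coordinates nodes under all polygons.
coords <- xpathApply(doc, "//Polygon//coordinates", function(node) {
  # Trim surrounding whitespace first so the split yields no empty strings,
  # then split on commas or runs of whitespace.
  values <- strsplit(trimws(xmlValue(node)), ",|\\s+")[[1]]
  # Assumption: three values per point, one point per row.
  matrix(as.numeric(values), ncol = 3, byrow = TRUE)
})
```

Here coords is a list with one three-column numeric matrix per coordinates
node. If the polygon ids are needed as names, they would have to be fetched
separately, e.g. from the id attribute of each Polygon node.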

   D.




> 
> which should find all polygon nodes, extract the coordinates node for
> each polygon separately, split the coordinates string by comma and
> convert to a numeric vector, and then report a list of such vectors, one
> vector per polygon.
> 
> i've tried it on some dummy data made up from your example below.  the
> xpath patterns may need to be adjusted, depending on the actual
> structure of your xml file, as may the strsplit pattern.
> 
> vQ
> 
> 
> 
> 
> 
> 
>> but this is hardly general!
>>
>> I'm hoping there is some relatively straightforward way to use
>> functions from the XML package to recursively descend the structure
>> and return the text strings representing the polygons into, say, a
>> list with as many elements as there are polygons. I've been looking at
>> several XML documentation files downloaded from
>> http://www.omegahat.org/RSXML/ , but since my understanding of XML is
>> weak at best, I'm having trouble.  I can deal with converting the text
>> strings to an R object suitable for plotting etc.
>>
>>
>> Here's a look at the structure of this file
>>
>> graphics[5]% grep Polygon doc.kml
>>         <Polygon id="15342">
>>         </Polygon>
>>         <Polygon id="1073">
>>         </Polygon>
>>         <Polygon id="16508">
>>         </Polygon>
>>         <Polygon id="18665">
>>         </Polygon>
>>         <Polygon id="32903">
>>         </Polygon>
>>         <Polygon id="5232">
>>         </Polygon>
>>
>> And each of the <Polygon> </Polygon> pairs has <coordinates> as per
>> this example:
>>
>>
>>     <Polygon id="15342">
>>         <outerBoundaryIs>
>>             <LinearRing id="11467">
>>                 <coordinates>
>> -23.679835352296,30.263840290388,5.000000000000001
>> -23.68138782285701,30.264740875186,5.000000000000001
>>    [snip]
>> -23.679835352296,30.263840290388,5.000000000000001
>> -23.679835352296,30.263840290388,5.000000000000001 </coordinates>
>>             </LinearRing>
>>         </outerBoundaryIs>
>>     </Polygon>
>>
>>
>> Thanks!
>> -Don
>>
>>
>> p.s.
>> There is a lot of other stuff in this file, i.e., some points, and
>> attributes of the points such as color, as well as a legend describing
>> what the polygons mean, but I can get by without all that stuff, at
>> least for now.
>>
>> Note also that readOGR() would in principle work, but the underlying
>> OGR libraries have some limitations that this file exceeds. Per info
>> at http://www.gdal.org/ogr/drv_kml.html.
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



