[R] Need help extracting info from XML file using XML package

Romain Francois romain.francois at dbmail.com
Mon Mar 2 12:54:37 CET 2009


Hi,

You might also want to check out R4X:

# install.packages("R4X", repos="http://R-Forge.R-project.org")
require( "R4X" )
x <- xml("http://code.google.com/apis/kml/documentation/KML_Samples.kml")
coords <- x["////Polygon///coordinates/#" ]
data <- sapply( strsplit( coords, "(,|\\s+)" ), as.numeric )
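
For comparison, here is a minimal sketch of the same extraction with the
XML package itself. It is an illustration under assumptions, not a tested
recipe for your file: the inline string is a made-up stand-in for doc.kml,
polygon_coords() is a hypothetical helper name, no XML namespace is
declared, only the first <coordinates> node of each polygon is used, and
the lon/lat/alt column names follow the usual KML ordering.

```r
library(XML)

# Hypothetical stand-in for doc.kml: two polygons, no XML namespace declared.
kml <- '<Document>
  <Polygon id="1">
    <outerBoundaryIs><LinearRing><coordinates>
-23.67,30.26,5
-23.68,30.27,5
-23.67,30.26,5 </coordinates></LinearRing></outerBoundaryIs>
  </Polygon>
  <Polygon id="2">
    <outerBoundaryIs><LinearRing><coordinates>
-24.10,31.00,5
-24.20,31.10,5
-24.10,31.00,5 </coordinates></LinearRing></outerBoundaryIs>
  </Polygon>
</Document>'

doc <- xmlParse(kml, asText = TRUE)

# Hypothetical helper: one polygon node -> numeric matrix, one row per vertex.
polygon_coords <- function(pg) {
  # ".//coordinates" is relative to this polygon node; a bare
  # "//coordinates" would be evaluated from the document root.
  txt <- xpathSApply(pg, ".//coordinates", xmlValue)
  # Assumption: keep only the first coordinates node (the outer boundary),
  # and trim surrounding whitespace so strsplit() yields no empty field.
  txt <- gsub("^\\s+|\\s+$", "", txt[1])
  values <- as.numeric(strsplit(txt, "[,[:space:]]+")[[1]])
  matrix(values, ncol = 3, byrow = TRUE,
         dimnames = list(NULL, c("lon", "lat", "alt")))
}

polygons <- getNodeSet(doc, "//Polygon")
coords <- lapply(polygons, polygon_coords)

# Name each list element after the polygon's id attribute.
names(coords) <- sapply(polygons, xmlGetAttr, "id")
```

coords is then a list with one matrix per polygon, so coords[["1"]][, c("lon", "lat")]
can be handed to polygon() or similar. If the real file declares a default
namespace, as many KML files do, the XPath expressions will probably need a
namespace prefix before they match anything.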

Romain

-- 
Romain Francois
Independent R Consultant
+33(0) 6 28 91 30 30
http://romainfrancois.blog.free.fr



Wacek Kusnierczyk wrote:
> Don MacQueen wrote:
>   
>> I have an XML file that has within it the coordinates of some polygons
>> that I would like to extract and use in R. The polygons are nested
>> rather deeply. For example, I found by trial and error that I can
>> extract the coordinates of one of them using functions from the XML
>> package:
>>
>>   doc <- xmlInternalTreeParse('doc.kml')
>>   docroot <- xmlRoot(doc)
>>   pgon <-    
>>     
>
> try
>
>     lapply(
>        xpathSApply(doc, '//Polygon',
>           xpathSApply, './/coordinates', function(node)
>               strsplit(xmlValue(node), split=',|\\s+')),
>        as.numeric)
>
> which should find all polygon nodes, extract the coordinates node for
> each polygon separately, split the coordinates string on commas and
> whitespace, convert the pieces to a numeric vector, and then report a
> list of such vectors, one vector per polygon.
>
> i've tried it on some dummy data made up from your example below.  the
> xpath patterns may need to be adjusted, depending on the actual
> structure of your xml file, as may the strsplit pattern.
>
> vQ
>
>> but this is hardly general!
>>
>> I'm hoping there is some relatively straightforward way to use
>> functions from the XML package to recursively descend the structure
>> and return the text strings representing the polygons into, say, a
>> list with as many elements as there are polygons. I've been looking at
>> several XML documentation files downloaded from
>> http://www.omegahat.org/RSXML/ , but since my understanding of XML is
>> weak at best, I'm having trouble.  I can deal with converting the text
>> strings to an R object suitable for plotting etc.
>>
>>
>> Here's a look at the structure of this file
>>
>> graphics[5]% grep Polygon doc.kml
>>         <Polygon id="15342">
>>         </Polygon>
>>         <Polygon id="1073">
>>         </Polygon>
>>         <Polygon id="16508">
>>         </Polygon>
>>         <Polygon id="18665">
>>         </Polygon>
>>         <Polygon id="32903">
>>         </Polygon>
>>         <Polygon id="5232">
>>         </Polygon>
>>
>> And each of the <Polygon> </Polygon> pairs has <coordinates> as per
>> this example:
>>
>>
>>     <Polygon id="15342">
>>         <outerBoundaryIs>
>>             <LinearRing id="11467">
>>                 <coordinates>
>> -23.679835352296,30.263840290388,5.000000000000001
>> -23.68138782285701,30.264740875186,5.000000000000001
>>    [snip]
>> -23.679835352296,30.263840290388,5.000000000000001
>> -23.679835352296,30.263840290388,5.000000000000001 </coordinates>
>>             </LinearRing>
>>         </outerBoundaryIs>
>>     </Polygon>
>>
>>
>> Thanks!
>> -Don
>>
>>
>> p.s.
>> There is a lot of other stuff in this file, i.e., some points, and
>> attributes of the points such as color, as well as a legend describing
>> what the polygons mean, but I can get by without all that stuff, at
>> least for now.
>>
>> Note also that readOGR() would in principle work, but the underlying
>> OGR libraries have some limitations that this file exceeds, per the info
>> at http://www.gdal.org/ogr/drv_kml.html.
>>