[R] Reading XML files masquerading as XL files

Duncan Temple Lang duncan at wald.ucdavis.edu
Wed Aug 10 23:20:47 CEST 2011


Hi Dennis

 That those files are in a directory/folder suggests that they were extracted from their
zip (.xlsx) file.  The following are the basic contents of an .xlsx file:
     1484  02-28-11 12:48   [Content_Types].xml
      733  02-28-11 12:48   _rels/.rels
      972  02-28-11 12:48   xl/_rels/workbook.xml.rels
      846  02-28-11 12:48   xl/workbook.xml
      940  02-28-11 12:48   xl/styles.xml
     1402  02-28-11 12:48   xl/worksheets/sheet2.xml
     7562  02-28-11 12:48   xl/theme/theme1.xml
     1888  02-28-11 12:48   xl/worksheets/sheet1.xml
      470  02-28-11 12:48   xl/sharedStrings.xml
      196  02-28-11 12:48   xl/calcChain.xml
    21316  02-28-11 12:48   docProps/thumbnail.jpeg
      629  02-28-11 12:48   docProps/core.xml
      828  02-28-11 12:48   docProps/app.xml
If most of these are present, I would ask whether the sender could give you the files without
unzipping them, or check that your software isn't automatically unzipping them for you.
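
One quick way to see what you actually have is to look at the first few bytes of each file.
The following is only a minimal sketch in base R: sniff_format() is a hypothetical helper name
(not from any package), and it assumes the XML files start directly with "<?xml", with no
byte-order mark or leading whitespace in front.

```r
# Classify a file by its leading bytes: "zip" (e.g. an intact .xlsx),
# "xml" (a plain XML document), or "other".
# sniff_format() is a hypothetical helper, not part of any package.
sniff_format <- function(path) {
  bytes <- readBin(path, what = "raw", n = 5L)

  # Zip archives start with the local file header signature "PK\003\004".
  zip_magic <- as.raw(c(0x50, 0x4B, 0x03, 0x04))
  if (length(bytes) >= 4L && identical(bytes[1:4], zip_magic)) {
    return("zip")
  }

  # Assumption: an XML file begins with the declaration "<?xml" (no BOM).
  if (length(bytes) >= 5L && identical(bytes[1:5], charToRaw("<?xml"))) {
    return("xml")
  }

  "other"
}
```

For a folder of files this could be applied as
table(vapply(list.files(pattern = "\\.xls$"), sniff_format, character(1))).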


Note that not all files in the .xlsx are sheets; the worksheet is the
basic entity that corresponds to a .csv file.

The xlsx package and my RExcelXML package will probably get you a fair bit of the way
in extracting the content, but they will probably need some tinkering since they expect
the different components to be in a zip archive.
There is also an office2010 package, which seems to overlap with what is in
xlsx, ROOXML, RWordXML and RExcelXML.
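
If the components really have been extracted into a folder, one alternative to tinkering with
the packages is to put them back into a zip archive first. This is a rough sketch rather than a
tested recipe: rezip_xlsx() is a hypothetical helper name, it relies on utils::zip(), which
needs an external zip program on the PATH (or set via R_ZIPCMD), and it assumes the folder
holds exactly the contents of one workbook.

```r
# Re-create an .xlsx (zip) archive from a folder of extracted components.
# rezip_xlsx() is a hypothetical helper, not part of any package.
rezip_xlsx <- function(dir, out = paste0(normalizePath(dir), ".xlsx")) {
  # Make the output path absolute before changing the working directory.
  out <- file.path(normalizePath(dirname(out)), basename(out))
  if (file.exists(out)) stop("output file already exists: ", out)

  # Zip from inside the folder so member names are relative
  # (e.g. "xl/workbook.xml"), as in the listing above.
  old <- setwd(dir)
  on.exit(setwd(old))

  # all.files = TRUE is needed to pick up "_rels/.rels", a dot file.
  members <- list.files(".", recursive = TRUE, all.files = TRUE)

  status <- utils::zip(out, files = members)
  if (status != 0) stop("zip failed with status ", status)
  out
}
```

The resulting file could then be tried with, for example, read.xlsx(path, sheetIndex = 1) from
the xlsx package; whether it reads cleanly will depend on the files.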

  D.



On 8/10/11 7:26 AM, Dennis Fisher wrote:
> R version 2.13.1
> OS X (or Windows)
> 
> Colleagues,
> 
> I received a number of files with a .xls extension.  These files open in XL and, by all appearances, are XL files.  However, it appears to me that the files are actually XML:
> 
>> readLines(dir()[16])[1:10]
>  [1] "<?xml version=\"1.0\"?>"                                                    
>  [2] "<Workbook xmlns=\"urn:schemas-microsoft-com:office:spreadsheet\""           
>  [3] " xmlns:o=\"urn:schemas-microsoft-com:office:office\""                       
>  [4] " xmlns:x=\"urn:schemas-microsoft-com:office:excel\""                        
>  [5] " xmlns:ss=\"urn:schemas-microsoft-com:office:spreadsheet\""                 
>  [6] " xmlns:html=\"http://www.w3.org/TR/REC-html40\">"                           
>  [7] " <DocumentProperties xmlns=\"urn:schemas-microsoft-com:office:office\">"    
>  [8] "  <Version>12.0</Version>"                                                  
>  [9] " </DocumentProperties>"                                                     
> [10] " <OfficeDocumentSettings xmlns=\"urn:schemas-microsoft-com:office:office\">"
> 
>  I had initially tried to read the files using read.xls (gdata) but that failed (not surprisingly).  I could open each Excel file, then "save as" csv, then use read.csv.  However, there are many files so I would love to have a solution that does not require this brute force approach.
> 
> Are there any packages that would allow me to read these files without the additional steps?
> 
> Dennis
> 
> 
> Dennis Fisher MD
> P < (The "P Less Than" Company)
> Phone: 1-866-PLessThan (1-866-753-7784)
> Fax: 1-866-PLessThan (1-866-753-7784)
> www.PLessThan.com
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


