[R] package tm: reading XML files

Ad Feelders A.J.Feelders at uu.nl
Tue May 29 10:03:58 CEST 2012


Dear fellow R users,

I'm using the tm package for text mining and have a problem reading in 
a corpus from XML files.
When I copy the example from "Introduction to the tm package", which uses 
the small Reuters subset "crude", everything goes well and I get a corpus 
with the required metadata.
However, when I read in the entire Reuters-21578 corpus in XML format (or 
a self-created subset thereof), the metadata is lost and the files are 
interpreted as plain text.
I use the following command, where the indicated directory contains all 
Reuters-21578 documents as separate XML files:

 > reuters21578 <- Corpus(DirSource("C:/Data/Reuters/preprocessed"), 
readerContol=list(reader=readReut21578XML))
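One thing worth checking in the command above: the argument is spelled 
readerContol, whereas the tm documentation names it readerControl. If 
Corpus() in this tm version accepts extra arguments through `...` (an 
assumption, not verified here), a misspelled name would be absorbed 
silently and the source's default reader would be used, which would match 
the plain-text symptom. The sketch below illustrates only the general R 
behaviour; make_corpus is a hypothetical stand-in, not tm's code.

```r
# Hypothetical stand-in for a constructor with a defaulted control argument
# and a `...` catch-all. This is NOT tm's actual source.
make_corpus <- function(src, readerControl = list(reader = "plain"), ...) {
  # Report which reader would end up being used.
  readerControl$reader
}

# Misspelled name: "readerContol" is not a prefix of "readerControl", so it
# cannot be partially matched. It falls into `...` and the default is kept.
misspelled <- make_corpus("dir", readerContol = list(reader = "xml"))

# Correct name: the supplied reader is used.
correct <- make_corpus("dir", readerControl = list(reader = "xml"))

print(misspelled)
print(correct)
```

If that is what happens here, the same call with the argument spelled 
readerControl=list(reader=readReut21578XML) may be enough to keep the 
metadata.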

I'm running R 2.15.0 under Windows XP.

Has anybody else encountered this problem and found a cause or solution?

Best regards,

-Ad Feelders


