[R] text mining analysis and word visualization of pdfs

Mike Marchywka marchywka at hotmail.com
Thu May 19 13:26:25 CEST 2011

----------------------------------------
Date: Wed, 18 May 2011 15:24:49 +0530
From: ashimkapoor at gmail.com
To: karl at huftis.org
CC: r-help at stat.math.ethz.ch
Subject: Re: [R] text mining analysis and word visualization of pdfs


On Wed, May 18, 2011 at 1:44 PM, Karl Ove Hufthammer wrote:

> Ajay Ohri wrote:
>
> > What is the appropriate software package for dumping say 20 PDFS in a
> > folder, then creating data visualization with frequency counts of
> > certain words as well as measure correlation within each file for
> > certain key relationships or key words.
>
> pdftotext + Unix™ for Poets + R (ggplot2)
>
What about the tm package? I am a beginner and don't know much about
it, but I recall that it can handle PDFs. A few words
from the experts would be nice.

I don't know if I'm an expert (I can't even get a browser that echoes
keystrokes in a reasonable time with a 4-core CPU on 'dohs), but PDF
could mean just about anything in terms of how text is represented. Whatever
R packages do, they will not be able to read the mind of the author.
Even with pdftotext, there are many options, and even simple things like
US IRS instruction forms can be almost impossible to extract in a coherent
manner. Many authors couldn't care less about the information as long as the
thing looks like a paper copy. If you are stuck with PDF, I'd look
for more tools first, as you will probably want to know how the files are constructed.
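
For what it's worth, once a PDF has been turned into plain text, the
frequency-count part of the original question is the easy bit. Below is a
minimal sketch in base R, assuming the text was already extracted with
something like "pdftotext -layout report.pdf report.txt". The file name
"report" and the helper name word_freq are made up for illustration, and
the sample lines stand in for readLines("report.txt"). It says nothing
about whether the extraction itself was coherent.

```r
# word_freq is a hypothetical helper, not part of any package.
# It lower-cases the text, splits on anything that is not a-z,
# and returns word counts sorted from most to least frequent.
word_freq <- function(lines) {
  text  <- tolower(paste(lines, collapse = " "))
  words <- unlist(strsplit(text, "[^a-z]+"))
  words <- words[nzchar(words)]          # drop empty strings left by the split
  sort(table(words), decreasing = TRUE)
}

# Stand-in for: sample_lines <- readLines("report.txt")
sample_lines <- c("The form asks for the total.",
                  "Enter the total on line 7.")

freq <- word_freq(sample_lines)
print(head(freq, 5))
```

Splitting on non-letters is crude (it throws away numbers and breaks
hyphenated words), so treat it as a starting point for checking what
pdftotext actually gave you rather than as a finished analysis.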

I would just reiterate that the best approach for many data analysts would
be to contact the data source and explain the problems with improperly authored PDFs or
other specialized file formats that are supported only by limited proprietary tools
or that obfuscate the information of interest.