[R] Reading PDF files

Tony B tony.breyal at googlemail.com
Tue Dec 22 20:29:42 CET 2009


Copied/pasted from my earlier reply:

It's been a while since I've
done this, but if memory serves, the basic process was to download
xpdf and add it to the Windows path, thus making it accessible from
within R. Two methods follow:

Method One (easiest) - using the awesome ?system command:

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it

(3) Call pdftotext.exe from R, adjusting both paths to your own machine:

# system(paste("[app]", "[pdf file]"), wait = FALSE)

> system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/tony/Desktop/test/r-intro.pdf"'), wait=FALSE)
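As a rough sketch of the same call with the result read back into R: if no output file is given, pdftotext writes a .txt file with the same base name next to the PDF. The paths are just the example ones from above, the helper txt_path_for is a made-up name for illustration, and I use wait = TRUE here (an assumption on my part, not what the one-liner does) so the text file is finished before R tries to read it.

```r
# Paths are the example ones from above; change them for your machine.
pdftotext <- "C:/Program Files/xpdf/pdftotext.exe"
pdf.file  <- "C:/Documents and Settings/tony/Desktop/test/r-intro.pdf"

# Hypothetical helper: the .txt name pdftotext uses when no output file is given
txt_path_for <- function(pdf) sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)
txt.file <- txt_path_for(pdf.file)

# Only run the conversion if both the program and the PDF actually exist
if (file.exists(pdftotext) && file.exists(pdf.file)) {
  # shQuote(type = "cmd") wraps each path in double quotes for Windows
  cmd <- paste(shQuote(pdftotext, type = "cmd"),
               shQuote(pdf.file, type = "cmd"))
  system(cmd, wait = TRUE)   # block until the .txt file has been written
  txt <- readLines(txt.file, warn = FALSE)
  print(head(txt))           # first few lines of the extracted text
}
```

If nothing is printed, one of the two paths does not exist on your machine, so check those first.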

Method Two - if you want to use the tm package like I did last year,
?readPDF requires the following (not documented anywhere that I know
of, but this is what you do):

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it
(3) Download the Redmond utility for adding folders to your Windows path
(free version button is in the top left of the page):
http://redmondlab.googlepages.com/path
(4) Unzip it
(5) Open the 'Redmond Path' application.
(6) Click on the green plus ('+') in the top left-hand corner.
(7) Navigate to the folder which contains the files: C:/../xpdf-3.02pl4-win32
(8) Add it and click OK.
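If you would rather not install a separate utility, a possible alternative (a sketch only, not something I tested at the time) is to extend the path from inside R for the current session and then check that pdftotext can be found. The folder in xpdf.dir is a hypothetical location; use wherever you actually unzipped xpdf. The change made by Sys.setenv only lasts until you close R.

```r
# Hypothetical unzip location; replace with your own xpdf folder
xpdf.dir <- "C:/xpdf-3.02pl4-win32"

# Append the folder to PATH for this R session only
old.path <- Sys.getenv("PATH")
Sys.setenv(PATH = paste(old.path, xpdf.dir, sep = .Platform$path.sep))

# Sys.which() returns "" when the program cannot be found on the PATH
found <- nzchar(Sys.which("pdftotext"))
if (found) {
  message("pdftotext found at: ", Sys.which("pdftotext"))
} else {
  message("pdftotext not found; check the folder in xpdf.dir")
}
```

The same Sys.which check is also a quick way to confirm that steps (1) to (8) worked; restart R first so it picks up the changed path.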

Then you can do something like:

> library(tm)
> my.path <- 'C:\\Documents and Settings\\tony\\Desktop\\pdfs\\' #put your pdfs in here
> Corpus(DirSource(my.path), readerControl = list(reader=readPDF))

There are some limitations to how well the conversions work depending
on the pdf file, but it was so long ago now that I'm afraid I don't
remember the details.

HTH.
Tony Breyal

On 22 Dec, 18:51, "Eusufzai, Zaki" <zeusu... at lmu.edu> wrote:
> Hi:
>
> I need to do text mining on PDF files. I understand there is a readPDF
> command in tm that can be used. Have read the 2008 posts on converting
> PDF files to text by Tony Breyal and others.
>
> Wondering if the procedure has been standardized in any tutorial or
> otherwise? Being new to R, I was able to follow only part of the
> discussion.
>
> Any way to get a set of step by step instructions appropriate for my
> level? I am an ageing academic who has worked mostly with SAS and
> MATLAB.
>
> Thanks.
>
> Zaki Eusufzai
>
>         [[alternative HTML version deleted]]
>
> ______________________________________________
> R-h... at r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.