[R] Reading PDF files (using xpdf)

Tony Breyal tony.breyal at googlemail.com
Tue Dec 22 13:03:40 CET 2009


Greetings Zaki,

You should really post this question on the R-help forum so that
others might benefit from any responses. It's been a while since I've
done this, but if memory serves, the basic process was to download
xpdf and add it to the Windows PATH, thus making it accessible from
within R. Two methods follow:

Method One (easiest) - using the awesome ?system command:

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it

# system(paste("[app]", "[pdf file]"), wait = FALSE)
> system(paste('"C:/Program Files/xpdf/pdftotext.exe"', '"C:/Documents and Settings/tony/Desktop/test/r-intro.pdf"'), wait=FALSE)
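To get the converted text back into R, a minimal sketch follows. It assumes
the same two paths as the call above (both are placeholders for wherever you
unzipped xpdf and saved your PDF) and relies on pdftotext writing its output
next to the input file with a .txt extension when no output file is named.
The helper names txt.path and convert.pdf are made up for illustration. It
waits for the conversion to finish (wait = TRUE) so that the text file exists
before R tries to read it.

```r
# Assumed locations; adjust both to match your machine.
pdftotext <- "C:/Program Files/xpdf/pdftotext.exe"
pdf.file  <- "C:/Documents and Settings/tony/Desktop/test/r-intro.pdf"

# pdftotext names its output after the input, swapping .pdf for .txt
# (hypothetical helper, not part of any package)
txt.path <- function(pdf) sub("\\.pdf$", ".txt", pdf, ignore.case = TRUE)

# Hypothetical helper: convert one PDF and return its text as lines
convert.pdf <- function(pdf, exe = pdftotext) {
  # shQuote() protects the spaces in the paths; wait = TRUE blocks until done
  status <- system(paste(shQuote(exe), shQuote(pdf)), wait = TRUE)
  if (status != 0) stop("pdftotext failed with exit status ", status)
  readLines(txt.path(pdf), warn = FALSE)
}

# Only run the conversion if both files are really there
if (file.exists(pdftotext) && file.exists(pdf.file)) {
  txt <- convert.pdf(pdf.file)
  print(head(txt))
}
```
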


Method Two - if you want to use the tm package like I did last year,
?readPDF requires the following (not documented anywhere that I know
of, but this is what you do):

(1) Download xpdf (whichever is the latest version):
ftp://ftp.foolabs.com/pub/xpdf/xpdf-3.02pl4-win32.zip
(2) Unzip it
(3) Download the Redmond utility for adding folders to your Windows PATH
(free version button is in the top left of the page):
http://redmondlab.googlepages.com/path
(4) Unzip it
(5) Open the 'Redmond Path' application.
(6) Click on the green plus in the top left hand corner '+'.
(7) Navigate to the folder which contains the files: C:/../xpdf-3.02pl4-win32
(8) Add it and click Ok.
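
Before moving on, it may be worth confirming from within R that the PATH
change took effect. The sketch below uses base R's Sys.which(); the helper
name on.path is made up for illustration. It is probably necessary to
restart R after editing the PATH, since a running session keeps the
environment it started with.

```r
# Sys.which() returns "" for a program that cannot be found on the PATH
# (on.path is a hypothetical helper, not part of any package)
on.path <- function(program) unname(nzchar(Sys.which(program)))

if (on.path("pdftotext")) {
  message("pdftotext found at: ", Sys.which("pdftotext"))
} else {
  message("pdftotext not found; check the PATH entry and restart R")
}
```
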

Then you can do something like:
> library(tm)
> my.path <- 'C:\\Documents and Settings\\tony\\Desktop\\pdfs\\' #put your pdfs in here
> Corpus(DirSource(my.path), readerControl = list(reader=readPDF))

There are some limitations to how well the conversions work depending
on the pdf file, but it was so long ago now that I'm afraid I don't
remember the details.

HTH.
Tony Breyal




2009/12/22  <zeusufza at lmu.edu>:
> Hi:
>
> I am very new to R. I just read through your 2008 posts on converting PDF files to text. I have exactly the same goal.
>
> Has the procedure been standardized in any tutorial? I was able to follow only part of the discussion. Any way to get a set of step by step instructions?
>
> Thanks.
> Zaki Eusufzai
>






