[R] parsing pdf files

Mark Wardle mark at wardle.org
Sun Jan 10 12:11:34 CET 2010


If you can use an R <-> Java interface (rJava, for example), you could
use iText to do this, as long as the PDF is fairly sane.

see http://itextpdf.com/

It is what pdftk uses.
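
A rough sketch of the idea, untested against your file. It assumes the
rJava package is installed and that you have downloaded an iText 5.x jar;
the jar path and the function name read_pdf_text are placeholders of mine,
not part of any package. The iText calls used are the PdfReader(String)
constructor, getNumberOfPages(), and the static
PdfTextExtractor.getTextFromPage(PdfReader, int) from the 5.x series.

```r
# Sketch only: assumes rJava is installed and an iText 5.x jar is available
# locally. read_pdf_text and both path arguments are hypothetical names.
read_pdf_text <- function(pdf_path, itext_jar) {
  if (!file.exists(itext_jar)) stop("iText jar not found: ", itext_jar)
  if (!file.exists(pdf_path)) stop("PDF not found: ", pdf_path)

  # Start the JVM (or extend its classpath) with iText available
  rJava::.jinit(classpath = itext_jar)

  # Java: new PdfReader(String filename)
  reader <- rJava::.jnew("com/itextpdf/text/pdf/PdfReader",
                         normalizePath(pdf_path))
  on.exit(rJava::.jcall(reader, "V", "close"))

  n_pages <- rJava::.jcall(reader, "I", "getNumberOfPages")

  # Java: static String PdfTextExtractor.getTextFromPage(PdfReader, int)
  pages <- vapply(seq_len(n_pages), function(i) {
    rJava::.jcall("com/itextpdf/text/pdf/parser/PdfTextExtractor", "S",
                  "getTextFromPage", reader, as.integer(i))
  }, character(1))

  # One element per line of text, similar to readLines() on a text export
  unlist(strsplit(pages, "\r?\n"))
}

# Possible usage (the jar file name is a placeholder):
# download.file("http://www.williams.edu/Registrar/geninfo/faculty.pdf",
#               "faculty.pdf", mode = "wb")
# faculty <- read_pdf_text("faculty.pdf", "itextpdf-5.x.jar")
```

How well the extracted lines match the layout of the page will depend on
how the PDF was produced, so expect to do some cleaning afterwards.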

Best wishes,

Mark

2010/1/9 David Kane <dave at kanecap.com>:
> I have a pdf file that I would like to parse into R:
>
> http://www.williams.edu/Registrar/geninfo/faculty.pdf
>
> For now, I open the file in Acrobat by hand, then save it "as text"
> and then use readLines(). That works fine but a) I am concerned that
> some information may be lost and b) I may be doing this a lot, so I
> would rather have R grab the information from the pdf file directly.
>
> So: is there something like readPDF() for R?
>
> Thanks,
>
> Dave Kane
>
> PS. If you're curious, here is the sort of work that I want to do with
> this data:
> http://www.ephblog.com/2010/01/08/class-update-and-faculty-ages/



-- 
Dr. Mark Wardle
Specialist registrar, Neurology
Cardiff, UK


