[R] read data from pdf file

Thomas Schönhoff tschoenhoff at gmail.com
Fri Oct 21 21:39:47 CEST 2005


2005/10/21, Ted Harding <Ted.Harding at nessie.mcc.ac.uk>:
> On 21-Oct-05 Marco Venanzi wrote:
> > Hi, I'm trying to read data from a PDF file. Is it possible to do it
> > with R? Thanks,  Marco
>
> Basically, No.
>
> But you may be lucky with "copy&paste" using the mouse, from
> the display generated in Acrobat Reader to a text file.
>
> The basic procedure here is
>
> 1. Click on the "Text Select Tool" (a button usually marked with a "T");
>
> 2. Use the mouse to highlight the block of text you want to copy;
>
> 3. Depending on your operating system/graphics display: In Windows
>    you have (IIRC) to go to "Edit"->"Copy"; in Unix/Linux with
>    X Windows do nothing;
>
> 4. "Paste" it into your text file, again as appropriate for your
>    operating system.
>
> However, you may not be lucky.
>
> PDF can store its content in strange ways, and what may look on the
> screen like contiguous and consecutive text is stored internally
> in separate "blocks" (what PDF calls "objects"). And this can apply
> even to little bits of text in a paragraph.
>
> When you paste the marked text, it will go in in the order that
> PDF finds the blocks in the file. As a result, your text file
> may contain bits of text in random order.
>
> This especially applies to things arranged in tables. But it
> very much depends on the software that generated the PDF in
> the first place.
>
> Since often the data in a PDF file which you may want to copy
> in this way will be tabular, you are likely to encounter this
> problem!
>
> You can tell this is going to happen when you use the mouse to
> highlight the text you intend to copy: starting with the mouse
> in say the top LH corner, move it slowly towards the lower
> RH corner of the block. If the highlighting jumps all over the
> screen, and/or outside the area you are trying to highlight,
> then this is what's happening.
>
> In that case I have sometimes done it by copying lots of little
> blocks, too small to provoke the effect. But this is very tedious.
>
> There are other things one can try, such as printing from the
> PDF file to a PostScript file, and then using a program like
> ps2ascii (which can deal directly with PDF) or pstotext; but frankly
> no such program is likely to make a good job of this, because of
> the way PS and PDF work.
>
> Sorry to appear unhelpful! But you may get somewhere.

Hmm, if this doesn't work you should have a look at pdftolpe, which is
supposed to convert arbitrary PDF files to some LPE-readable format.
LPE is a lightweight programmer's editor that should be able to save the
converted file in txt format.

I never used this myself, though. In case you are running Windows my
reply might not be of much help, sorry for that!
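
For what it's worth, once you have a plain text file by whichever route
(copy&paste, ps2ascii, pstotext, ...), getting it into R is the easy
part. A minimal sketch, assuming the text came through as
whitespace-separated columns with a header line; the sample values and
the file name converted.txt are made up for illustration:

```r
# Hypothetical sample: text as it might look after a clean copy&paste
# or conversion. The column names and values are invented.
pasted <- "
id  dose  response
1   0.5   12.1
2   1.0   15.8
3   2.0   21.4
"

# read.table() splits on whitespace by default and skips blank lines;
# header = TRUE takes the first non-blank line as the column names.
# For a real file you would use something like
#   dat <- read.table("converted.txt", header = TRUE)
dat <- read.table(text = pasted, header = TRUE)

str(dat)

# Quick sanity check: text blocks that arrived in the wrong order tend
# to show up as a wrong column count or as non-numeric columns.
stopifnot(ncol(dat) == 3, all(sapply(dat, is.numeric)))
```

If the check fails, or read.table() complains about the number of
elements in a line, the text probably arrived out of order as Ted
describes, and you are back to copying smaller blocks.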

good luck

Thomas



