[R] Getting data from a PDF-file into R

hadley wickham h.wickham at gmail.com
Mon Jan 26 16:43:47 CET 2009


On Mon, Jan 26, 2009 at 9:40 AM, Peter Dalgaard
<P.Dalgaard at biostat.ku.dk> wrote:
> joe1985 wrote:
>> Hello
>>
>> I have around 200 PDF documents containing data I want organized in R as a
>> data frame. The PDF documents look like this:
>>
>>   http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver.jpeg
>>
>> or like this;
>>
>> http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver%2B2.jpeg
>>
>> I want to pull out the data in the coloured boxes so that it becomes
>> organized like this (just in R instead of Excel):
>>
>>
>> http://www.nabble.com/file/p21667074/PRRS-billede%2Bexcel.jpeg
>>
>> The 0s and 1s work as follows: "PRRS-neg" on a particular date is
>> represented by a 0 in both the PRRS-VAC and PRRS-DK columns. "PRRS-pos VAC"
>> or "Vac" is represented by a 1 in the PRRS-VAC column, and "PRRS-pos DK" or
>> "DK" by a 1 in the PRRS-DK column. Likewise, "sanVAC" should give a 1 in the
>> VACsan column, and "sanDK" a 1 in the DKsan column. The first date for each
>> "CHR-nr" should be either the earliest date in the red box (as in the first
>> picture) or the date preceded by the word "før" (as in the second picture).
>> All 200 PDF documents look like the ones in the pictures, each representing
>> a different "CHR-nr".
>>
>>
>> Hope you can help me
>
> Not on the basis of .jpeg files, I think. We'd need some indication of
> what the PDF looks like inside.  There's a tool called pdftotext, which
> might do something for you, IF you can figure out reliably where your
> data begin and end.
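
If you do try the pdftotext route, the following is a rough sketch in R,
not a tested solution. It assumes pdftotext is installed and on your PATH,
that each status appears on the same line as a date written dd-mm-yyyy, and
that the file name can stand in for the "CHR-nr"; none of that can be
checked from the .jpeg files. The function names parse_prrs_lines() and
read_prrs_pdf() are made up for illustration, and the "før" rule for the
first date is not handled.

```r
# Hypothetical helper: turn the text lines of one document into a data frame.
# ASSUMPTION: every status line contains a date written dd-mm-yyyy plus one
# of the labels from the original post. The real layout is unknown.
parse_prrs_lines <- function(lines, chr_nr = NA_character_) {
  date_pat <- "[0-9]{2}-[0-9]{2}-[0-9]{4}"
  lines <- lines[grepl(date_pat, lines)]        # keep only lines with a date
  dates <- regmatches(lines, regexpr(date_pat, lines))
  data.frame(
    CHR_nr   = rep(chr_nr, length(lines)),
    date     = as.Date(dates, format = "%d-%m-%Y"),
    # \\b stops "VAC"/"DK" from matching inside "sanVAC"/"sanDK"
    PRRS_VAC = as.integer(grepl("\\bvac\\b", lines, ignore.case = TRUE,
                                perl = TRUE)),
    PRRS_DK  = as.integer(grepl("\\bDK\\b", lines, perl = TRUE)),
    VACsan   = as.integer(grepl("sanVAC", lines, fixed = TRUE)),
    DKsan    = as.integer(grepl("sanDK", lines, fixed = TRUE)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical wrapper: convert one PDF to text with pdftotext, then parse it.
# ASSUMPTION: the file name (without extension) identifies the CHR-nr.
read_prrs_pdf <- function(pdf) {
  txt <- tempfile(fileext = ".txt")
  on.exit(unlink(txt))
  status <- system2("pdftotext", c("-layout", shQuote(pdf), shQuote(txt)))
  if (status != 0) stop("pdftotext failed for ", pdf)
  parse_prrs_lines(readLines(txt, warn = FALSE),
                   chr_nr = tools::file_path_sans_ext(basename(pdf)))
}

# Possible use over a folder of PDFs (the folder name "pdfs" is an assumption):
# files  <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# result <- do.call(rbind, lapply(files, read_prrs_pdf))
```

A "PRRS-neg" line comes out as 0 in every column simply because none of the
patterns match it. Whether the date-per-line assumption holds is exactly the
"where do the data begin and end" question, so inspect the pdftotext output
for a few files before trusting any of this.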

An alternative is to outsource the problem.  You can get very
reasonable data entry quotes from sites like http://www.elance.com/,
and depending on how much you value your time this might end up being
a much cheaper option than figuring out how to do it programmatically
(but not as intellectually satisfying).

Hadley

-- 
http://had.co.nz/



