[R] Getting data from a PDF-file into R

Mon Jan 26 16:40:09 CET 2009

joe1985 wrote:
> Hello
> 
> I have around 200 PDF documents containing data I want organized in R as a
> data frame. The PDF documents look like this:
> 
>   http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver.jpeg 
> 
> or like this;
> 
> http://www.nabble.com/file/p21667074/PRRS-billede%2Bmed%2Bfarver%2B2.jpeg 
> 
> So I want to pull out the data in the coloured boxes so that it becomes
> organized like this (just in R instead of Excel):
> 
> 
> http://www.nabble.com/file/p21667074/PRRS-billede%2Bexcel.jpeg 
> 
> The 0s and 1s encode the status on a particular date: "PRRS-neg" is
> represented by a 0 in both the PRRS-VAC and PRRS-DK columns, "PRRS-pos VAC"
> or "Vac" by a 1 in the PRRS-VAC column, and "PRRS-pos DK" or "DK" by a 1 in
> the PRRS-DK column. Likewise, "sanVAC" should give a 1 in the VACsan column
> and "sanDK" a 1 in the DKsan column. The first date for each "CHR-nr" should
> be either the earliest date in the red box (as in the first picture) or the
> date preceded by the word "før" (Danish for "before"), as in the second
> picture. All 200 PDF documents look like the ones in the pictures, each
> representing a different "CHR-nr".
> 
> 
> Hope you can help me

Not on the basis of .jpeg files, I think. We'd need some indication of
what the PDF looks like inside.  There's a tool called pdftotext, which
might do something for you, IF you can figure out reliably where your
data begin and end.
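
A rough sketch of how that might look from R, assuming pdftotext is
installed and on the PATH. The line layout is a guess, since the text inside
the PDFs is unknown: the parser assumes each record is a date written as
dd-mm-yyyy followed on the same line by one of the long status labels. The
function names pdf_to_lines and parse_status_lines and the output column
names are made up for illustration, and the short labels "Vac" and "DK"
would still need to be added to the pattern.

```r
# Convert one PDF to text with the external pdftotext tool and return the
# resulting lines. Assumes pdftotext is installed and on the PATH.
pdf_to_lines <- function(pdf_file) {
  txt_file <- tempfile(fileext = ".txt")
  # -layout asks pdftotext to keep the physical layout of the page
  status <- system2("pdftotext",
                    c("-layout", shQuote(pdf_file), shQuote(txt_file)))
  if (status != 0) stop("pdftotext failed for ", pdf_file)
  readLines(txt_file, warn = FALSE)
}

# Hypothetical parser: assumes a data line holds a date (dd-mm-yyyy)
# followed by a status label. Lines that do not match are ignored.
parse_status_lines <- function(lines) {
  pattern <- paste0("([0-9]{2}-[0-9]{2}-[0-9]{4}).*?",
                    "(PRRS-neg|PRRS-pos VAC|PRRS-pos DK|sanVAC|sanDK)")
  hits <- regmatches(lines, regexec(pattern, lines, perl = TRUE))
  hits <- hits[lengths(hits) == 3]  # keep full match + two capture groups
  date   <- vapply(hits, function(h) h[2], character(1))
  status <- vapply(hits, function(h) h[3], character(1))
  data.frame(
    date     = as.Date(date, format = "%d-%m-%Y"),
    PRRS_VAC = as.integer(status == "PRRS-pos VAC"),
    PRRS_DK  = as.integer(status == "PRRS-pos DK"),
    VACsan   = as.integer(status == "sanVAC"),
    DKsan    = as.integer(status == "sanDK")
  )  # "PRRS-neg" rows end up with 0 in every indicator column
}

# Possible use over a directory of PDFs (the directory name is a placeholder):
# files  <- list.files("pdfs", pattern = "\\.pdf$", full.names = TRUE)
# result <- do.call(rbind, lapply(files, function(f)
#   parse_status_lines(pdf_to_lines(f))))
```

Whether this works at all depends on how the dates and labels come out of
pdftotext, so it is worth inspecting the text output of one or two files
before trusting the pattern.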

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark      Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)              FAX: (+45) 35327907