[R] Extract lines from pdf files

Wed Nov 20 23:53:13 CET 2019

I think you are more likely to get a helpful answer if you give a minimal
example of what your lines look like. I certainly don't have a clue, though
maybe someone else will.

Cheers,
Bert

On Wed, Nov 20, 2019 at 12:21 PM Thomas Subia via R-help <
r-help using r-project.org> wrote:

> Thanks all for the help. I appreciate the feedback.
> I've developed another method to extract my desired data from multiple
> pdfs in a directory.
>
> # Combine all pdfs into a single pdf
> library(pdftools)
> files <- list.files(pattern = "pdf$")
> pdf_combine(files, output = "joined.pdf")
>
> # Create a text file from joined.pdf
> txt <- pdf_text("joined.pdf")
> write.table(txt,file="mydata.txt")
>
> # I need to extract the lines that begin with AMAT
> lines <- readLines("mydata.txt")
> date <- grep("AMAT",lines)
>
> # output for date looks like [1]   6  62 118 174 230 286 342 398
> # These are exactly the line positions I need.
>
> Now that I've got the desired lines, I don't know how to extract the data
> from those lines.
>
> Any advice would be appreciated.
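
Since grep() returns positions, indexing the vector with them (lines[date]) gives the matching lines themselves. A minimal sketch of pulling fields out of those lines follows. The sample lines and the choice of the third field are assumptions for illustration only, because the real layout of the AMAT lines is not shown in this thread.

```r
# Hypothetical sample standing in for readLines("mydata.txt").
lines <- c("header text",
           "AMAT  10619  2019-11-19  0.0007",
           "other text",
           "  AMAT  10620  2019-11-20  0.0006")

# Positions of lines that start with AMAT (leading blanks allowed)
idx <- grep("^[[:space:]]*AMAT", lines)

# The matching lines themselves
hits <- lines[idx]

# Split each matching line on runs of whitespace
fields <- strsplit(trimws(hits), "[[:space:]]+")

# Take the third whitespace-separated field of every matching line
third <- vapply(fields, function(f) f[3], character(1))
third
```

If the fields are not separated by whitespace, the split pattern would need to change, so a sample of the actual lines would still help.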
>
> All the best,
>
> Thomas Subia
> Statistician / Quality Engineer
> IMG Precision Inc.
>
> On Wednesday, November 20, 2019, 07:58:08 AM PST, Eric Berger <
> ericjberger using gmail.com> wrote:
>
> Hi Thomas,
> As Jeff wrote, your HTML email is difficult to read. This is a "plain
> text" forum.
> As for "pointers", here is one suggestion.
> Since you write that you can do the necessary actions with a specific
> file, try to write a function that carries out those actions for that
> same file.
> When implementing the function, however, replace any specific data with
> the value of an argument passed into the function.
> e.g.
> txt <- pdf_text("10619.pdf")
> would be replaced by
> txt <- pdf_text(pdfFile)
>
> and your function would have pdfFile as an argument, as in
>
> myfunc <- function( pdfFile )
>
> Since you can accomplish the task for this file without a function,
> you should be able to accomplish the task with a function.
> Once you succeed in doing that, you can then try passing the function
> arguments that refer to the other files you need to process.
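
A minimal sketch of that idea, under stated assumptions: the name extract_lines, the "AMAT" default pattern, and the reader argument are all hypothetical. The reader argument is there only so the function can be tried on plain strings without a PDF. pdf_text() returns one string per page, so each page is split on newlines first.

```r
# Hypothetical helper: returns the lines of one PDF that match `pattern`.
# `reader` defaults to pdftools::pdf_text and is only evaluated when used.
extract_lines <- function(pdfFile, pattern = "AMAT",
                          reader = pdftools::pdf_text) {
  pages <- reader(pdfFile)                              # one string per page
  lines <- unlist(strsplit(pages, "\n", fixed = TRUE))  # one element per text line
  hits  <- grep(pattern, lines, value = TRUE)           # keep matching lines
  data.frame(file = rep(pdfFile, length(hits)),
             line = trimws(hits),
             stringsAsFactors = FALSE)
}

# Apply the function to every PDF in the working directory and stack the
# results into one table (NULL when the directory contains no PDFs).
files  <- list.files(pattern = "\\.pdf$")
result <- do.call(rbind, lapply(files, extract_lines))
```

Working on one file at a time also avoids having to combine the PDFs first, and the file column records where each line came from.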
>
> HTH,
> Eric
>
>
> On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> wrote:
> >
> > Please don't spam the mailing list. Especially with HTML format
> > messages. See the Posting Guide.
> >
> > PDF is designed to present data graphically. It is literally possible to
> > place every character on the page in random order and still achieve
> > visual readability while making the file nearly impossible to read
> > digitally. I have encountered many PDF files with the same text placed on
> > the page multiple times, again scrambling your options for reading it
> > digitally. Tools like "pdftools" can sometimes work when the program that
> > generated the file does so in a simple and extraction-friendly way, but
> > there are no guarantees, and your description suggests that you likely
> > won't be able to accomplish your goal with this file.
> >
> > On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <
> > r-help using r-project.org> wrote:
> > >
> > >Colleagues,
> > >
> > >I can extract specific data from lines in a pdf using:
> > >
> > >library(pdftools)
> > >txt <- pdf_text("10619.pdf")
> > >write.table(txt,file="mydata.txt")
> > >con <- file('mydata.txt')
> > >open(con)
> > >serial <- read.table(con,skip=5,nrow=1) # Extract [3]
> > >flatness <- read.table(con,skip=11,nrow=1) # Extract [5]
> > >parallel1 <- read.table(con,skip=2,nrow=1) # Extract [5]
> > >parallel2 <- read.table(con,skip=4,nrow=1) # Extract [5]
> > >close(con)
> > >
> > >
> > >
> > ># Note that serial has 4 variables,
> > ># flatness has 6 variables,
> > ># parallel1 has 5 variables,
> > ># parallel2 has 5 variables.
> > >
> > ># This outputs the specific data I need:
> > >serial[3]
> > >flatness[5]
> > >parallel1[5] # Note that the txt format shows 0.0007, not scientific
> > >             # notation. Is there a way to display the original data?
> > >parallel2[5] # Note that the txt format shows 0.0006, not scientific
> > >             # notation. Is there a way to display the original data?
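
On the formatting question: read.table() converts numeric-looking fields to numbers, so the way they were written in the file is lost. A minimal sketch of two options follows. The sample line is an assumption for illustration only, because the actual contents of mydata.txt are not shown. Reading with colClasses = "character" keeps each field exactly as written, while format() re-renders a number in a chosen notation without necessarily reproducing the original digits.

```r
# Hypothetical line standing in for one row of mydata.txt.
raw <- "Parallelism A B C 7.0E-04"

# Default behaviour: column 5 is parsed as a number
as_number <- read.table(text = raw)

# Keep every field as text, exactly as it appears in the file
as_text <- read.table(text = raw, colClasses = "character")

as_text[[5]]                                  # the field as written in the file
format(as_number[[5]], scientific = TRUE)     # number rendered in scientific notation
format(as_number[[5]], scientific = FALSE)    # number rendered in fixed notation
```

Fields kept as text can still be converted later with as.numeric() when they are needed for calculations.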
> > >
> > >
> > >
> > >I'd like to extend this code to all of the pdf files in a directory and
> > >to generate a table of all the serial, flatness, parallel1 and parallel2
> > >data.
> > >
> > >I'm not having a lot of success trying to build the script for this.
> > >Some pointers would be appreciated.
> > >All the best.
> > >
> > >Thomas Subia
> > >Statistician / Senior Quality Engineer
> > >
> > >
> > >
> > >      [[alternative HTML version deleted]]
> > >
> > >______________________________________________
> > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Sent from my phone. Please excuse my brevity.
