[R] Extract lines from pdf files

Bert Gunter bgunter.4567 at gmail.com
Wed Nov 20 20:19:31 CET 2019


Don't do this (the timing code you showed, not Eric's suggestions).

Do this:

system.time({
    # code of interest
})

(or use the microbenchmark package functionality)
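
As a minimal sketch of the two options above: the function slow_sum() below is a hypothetical stand-in for the "code of interest", and the microbenchmark call is guarded so it only runs if that package happens to be installed.

```r
# Hypothetical stand-in for the "code of interest".
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + sqrt(i)
  total
}

# system.time() evaluates the expression once and returns the user,
# system, and elapsed times in seconds.
timing <- system.time({
  result <- slow_sum(1e6)
})
print(timing)

# Optional: microbenchmark repeats the expression several times and
# summarises the distribution of the timings.
if (requireNamespace("microbenchmark", quietly = TRUE)) {
  print(microbenchmark::microbenchmark(slow_sum(1e5), times = 10L))
}
```

Assignments made inside the braces (here, result) remain available afterwards, so wrapping existing code in system.time() should not change what it computes.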

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Nov 20, 2019 at 10:56 AM Jeff Reichman <reichmanj using sbcglobal.net>
wrote:

> Eric
>
> I will have to give that a try. Thanks.
> For an "it works" method I used ...
>
> start_time <- Sys.time()
>
>         insert code of interest
>
> end_time <- Sys.time()
> end_time - start_time
>
> -----Original Message-----
> From: R-help <r-help-bounces using r-project.org> On Behalf Of Eric Berger
> Sent: Wednesday, November 20, 2019 9:58 AM
> To: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> Cc: Thomas Subia <tgs77m using yahoo.com>; Thomas Subia via R-help
> <r-help using r-project.org>
> Subject: Re: [R] Extract lines from pdf files
>
> Hi Thomas,
> As Jeff wrote, your HTML email is difficult to read. This is a "plain text"
> forum.
> As for "pointers", here is one suggestion.
> Since you write that you can do the necessary actions with a specific file,
> try to write a function that carries out those actions for that same file.
> When implementing the function, however, replace any file-specific data
> with the value of an argument passed into the function.
> e.g.
> txt <- pdf_text("10619.pdf")
> would be replaced by
> txt <- pdf_text(pdfFile)
>
> and your function would have pdfFile as an argument, as in
>
> myfunc <- function( pdfFile )
>
> Since you can accomplish the task for this file without a function, you
> should be able to accomplish the task with a function.
> Once you succeed in doing that, you can then try passing the function
> arguments that refer to the other files you need to process.
>
> HTH,
> Eric
>
>
> On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
> wrote:
> >
> > Please don't spam the mailing list. Especially with HTML format messages.
> > See the Posting Guide.
> >
> > PDF is designed to present data graphically. It is literally possible to
> > place every character on the page in random order and still achieve this
> > visual readability while making the file nearly impossible to read
> > digitally. I have encountered many PDF files with the same text placed on
> > the page multiple times... again scrambling your ability to read it
> > digitally. Tools like "pdftools" can sometimes work when the program that
> > generated the file does so in a simple and extraction-friendly way... but
> > there are no guarantees, and your description suggests that it is likely
> > that you won't be able to accomplish your goal with this file.
> >
> > On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help
> > <r-help using r-project.org> wrote:
> > >
> > >Colleagues,
> > >
> > >
> > >
> > >I can extract specific data from lines in a pdf using:
> > >
> > >
> > >
> > >library(pdftools)
> > >
> > >pdf_text("10619.pdf")
> > >
> > >txt <- pdf_text("10619.pdf")
> > >
> > >write.table(txt,file="mydata.txt")
> > >
> > >con <- file('mydata.txt')
> > >
> > >open(con)
> > >
> > >serial <- read.table(con,skip=5,nrow=1) # Extract [3]
> > >
> > >flatness <- read.table(con,skip=11,nrow=1) # Extract [5]
> > >
> > >parallel1 <-read.table(con,skip=2,nrow=1)# Extract [5]
> > >
> > >parallel2 <-read.table(con,skip=4,nrow=1)# Extract [5]
> > >
> > >close(con)
> > >
> > >
> > >
> > ># note here that serial has 4 variables
> > >
> > ># flatness has 6 variables
> > >
> > ># parallel1 has 5 variables
> > >
> > ># parallel2 has 5 variables
> > >
> > >
> > >
> > ># this outputs the specific data I need
> > >
> > >serial[3]
> > >
> > >flatness[5]
> > >
> > >parallel1[5] # Note that the txt format shows 0.0007, not scientific
> > ># notation. Is there a way to format this to display the original data?
> > >
> > >parallel2[5] # Note that the txt format shows 0.0006, not scientific
> > ># notation. Is there a way to format this to display the original data?
> > >
> > >
> > >
> > >I'd like to extend this code to all of the pdf files in a directory
> > >and to generate a table of all the serial, flatness, parallel1
> > >and parallel2 data.
> > >
> > >I'm not having a lot of success trying to build the script for this.
> > >Some pointers would be appreciated.
> > >All the best.
> > >
> > >Thomas Subia
> > >
> > >Statistician / Senior Quality Engineer
> > >
> > >
> > >
> > >       [[alternative HTML version deleted]]
> > >
> > >______________________________________________
> > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > >https://stat.ethz.ch/mailman/listinfo/r-help
> > >PLEASE do read the posting guide
> > >http://www.R-project.org/posting-guide.html
> > >and provide commented, minimal, self-contained, reproducible code.
> >
> > --
> > Sent from my phone. Please excuse my brevity.
>
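
Eric's suggestion in the quoted thread (wrap the steps that work for one file in a function that takes pdfFile as an argument, then call it for each file) could be sketched roughly as follows. This is only an illustration under stated assumptions: the helper names extract_field(), extract_record(), and process_dir() are hypothetical, the (line, field) positions are placeholders that would have to be checked against the real files, and the sketch assumes the values of interest sit on the first page. It splits the page text directly rather than going through write.table() and read.table(), and it keeps the values as character strings so they appear as written in the PDF.

```r
# Pull one whitespace-separated field from one line of page text.
# `page_text` is a single string, e.g. one element of pdf_text() output.
extract_field <- function(page_text, line_no, field_no) {
  lines  <- strsplit(page_text, "\n", fixed = TRUE)[[1]]
  fields <- strsplit(trimws(lines[line_no]), "[[:space:]]+")[[1]]
  fields[field_no]
}

# Extract every value named in `positions` from one page of text.
# Values stay character, so "7e-04" is not reformatted as a number.
extract_record <- function(page_text, positions) {
  vapply(positions,
         function(p) extract_field(page_text, p[1], p[2]),
         character(1))
}

# Apply the extraction to every PDF in a directory and build one table.
# Requires the pdftools package; assumes the data are on page 1.
process_dir <- function(dir, positions) {
  files <- list.files(dir, pattern = "\\.pdf$", full.names = TRUE)
  rows  <- lapply(files, function(pdfFile) {
    txt <- pdftools::pdf_text(pdfFile)   # one string per page
    extract_record(txt[1], positions)
  })
  data.frame(file = basename(files), do.call(rbind, rows),
             stringsAsFactors = FALSE)
}

# Hypothetical positions, c(line number, field number) for each value.
# These are placeholders, not taken from the real files.
positions <- list(serial = c(2, 3), flatness = c(3, 5))

# Small made-up page text to try the extraction without a PDF.
page <- "Report header\nSerial No: 12345 x\nFlat 1 2 3 7e-04"
extract_record(page, positions)
```

With real files, the call would look something like process_dir("path/to/pdfs", positions). Given the caveats about PDF text extraction raised earlier in the thread, it would be prudent to inspect the output of pdf_text() for a few files before trusting fixed line and field positions.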
