[R] Extract lines from pdf files

Bert Gunter bgunter.4567 using gmail.com
Wed Nov 20 20:24:52 CET 2019


Rather, system.time(), of course; the sys.time() in my message below was a typo.
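
For anyone reading this in the archive, here is a minimal sketch of the system.time() form. The loop is a hypothetical stand-in workload that only takes the place of the code of interest; it is not from the original poster's script.

```r
# Time a block of code with system.time(); the braces let it wrap
# several statements. The workload below is a placeholder.
timing <- system.time({
  total <- 0
  for (i in 1:100000) total <- total + sqrt(i)
})

# system.time() returns a "proc_time" object; "elapsed" is wall-clock seconds.
print(timing[["elapsed"]])
```

Unlike subtracting two Sys.time() values, this also reports user and system CPU time, and objects assigned inside the braces (here `total`) remain available afterwards.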

Bert Gunter

"The trouble with having an open mind is that people keep coming along and
sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Wed, Nov 20, 2019 at 11:19 AM Bert Gunter <bgunter.4567 using gmail.com> wrote:

> Don't do this (the timing code you showed, not Eric's suggestions).
>
> Do this:
>
> sys.time( {
>     code of interest
> })
>
> (or use the microbenchmark package functionality)
>
> Bert Gunter
>
>
> On Wed, Nov 20, 2019 at 10:56 AM Jeff Reichman <reichmanj using sbcglobal.net> wrote:
>
>> Eric
>>
>> I will have to give that a try. Thanks.
>> For an "it works" method I used:
>>
>> start_time <- Sys.time()
>>
>>         insert code of interest
>>
>> end_time <- Sys.time()
>> end_time - start_time
>>
>> -----Original Message-----
>> From: R-help <r-help-bounces using r-project.org> On Behalf Of Eric Berger
>> Sent: Wednesday, November 20, 2019 9:58 AM
>> To: Jeff Newmiller <jdnewmil using dcn.davis.ca.us>
>> Cc: Thomas Subia <tgs77m using yahoo.com>; Thomas Subia via R-help
>> <r-help using r-project.org>
>> Subject: Re: [R] Extract lines from pdf files
>>
>> Hi Thomas,
>> As Jeff wrote, your HTML email is difficult to read. This is a "plain text"
>> forum.
>> As for "pointers", here is one suggestion.
>> Since you write that you can do the necessary actions with a specific file,
>> try to write a function that carries out those actions for that same file.
>> When implementing the function, however, replace any specific data with the
>> value of an argument passed into the function.
>> e.g.
>> txt <- pdf_text("10619.pdf")
>> would be replaced by
>> txt <- pdf_text(pdfFile)
>>
>> and your function would have pdfFile as an argument, as in
>>
>> myfunc <- function( pdfFile )
>>
>> Since you can accomplish the task for this file without a function, you
>> should be able to accomplish the task with a function.
>> Once you have succeeded in doing that, you can try passing the function
>> arguments that refer to the other files you need to process.
>>
>> HTH,
>> Eric
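
A minimal sketch of the function-wrapping approach described above, not code from the thread. The helper names `field_at` and `process_all`, the `reader` argument, and the line and field positions are all hypothetical placeholders; the real positions depend on the layout of the PDF files. It assumes that pdftools::pdf_text() returns one character string per page.

```r
# Hypothetical helper: return the field_no-th whitespace-separated field of
# line line_no. Values stay character, so the text is kept exactly as extracted.
field_at <- function(lines, line_no, field_no) {
  fields <- strsplit(trimws(lines[line_no]), "\\s+")[[1]]
  fields[field_no]
}

# Per-file function: the file name is an argument, not a hard-coded string.
# The line/field positions are placeholders and must be adjusted.
myfunc <- function(pdfFile, reader = pdftools::pdf_text) {
  txt <- reader(pdfFile)                               # one string per page
  lines <- strsplit(txt[1], "\n", fixed = TRUE)[[1]]   # lines of page 1
  data.frame(
    file     = basename(pdfFile),
    serial   = field_at(lines, 1, 3),
    flatness = field_at(lines, 2, 5),
    stringsAsFactors = FALSE
  )
}

# Apply the function to every file and stack the one-row results.
process_all <- function(pdfFiles, ...) {
  do.call(rbind, lapply(pdfFiles, myfunc, ...))
}
```

With pdftools installed, a call might look like process_all(list.files("some_dir", pattern = "\\.pdf$", full.names = TRUE)), where "some_dir" is a placeholder directory name.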
>>
>>
>> On Wed, Nov 20, 2019 at 1:09 AM Jeff Newmiller <jdnewmil using dcn.davis.ca.us> wrote:
>> >
>> > Please don't spam the mailing list, especially with HTML-format messages.
>> > See the Posting Guide.
>> >
>> > PDF is designed to present data graphically. It is literally possible to
>> > place every character on the page in random order and still achieve visual
>> > readability while making the file nearly impossible to read
>> > programmatically. I have encountered many PDF files with the same text
>> > placed on the page multiple times, again scrambling any attempt to read it
>> > digitally. Tools like "pdftools" can sometimes work when the program that
>> > generated the file does so in a simple and extraction-friendly way, but
>> > there are no guarantees, and your description suggests that you likely
>> > won't be able to accomplish your goal with this file.
>> >
>> > On November 19, 2019 11:52:20 PM GMT+01:00, Thomas Subia via R-help <r-help using r-project.org> wrote:
>> > >
>> > >Colleagues,
>> > >
>> > >I can extract specific data from lines in a pdf using:
>> > >
>> > >library(pdftools)
>> > >pdf_text("10619.pdf")
>> > >txt <- pdf_text("10619.pdf")
>> > >write.table(txt,file="mydata.txt")
>> > >con <- file('mydata.txt')
>> > >open(con)
>> > >serial <- read.table(con,skip=5,nrow=1) # Extract [3]
>> > >flatness <- read.table(con,skip=11,nrow=1) # Extract [5]
>> > >parallel1 <- read.table(con,skip=2,nrow=1) # Extract [5]
>> > >parallel2 <- read.table(con,skip=4,nrow=1) # Extract [5]
>> > >close(con)
>> > >
>> > ># note here that serial has 4 variables
>> > ># flatness has 6 variables
>> > ># parallel1 has 5 variables
>> > ># parallel2 has 5 variables
>> > >
>> > ># this outputs the specific data I need
>> > >serial[3]
>> > >flatness[5]
>> > >parallel1[5] # Note: the txt format shows 0.0007, not scientific notation.
>> > >             # Is there a way to format this to display the original data?
>> > >parallel2[5] # Note: the txt format shows 0.0006, not scientific notation.
>> > >             # Is there a way to format this to display the original data?
>> > >
>> > >I'd like to extend this code to all of the pdf files in a directory
>> > >and to generate a table of all the serial, flatness, parallel1
>> > >and parallel2 data.
>> > >
>> > >I'm not having a lot of success trying to build the script for this.
>> > >Some pointers would be appreciated.
>> > >All the best.
>> > >
>> > >Thomas Subia
>> > >
>> > >Statistician / Senior Quality Engineer
>> > >
>> > >
>> > >
>> > >       [[alternative HTML version deleted]]
>> > >
>> > >______________________________________________
>> > >R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > >https://stat.ethz.ch/mailman/listinfo/r-help
>> > >PLEASE do read the posting guide
>> > >http://www.R-project.org/posting-guide.html
>> > >and provide commented, minimal, self-contained, reproducible code.
>> >
>> > --
>> > Sent from my phone. Please excuse my brevity.
>>
>
