[R] Reading a web page in pdf format

Vittorio De Martino vittorio at de-martino.it
Thu May 10 15:54:36 CEST 2007


Great! It's a wonderful mailing list full of helpful people!
Thanks to all of you.

Vittorio

On Wednesday 09 May 2007 18:57:39 Gabor Grothendieck wrote:
> Here is one additional solution.  This one produces a data frame.  The
> regular expression removes:
>
> - everything from beginning to first (
> - everything from last ) to end
> - everything between ) and ( in the middle
>
> The | characters separate the three parts.  Then read.table reads it in.
>
>
> URL <-
> "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
> Lines.raw <- readLines(URL)
> Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
>
> rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
> read.table(textConnection(gsub(rx, "", Lines)), dec = ",")
>
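The regular expression above can be tried offline. What follows is only a minimal sketch: the two input lines are made up to imitate text-drawing operators in a PDF and are not taken from the real file, which is not shown in this thread. The sketch also assumes each label carries a trailing space inside its parentheses; without it, the label and the number would be joined into a single field.

```r
# Hypothetical sample lines (an assumption, not the real PDF content).
Lines <- c("BT 10 20 Td (Industriale ) Tj 200 0 Td (45,3) Tj ET",
           "BT 10 40 Td (Termoelettrico ) Tj 200 0 Td (78,9) Tj ET")

# Same pattern as in the message above: three alternatives joined by "|".
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# Delete every match, leaving only the text that was inside the parentheses.
cleaned <- gsub(rx, "", Lines)

# Parse the remaining "label number" pairs; dec = "," handles decimal commas.
df <- read.table(text = cleaned, dec = ",", stringsAsFactors = FALSE)
print(df)
```

Here read.table(text = ...) stands in for the textConnection() call used in the message; either form should work on the cleaned strings.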
> On 5/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> > Modify this to suit.  After grepping out the correct lines we use
> > strapply to find and emit character sequences that come after a "(" but
> > do not contain a ")" .  back = -1 says to only emit the backreferences
> > and not the entire matched expression (which would have included the
> > leading "(" ):
> >
> > URL <-
> > "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
> > Lines.raw <- readLines(URL)
> > Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
> > library(gsubfn)
> > strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)
> >
> > which gives a character matrix whose first column is the label
> > and second column is the number in character form.  You can
> > then manipulate it as desired.
> >
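For readers without gsubfn installed, roughly the same extraction can be sketched in base R with gregexpr() and regmatches(). This is an alternative technique, not the strapply() call from the message, and the input lines are again made-up stand-ins for the real PDF content.

```r
# Hypothetical sample lines (an assumption, not the real PDF content).
Lines <- c("BT 10 20 Td (Industriale ) Tj 200 0 Td (45,3) Tj ET",
           "BT 10 40 Td (Termoelettrico ) Tj 200 0 Td (78,9) Tj ET")

# Find every "(" followed by a run of characters that are not ")".
m <- regmatches(Lines, gregexpr("[(][^)]*", Lines))

# Drop the leading "(" from each match, then stack one row per input line.
fields <- lapply(m, function(x) sub("^[(]", "", x))
mat <- do.call(rbind, fields)

# First column: label; second column: number in character form.
labels <- trimws(mat[, 1])
values <- as.numeric(sub(",", ".", mat[, 2], fixed = TRUE))
print(data.frame(label = labels, value = values))
```

The rbind step assumes every line yields the same number of parenthesised fields; lines with a different count would need separate handling.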
> > On 5/9/07, Vittorio <vdemart1 at tin.it> wrote:
> > > Each day the daily balance at the following link
> > >
> > > http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> > >
> > > is updated.
> > >
> > > I would like to set up an R procedure, to be run daily on a
> > > server, that reads the figures from just two lines
> > > ("Industriale" and "Termoelettrico", towards the end of the balance)
> > > and puts the data in a table.
> > >
> > > Is that possible? If so, which R packages should I use?
> > >
> > > Ciao
> > > Vittorio
> > >
> > > ______________________________________________
> > > R-help at stat.math.ethz.ch mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide
> > > http://www.R-project.org/posting-guide.html and provide commented,
> > > minimal, self-contained, reproducible code.


