[R] Reading a web page in pdf format

Gabor Grothendieck ggrothendieck at gmail.com
Wed May 9 20:57:39 CEST 2007


Here is one additional solution, which produces a data frame.  The
regular expression removes:

- everything from the beginning to the first (
- everything from the last ) to the end
- everything between a ) and the next ( in the middle

The | characters separate these three alternatives in the pattern.
read.table then reads in the result.


URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
Lines.raw <- readLines(URL)
Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)

rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"
# replace each match with a space so the label and the number stay separate fields
read.table(textConnection(gsub(rx, " ", Lines)), dec = ",")
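
To see what the pattern does without fetching the PDF, here is a minimal
sketch on two made-up lines.  The sample lines, their layout, and the
column names label and value are assumptions for illustration only; the
real file may look different.  The replacement is a space rather than the
empty string so that the label and the number do not run together.

```r
# Hypothetical sample lines; the actual PDF content may differ.
Lines <- c(
  "BT 50 700 Td (Industriale) Tj 200 0 Td (45,3) Tj ET",
  "BT 50 680 Td (Termoelettrico) Tj 200 0 Td (78,9) Tj ET"
)

# Three alternatives: start up to the first "(", the last ")" to the end,
# and anything between a ")" and the next "(".
rx <- "^[^(]*[(]|[)][^(]*$|[)][^(]*[(]"

# Each match becomes a space, leaving whitespace-separated fields.
cleaned <- gsub(rx, " ", Lines)

# dec = "," makes read.table treat the comma as the decimal mark.
DF <- read.table(text = cleaned, dec = ",",
                 col.names = c("label", "value"),
                 stringsAsFactors = FALSE)
print(DF)
```

Here read.table(text = ...) is used in place of textConnection only to
keep the sketch short; either should work.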


On 5/9/07, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Modify this to suit.  After grepping out the correct lines we use strapply
> to find and emit character sequences that come after a "(" but do not
> contain a ")".  back = -1 says to emit only the backreferences and not the
> entire matched expression (which would have included the leading "("):
>
> URL <- "http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf"
> Lines.raw <- readLines(URL)
> Lines <- grep("Industriale|Termoelettrico", Lines.raw, value = TRUE)
> library(gsubfn)
> strapply(Lines, "[(]([^)]*)", back = -1, simplify = rbind)
>
> which gives a character matrix whose first column is the label
> and second column is the number in character form.  You can
> then manipulate it as desired.
>
> On 5/9/07, Vittorio <vdemart1 at tin.it> wrote:
> > Each day the daily balance at the following link is updated:
> >
> > http://www.snamretegas.it/italiano/business/gas/bilancio/pdf/bilancio.pdf
> >
> > I would like to set up an R procedure, to be run daily on a server,
> > that reads the figures in just two lines ("Industriale" and
> > "Termoelettrico", towards the end of the balance) and puts the data
> > in a table.
> >
> > Is that possible? If so, which R packages should I use?
> >
> > Ciao
> > Vittorio
>


