[R] [External] Re: help with web scraping

William Michels wjm1 @end|ng |rom c@@@co|umb|@@edu
Sat Jul 25 19:58:12 CEST 2020


Dear Spencer Graves (and Rasmus Liland),

I've had some luck just using gsub() to alter the offending "<br/>"
tags, appending a "___" marker at each instance of "<br/>" (first I
checked the text to make sure it didn't contain any pre-existing
instances of "___"). See the output snippet below:

> library(RCurl)
> library(XML)
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- getURL(sosURL)
> sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> MOcan <- readHTMLTable(sosChars2)
> MOcan[[2]]
                  Name                          Mailing Address Random Number Date Filed
1       Raleigh Ritter      4476 FIVE MILE RD___SENECA MO 64865           185  2/25/2020
2          Mike Parson         1458 E 464 RD___BOLIVAR MO 65613           348  2/25/2020
3 James W. (Jim) Neely            PO BOX 343___CAMERON MO 64429           477  2/25/2020
4     Saundra McDowell 3854 SOUTH AVENUE___SPRINGFIELD MO 65807                3/31/2020
>

It's true, there's one 'section' of the MOcan output that contains
odd-looking characters (see the "Total" line of MOcan[[1]]). But my
guess is you'll be deleting that 'line' anyway--and recalculating
totals in R.
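For what it's worth, the "___" markers can be split back out into
separate address lines later on. An untested sketch against the MOcan
object above (the "Mailing Address" column name is as printed by
readHTMLTable):

    addr <- MOcan[[2]][["Mailing Address"]]
    ## each element becomes a character vector of address lines,
    ## one per original <br/> tag
    strsplit(addr, "___", fixed = TRUE)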

Now that you have a comprehensive list object, you should be able to
pull out districts/races of interest. You might want to take a look at
the "rlist" package, to see if it can make your work a little easier:

https://CRAN.R-project.org/package=rlist
https://renkun-ken.github.io/rlist-tutorial/index.html
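For example (an untested sketch, assuming you've collected the
candidate tables in a list like MOcan above), rlist can help stack the
per-race tables that share a common layout:

    library(rlist)
    ## keep only the data-frame elements, then row-bind the ones
    ## whose column names match the first table's
    tabs <- Filter(is.data.frame, MOcan)
    same <- Filter(function(x) identical(names(x), names(tabs[[1]])), tabs)
    allcan <- list.rbind(same)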

HTH, Bill.

W. Michels, Ph.D.


On Sat, Jul 25, 2020 at 7:56 AM Spencer Graves
<spencer.graves using effectivedefense.org> wrote:
>
> Dear Rasmus et al.:
>
>
> On 2020-07-25 04:10, Rasmus Liland wrote:
> > On 2020-07-24 10:28 -0500, Spencer Graves wrote:
> >> Dear Rasmus:
> >>
> >>> Dear Spencer,
> >>>
> >>> I unified the party tables after the
> >>> first summary table like this:
> >>>
> >>>     url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> >>>     M_sos <- RCurl::getURL(url)
> >>>     saveRDS(object=M_sos, file="dcp.rds")
> >>>     dat <- XML::readHTMLTable(M_sos)
> >>>     idx <- 2:length(dat)
> >>>     cn <- unique(unlist(lapply(dat[idx], colnames)))
> >> This is useful for this application.
> >>
> >>>     dat <- do.call(rbind,
> >>>       sapply(idx, function(i, dat, cn) {
> >>>         x <- dat[[i]]
> >>>         x[,cn[!(cn %in% colnames(x))]] <- NA
> >>>         x <- x[,cn]
> >>>         x$Party <- names(dat)[i]
> >>>         return(list(x))
> >>>       }, dat=dat, cn=cn))
> >>>     dat[,"Date Filed"] <-
> >>>       as.Date(x=dat[,"Date Filed"],
> >>>               format="%m/%d/%Y")
> >> This misses something extremely
> >> important for this application:  The
> >> political office.  That's buried in
> >> the HTML or whatever it is.  I'm using
> >> something like the following to find
> >> that:
> >>
> >> str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
> > Dear Spencer,
> >
> > I came up with a solution, but it is not
> > very elegant.  Instead of showing you
> > the solution, hoping you understand
> > everything in it, I want to give
> > you some emphatic hints to see if you
> > can come up with a solution on your own.
> >
> > - XML::htmlTreeParse(M_sos)
> >    - *Gandalf voice*: climb the tree
> >      until you find the content you are
> >      looking for flat out at the level of
> >      «The Children of the Div», *uuuUUU*
> >    - you only want to keep the table and
> >      header tags at this level
> > - Use XML::xmlValue to extract the
> >    values of all the headers (the
> >    political positions)
> > - Observe that all the tables on the
> >    page you were able to extract
> >    previously using XML::readHTMLTable,
> >    are at this level, shuffled between
> >    the political position header tags,
> >    this means you extract the political
> >    position and party affiliation by
> >    using a for loop, if statements,
> >    typeof, names, and [] and [[]] to grab
> >    different things from the list
> >    (content or the bag itself).
> >    XML::readHTMLTable strips away the
> >    line break tags from the Mailing
> >    address, so if you find a better way
> >    of extracting the tables, tell me,
> >    e.g. you get
> >
> >       8805 HUNTER AVEKANSAS CITY MO 64138
> >
> >    and not
> >
> >       8805 HUNTER AVE<br/>KANSAS CITY MO 64138
> >
> > When you've completed this «programming
> > quest», you're back at the level of the
> > previous email, i.e. you have the
> > same tables, but with political position
> > and party affiliation added to them.
>
>
>        Please excuse:  Before my last post, I had written code to do all
> that.  In brief, the political offices are "h3" tags.  I used "strsplit"
> to split the string at "<h3>".  I then wrote a function to find "</h3>",
> extract the political office and pass the rest to "XML::readHTMLTable",
> adding columns for party and political office.
>
>
>        However, this suppressed "<br/>" everywhere.  I thought there
> should be an option with something like "XML::readHTMLTable" that would
> not delete "<br/>" everywhere, but I couldn't find it.  If you aren't
> aware of one, I can gsub("<br/>", "\n", ...) on the string for each
> political office before passing it to "XML::readHTMLTable".  I just
> tested this:  It works.
>
>
>        I have other web scraping problems in my work plan for the next few
> days.  I will definitely try XML::htmlTreeParse, etc., as you suggest.
>
>
>        Thanks again.
>        Spencer Graves
> >
> > Best,
> > Rasmus
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>
>


