[R] [External] Re: help with web scraping

Spencer Graves spencer.graves at effectivedefense.org
Fri Jul 24 17:28:12 CEST 2020


Dear Rasmus:


On 2020-07-24 09:16, Rasmus Liland wrote:
> On 2020-07-24 08:20 -0500, luke-tierney at uiowa.edu wrote:
>> On Fri, 24 Jul 2020, Spencer Graves wrote:
>>> On 2020-07-23 17:46, William Michels wrote:
>>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>>>> <spencer.graves at effectivedefense.org> wrote:
>>>>> Hello, All:
>>>>>
>>>>> I've failed with multiple attempts to scrape the table of
>>>>> candidates from the website of the Missouri Secretary of State:
>>>>>
>>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>>> Hi Spencer,
>>>>
>>>> I tried the code below on an older R installation, and it works
>>>> fine. Not a full solution, but it's a start:
>>>>
>>>>> library(RCurl)
>>>> Loading required package: bitops
>>>>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>>> M_sos <- getURL(url)
>>> Hi Bill et al.:
>>>
>>> That broke the dam: it gave me a character vector of length 1,
>>> about 218 KB. I fed that to XML::readHTMLTable and purrr::map_chr,
>>> both of which returned lists of 337 data.frames. The former retained
>>> names for all the tables, absent from the latter. The columns of the
>>> former are all character; that's not true for the latter.
>>>
>>> Sadly, it's not quite what I want: it's one table for each
>>> office-party combination, but it's lost the office designation.
>>> However, I'm confident I can figure out how to hack that.
>> Maybe try something like this:
>>
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> h <- xml2::read_html(url)
>> tbl <- rvest::html_table(h)
> Dear Spencer,
>
> I unified the party tables after the
> first summary table like this:
>
> 	url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> 	M_sos <- RCurl::getURL(url)
> 	saveRDS(object=M_sos, file="dcp.rds")
> 	dat <- XML::readHTMLTable(M_sos)
> 	idx <- 2:length(dat)
> 	cn <- unique(unlist(lapply(dat[idx], colnames)))


      This is useful for this application.

> 	dat <- do.call(rbind,
> 	  sapply(idx, function(i, dat, cn) {
> 	    x <- dat[[i]]
> 	    x[,cn[!(cn %in% colnames(x))]] <- NA
> 	    x <- x[,cn]
> 	    x$Party <- names(dat)[i]
> 	    return(list(x))
> 	  }, dat=dat, cn=cn))
> 	dat[,"Date Filed"] <-
> 	  as.Date(x=dat[,"Date Filed"],
> 	          format="%m/%d/%Y")


      This misses something extremely important for this application:
the political office. That's buried in the raw HTML. I'm using
something like the following to find it:


str(LtGov <- gregexpr('Lieutenant Governor', M_sos)[[1]])
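
      For instance, a sketch along the same lines can locate several
offices at once. Here 'offices' is a hypothetical subset of the races
on that page, so adjust to taste; gregexpr() returns -1 for any name
not found in M_sos, and note that "Governor" will also match inside
"Lieutenant Governor":


offices <- c("Governor", "Lieutenant Governor", "Attorney General")
offPos <- sort(sapply(offices, function(o) gregexpr(o, M_sos)[[1]][1]))
offPos # character offset of the first hit for each office name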


      After I figure this out, I will use something like your code to
combine it all into separate tables for each office, and then probably
combine those into one table for the offices I'm interested in. For my
present purposes, I don't want all the offices in Missouri, only the
executive positions and those representing parts of the Kansas City
metro area in the Missouri legislature.
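
      One possible hack, sketched here under the unverified assumption
that each office heading precedes its candidate tables in the raw HTML:
locate every <table> tag, then give each table the nearest preceding
heading from offPos above.


tabPos <- gregexpr("<table", M_sos, ignore.case=TRUE)[[1]]
idx <- findInterval(tabPos, offPos) # 0 means before the first heading
tableOffice <- c(NA, names(offPos))[idx + 1]


      The entries of tableOffice should then line up with the list that
XML::readHTMLTable(M_sos) returns, at least where the match succeeds.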


      Thanks again,
      Spencer Graves

> 	write.table(dat, file="dcp.tsv", sep="\t",
> 	            row.names=FALSE,
> 	            quote=TRUE, na="N/A")
>
> Best,
> Rasmus




