[R] [External] Re: help with web scraping

Rasmus Liland jr@| @end|ng |rom po@teo@no
Fri Jul 24 16:16:18 CEST 2020


On 2020-07-24 08:20 -0500, luke-tierney using uiowa.edu wrote:
> On Fri, 24 Jul 2020, Spencer Graves wrote:
> > On 2020-07-23 17:46, William Michels wrote:
> > > On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> > > <spencer.graves using effectivedefense.org> wrote:
> > > > Hello, All:
> > > > 
> > > > I've failed with multiple 
> > > > attempts to scrape the table of 
> > > > candidates from the website of 
> > > > the Missouri Secretary of 
> > > > State:
> > > > 
> > > > https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
> > > 
> > > Hi Spencer,
> > > 
> > > I tried the code below on an older 
> > > R-installation, and it works fine.  
> > > Not a full solution, but it's a 
> > > start:
> > > 
> > > > library(RCurl)
> > > Loading required package: bitops
> > > > url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > > > M_sos <- getURL(url)
> > 
> > Hi Bill et al.:
> > 
> > That broke the dam:  It gave me a 
> > character vector of length 1 
> > consisting of 218 KB.  I fed that to 
> > XML::readHTMLTable and 
> > purrr::map_chr, both of which 
> > returned lists of 337 data.frames. 
> > The former retained names for all 
> > the tables, absent from the latter.  
> > The columns of the former are all 
> > character;  that's not true for the 
> > latter.
> > 
> > Sadly, it's not quite what I want:  
> > It's one table for each office-party 
> > combination, but it's lost the 
> > office designation. However, I'm 
> > confident I can figure out how to 
> > hack that.
> 
> Maybe try something like this:
> 
> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)
> tbl <- rvest::html_table(h)

Dear Spencer,

I combined the per-party tables (everything 
after the first summary table) into a 
single data frame like this:

	url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
	# Download the page as one character string and cache it
	M_sos <- RCurl::getURL(url)
	saveRDS(object=M_sos, file="dcp.rds")
	# Parse every HTML table into a list of data frames
	dat <- XML::readHTMLTable(M_sos)
	# Skip the first (summary) table
	idx <- 2:length(dat)
	# Union of the column names across the remaining tables
	cn <- unique(unlist(lapply(dat[idx], colnames)))
	dat <- do.call(rbind,
	  lapply(idx, function(i, dat, cn) {
	    x <- dat[[i]]
	    # Add missing columns as NA, then fix the column order
	    x[,cn[!(cn %in% colnames(x))]] <- NA
	    x <- x[,cn]
	    # Tag each row with the name of its source table
	    x$Party <- names(dat)[i]
	    x
	  }, dat=dat, cn=cn))
	dat[,"Date Filed"] <-
	  as.Date(x=dat[,"Date Filed"],
	          format="%m/%d/%Y")
	write.table(dat, file="dcp.tsv", sep="\t",
	            row.names=FALSE,
	            quote=TRUE, na="N/A")
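
In case it helps to see the column-padding 
step in isolation, here is a minimal sketch 
on toy data.  The two tables and every value 
in them are made up for illustration;  only 
the pattern (pad with NA, reorder, rbind, 
convert the date) mirrors the code above.

```r
# Two hypothetical tables with partly different columns (toy data)
tabs <- list(
  Republican = data.frame(Name = c("A. Smith", "B. Jones"),
                          `Date Filed` = c("2/25/2020", "3/3/2020"),
                          check.names = FALSE, stringsAsFactors = FALSE),
  Democratic = data.frame(Name = "C. Brown",
                          `Mailing Address` = "1 Main St",
                          `Date Filed` = "2/26/2020",
                          check.names = FALSE, stringsAsFactors = FALSE)
)

# Union of all column names across the tables
cn <- unique(unlist(lapply(tabs, colnames)))

# Pad each table with NA columns, reorder, and tag it with its list name
padded <- lapply(names(tabs), function(nm) {
  x <- tabs[[nm]]
  for (m in setdiff(cn, colnames(x))) x[[m]] <- NA
  x <- x[, cn]
  x$Party <- nm
  x
})

# Stack the padded tables and convert the date column
combined <- do.call(rbind, padded)
combined[, "Date Filed"] <- as.Date(combined[, "Date Filed"],
                                    format = "%m/%d/%Y")
print(combined)
```

Looping over names(tabs) rather than over 
integer positions keeps the label and the 
table together, which may be convenient if 
some tables are later dropped.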

Best,
Rasmus
