[R] [External] Re: help with web scraping

luke-tierney at uiowa.edu
Fri Jul 24 15:20:09 CEST 2020


Maybe try something like this:

url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
h <- xml2::read_html(url)
tbl <- rvest::html_table(h)
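
If the office designation is needed alongside each table, one option is to read the nearest preceding heading for every table node. The following is only a minimal sketch: the inline HTML is a made-up stand-in, and the assumption that each table directly follows an <h3> naming the office is hypothetical, so the XPath would have to be adapted to the real page's markup.

```r
library(xml2)
library(rvest)

# Made-up stand-in for the real page (an assumption, not its actual markup):
# each table directly follows an <h3> that names the office.
html <- '<html><body>
<h3>Governor</h3>
<table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>Candidate A</td><td>Party X</td></tr>
  <tr><td>Candidate B</td><td>Party Y</td></tr>
</table>
<h3>Attorney General</h3>
<table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>Candidate C</td><td>Party X</td></tr>
</table>
</body></html>'

h <- read_html(html)
tbl_nodes <- xml_find_all(h, "//table")

# For each table, take the text of the nearest preceding sibling <h3>.
office <- xml_text(xml_find_first(tbl_nodes, "preceding-sibling::h3[1]"))

# Parse the tables and tag every row with its office before stacking them.
tbls <- html_table(tbl_nodes)
tagged <- Map(function(df, off) { df$Office <- off; df }, tbls, office)
combined <- do.call(rbind, tagged)
```

Tagging rows before rbind keeps the office even when several tables (one per party, say) share it, which naming the list elements alone would not handle cleanly.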

Best,

luke

On Fri, 24 Jul 2020, Spencer Graves wrote:

> Hi Bill et al.:
>
>
>       That broke the dam:  It gave me a character vector of length 1 
> consisting of 218 KB.  I fed that to XML::readHTMLTable and purrr::map_chr, 
> both of which returned lists of 337 data.frames. The former retained names 
> for all the tables, absent from the latter.  The columns of the former are 
> all character;  that's not true for the latter.
>
>
>       Sadly, it's not quite what I want:  It's one table for each 
> office-party combination, but it's lost the office designation. However, I'm 
> confident I can figure out how to hack that.
>
>
>       Thanks,
>       Spencer Graves
>
>
> On 2020-07-23 17:46, William Michels wrote:
>> Hi Spencer,
>> 
>> I tried the code below on an older R-installation, and it works fine.
>> Not a full solution, but it's a start:
>> 
>>> library(RCurl)
>> Loading required package: bitops
>>> url <- 
>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> M_sos <- getURL(url)
>>> print(M_sos)
>> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html
>> lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
>> Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
>> content=\"width=device-width, initial-scale=1.0\" [...remainder
>> truncated].
>> 
>> HTH, Bill.
>> 
>> W. Michels, Ph.D.
>> 
>> 
>> 
>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>> <spencer.graves using effectivedefense.org> wrote:
>>> Hello, All:
>>> 
>>>
>>>         I've failed with multiple attempts to scrape the table of
>>> candidates from the website of the Missouri Secretary of State:
>>> 
>>> 
>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>> 
>>>
>>>         I've tried base::url, base::readLines, xml2::read_html, and
>>> XML::readHTMLTable; see summary below.
>>> 
>>>
>>>         Suggestions?
>>>         Thanks,
>>>         Spencer Graves
>>> 
>>> 
>>> sosURL <-
>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>> 
>>> str(baseURL <- base::url(sosURL))
>>> # this might give me something, but I don't know what
>>> 
>>> sosRead <- base::readLines(sosURL) # 404 Not Found
>>> sosRb <- base::readLines(baseURL) # 404 Not Found
>>> 
>>> sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
>>> 
>>> sosXML <- XML::readHTMLTable(sosURL)
>>> # List of 0;  does not seem to be XML
>>> 
>>> sessionInfo()
>>> 
>>> R version 4.0.2 (2020-06-22)
>>> Platform: x86_64-apple-darwin17.0 (64-bit)
>>> Running under: macOS Catalina 10.15.5
>>> 
>>> Matrix products: default
>>> BLAS:
>>> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>>> LAPACK:
>>> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
>>> 
>>> locale:
>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>> 
>>> attached base packages:
>>> [1] stats     graphics  grDevices utils     datasets
>>> [6] methods   base
>>> 
>>> loaded via a namespace (and not attached):
>>> [1] compiler_4.0.2 tools_4.0.2    curl_4.3
>>> [4] xml2_1.3.2     XML_3.99-0.3
>>> 
>>> ______________________________________________
>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide 
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.

-- 
Luke Tierney
Ralph E. Wareham Professor of Mathematical Sciences
University of Iowa                  Phone:             319-335-3386
Department of Statistics and        Fax:               319-335-3017
    Actuarial Science
241 Schaeffer Hall                  email:   luke-tierney using uiowa.edu
Iowa City, IA 52242                 WWW:  http://www.stat.uiowa.edu

