[R] help with web scraping

Spencer Graves spencer.graves sending from effectivedefense.org
Fri Jul 24 15:06:44 CEST 2020


Hi Bill et al.:


       That broke the dam:  It gave me a character vector of length 1 
holding 218 KB of HTML.  I fed that to XML::readHTMLTable and 
purrr::map_chr, both of which returned lists of 337 data.frames. The 
former retained names for all the tables;  the latter dropped them.  The 
columns of the former are all character;  that's not true for the latter.
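
       For anyone following along, here is a minimal sketch of that 
parsing step, not the exact code I ran.  The tiny HTML string, the table 
ids and the name "tabs_typed" are made up for illustration;  in practice 
the string would be the result of RCurl::getURL.  The last step converts 
the all-character columns with utils::type.convert.

```r
library(XML)

# Tiny stand-in for the large HTML string returned by RCurl::getURL();
# the table ids and contents here are made up for illustration.
html <- paste0(
  "<html><body>",
  "<table id='first'><tr><th>Name</th><th>Votes</th></tr>",
  "<tr><td>Ann</td><td>10</td></tr></table>",
  "<table id='second'><tr><th>Name</th><th>Votes</th></tr>",
  "<tr><td>Bob</td><td>20</td></tr></table>",
  "</body></html>"
)

# asText = TRUE says this is HTML content, not a file name or URL
doc <- htmlParse(html, asText = TRUE)
tabs <- readHTMLTable(doc)   # one data.frame per <table>

# Convert text columns to numeric etc. where the content allows it
tabs_typed <- lapply(tabs, function(df) {
  df[] <- lapply(df, function(col) {
    type.convert(as.character(col), as.is = TRUE)
  })
  df
})

str(tabs_typed)
```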


       Sadly, it's not quite what I want:  It's one table for each 
office-party combination, but it's lost the office designation. However, 
I'm confident I can figure out how to hack that.
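
       One possible way to do that hack, as a sketch only:  look up the 
nearest heading that precedes each table and attach it as a column.  This 
assumes the office names sit in <h2> elements ahead of their tables, which 
is a guess about the page's markup and not something I have checked;  the 
sample HTML and the names "office" and "combined" are hypothetical.

```r
library(XML)

# Made-up HTML: headings name the office, tables list the candidates.
# The real page may mark up the office names quite differently.
html <- paste0(
  "<html><body>",
  "<h2>Governor</h2>",
  "<table><tr><th>Name</th></tr><tr><td>Ann</td></tr></table>",
  "<table><tr><th>Name</th></tr><tr><td>Bob</td></tr></table>",
  "<h2>Senator</h2>",
  "<table><tr><th>Name</th></tr><tr><td>Cy</td></tr></table>",
  "</body></html>"
)

doc <- htmlParse(html, asText = TRUE)
tables <- getNodeSet(doc, "//table")

# For each table, take the text of the closest <h2> that comes before it
office <- vapply(tables, function(tb) {
  h <- xpathSApply(tb, "preceding::h2[1]", xmlValue)
  if (length(h)) h[[1]] else NA_character_
}, character(1))

# Read each table node and tag its rows with the office found above
pieces <- Map(function(tb, off) {
  df <- readHTMLTable(tb)
  df$office <- off
  df
}, tables, office)

combined <- do.call(rbind, pieces)
print(combined)
```

If the office turns out to live in some other element, only the XPath 
expression "preceding::h2[1]" should need to change.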


       Thanks,
       Spencer Graves


On 2020-07-23 17:46, William Michels wrote:
> Hi Spencer,
>
> I tried the code below on an older R-installation, and it works fine.
> Not a full solution, but it's a start:
>
>> library(RCurl)
> Loading required package: bitops
>> url <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>> M_sos <- getURL(url)
>> print(M_sos)
> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html
> lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
> Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
> content=\"width=device-width, initial-scale=1.0\" [...remainder
> truncated].
>
> HTH, Bill.
>
> W. Michels, Ph.D.
>
>
>
> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
> <spencer.graves using effectivedefense.org> wrote:
>> Hello, All:
>>
>>
>>         I've failed with multiple attempts to scrape the table of
>> candidates from the website of the Missouri Secretary of State:
>>
>>
>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975
>>
>>
>>         I've tried base::url, base::readLines, xml2::read_html, and
>> XML::readHTMLTable; see summary below.
>>
>>
>>         Suggestions?
>>         Thanks,
>>         Spencer Graves
>>
>>
>> sosURL <-
>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>
>> str(baseURL <- base::url(sosURL))
>> # this might give me something, but I don't know what
>>
>> sosRead <- base::readLines(sosURL) # 404 Not Found
>> sosRb <- base::readLines(baseURL) # 404 Not Found
>>
>> sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
>>
>> sosXML <- XML::readHTMLTable(sosURL)
>> # List of 0;  does not seem to be XML
>>
>> sessionInfo()
>>
>> R version 4.0.2 (2020-06-22)
>> Platform: x86_64-apple-darwin17.0 (64-bit)
>> Running under: macOS Catalina 10.15.5
>>
>> Matrix products: default
>> BLAS:
>> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib
>> LAPACK:
>> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats     graphics  grDevices utils     datasets
>> [6] methods   base
>>
>> loaded via a namespace (and not attached):
>> [1] compiler_4.0.2 tools_4.0.2    curl_4.3
>> [4] xml2_1.3.2     XML_3.99-0.3
>>


