[R] [External] Re: help with web scraping

Spencer Graves spencer.graves using effectivedefense.org
Fri Jul 24 15:58:33 CEST 2020



On 2020-07-24 08:20, luke-tierney using uiowa.edu wrote:
> Maybe try something like this:
>
> url <- 
> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> h <- xml2::read_html(url)


Error in open.connection(x, "rb") : HTTP error 404.


       Thanks for the suggestion, but this failed for me on the platform 
described in "sessionInfo" below.


> tbl <- rvest::html_table(h)


       As I previously noted, RCurl::getURL returned a single character 
string of roughly 218 KB, from which I've so far gotten most but not all 
of what I want.  Unfortunately, when I fed that character vector to 
rvest::html_table, I got:


Error in UseMethod("html_table") :
   no applicable method for 'html_table' applied to an object of class 
"character"

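       The error above arises because rvest::html_table has methods for 
parsed documents and nodes, not for raw character strings.  A minimal 
sketch of one possible workaround, assuming xml2 and rvest are installed: 
parse the string with xml2::read_html first, then extract the tables.  
The small HTML string below is a hypothetical stand-in for the text 
returned by RCurl::getURL; the real page's markup is not shown in this 
thread.

```r
library(xml2)
library(rvest)

# Hypothetical stand-in for the character string returned by
# RCurl::getURL(url); the actual page content is not reproduced here.
page_txt <- "<html><body><table>
  <tr><th>Name</th><th>Party</th></tr>
  <tr><td>A. Smith</td><td>Party A</td></tr>
  <tr><td>B. Jones</td><td>Party B</td></tr>
</table></body></html>"

# read_html() accepts literal HTML as well as a path or URL, so the
# string is parsed into an xml_document without a second download.
doc <- xml2::read_html(page_txt)

# html_table() on a document returns a list, one element per <table>.
tbls <- rvest::html_table(doc)
str(tbls[[1]])
```

       Whether this works on the full 218 KB string from the Missouri 
site is untested here.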

       I don't know for sure yet, but I believe I'll be able to get what 
I want from the single character string using, e.g., gregexpr and other 
functions.
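
       As a rough illustration of the gregexpr approach, here is a 
base-R sketch that pulls heading text out of an HTML string, which is 
one way the office names missing from the table list (see the quoted 
message below) might be recovered.  The <h3> tags and the sample 
offices and names are assumptions made for illustration only; the tags 
the actual page uses for office names would need to be checked.

```r
# Hypothetical HTML fragment: each office heading is followed by a table.
# The <h3> markup is an assumption, not taken from the real page.
html <- paste0(
  "<h3>Governor</h3><table><tr><td>A. Smith</td></tr></table>",
  "<h3>Attorney General</h3><table><tr><td>B. Jones</td></tr></table>"
)

# gregexpr() locates every match of the pattern in the string;
# regmatches() then extracts the matched substrings.
m <- gregexpr("<h3>[^<]*</h3>", html)
headings <- regmatches(html, m)[[1]]

# Strip the opening and closing tags to leave only the office names.
offices <- gsub("</?h3>", "", headings)
print(offices)
```

       The pattern assumes the headings contain no nested tags; 
anything more complicated would likely be easier with an HTML parser.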


       Thanks again,
       Spencer Graves

>
> Best,
>
> luke
>
> On Fri, 24 Jul 2020, Spencer Graves wrote:
>
>> Hi Bill et al.:
>>
>>
>>       That broke the dam:  It gave me a character vector of length 1 
>> consisting of 218 KB.  I fed that to XML::readHTMLTable and 
>> purrr::map_chr, both of which returned lists of 337 data.frames. The 
>> former retained names for all the tables, absent from the latter.  
>> The columns of the former are all character;  that's not true for the 
>> latter.
>>
>>
>>       Sadly, it's not quite what I want:  It's one table for each 
>> office-party combination, but it's lost the office designation. 
>> However, I'm confident I can figure out how to hack that.
>>
>>
>>       Thanks,
>>       Spencer Graves
>>
>>
>> On 2020-07-23 17:46, William Michels wrote:
>>> Hi Spencer,
>>>
>>> I tried the code below on an older R-installation, and it works fine.
>>> Not a full solution, but it's a start:
>>>
>>>> library(RCurl)
>>> Loading required package: bitops
>>>> url <- 
>>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
>>>> M_sos <- getURL(url)
>>>> print(M_sos)
>>> [1] "\r\n<!DOCTYPE html>\r\n\r\n<html
>>> lang=\"en-us\">\r\n<head><title>\r\n\tSOS, Missouri - Elections:
>>> Offices Filed in Candidate Filing\r\n</title><meta name=\"viewport\"
>>> content=\"width=device-width, initial-scale=1.0\" [...remainder
>>> truncated].
>>>
>>> HTH, Bill.
>>>
>>> W. Michels, Ph.D.
>>>
>>>
>>>
>>> On Thu, Jul 23, 2020 at 2:55 PM Spencer Graves
>>> <spencer.graves using effectivedefense.org> wrote:
>>>> Hello, All:
>>>>
>>>>
>>>>         I've failed with multiple attempts to scrape the table of
>>>> candidates from the website of the Missouri Secretary of State:
>>>>
>>>>
>>>> https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975 
>>>>
>>>>
>>>>
>>>>         I've tried base::url, base::readLines, xml2::read_html, and
>>>> XML::readHTMLTable; see summary below.
>>>>
>>>>
>>>>         Suggestions?
>>>>         Thanks,
>>>>         Spencer Graves
>>>>
>>>>
>>>> sosURL <-
>>>> "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975" 
>>>>
>>>>
>>>> str(baseURL <- base::url(sosURL))
>>>> # this might give me something, but I don't know what
>>>>
>>>> sosRead <- base::readLines(sosURL) # 404 Not Found
>>>> sosRb <- base::readLines(baseURL) # 404 Not Found
>>>>
>>>> sosXml2 <- xml2::read_html(sosURL) # HTTP error 404.
>>>>
>>>> sosXML <- XML::readHTMLTable(sosURL)
>>>> # List of 0;  does not seem to be XML
>>>>
>>>> sessionInfo()
>>>>
>>>> R version 4.0.2 (2020-06-22)
>>>> Platform: x86_64-apple-darwin17.0 (64-bit)
>>>> Running under: macOS Catalina 10.15.5
>>>>
>>>> Matrix products: default
>>>> BLAS:
>>>> /System/Library/Frameworks/Accelerate.framework/Versions/A/Frameworks/vecLib.framework/Versions/A/libBLAS.dylib 
>>>>
>>>> LAPACK:
>>>> /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib 
>>>>
>>>>
>>>> locale:
>>>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>>>
>>>> attached base packages:
>>>> [1] stats     graphics  grDevices utils     datasets
>>>> [6] methods   base
>>>>
>>>> loaded via a namespace (and not attached):
>>>> [1] compiler_4.0.2 tools_4.0.2    curl_4.3
>>>> [4] xml2_1.3.2     XML_3.99-0.3
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide 
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>
>>
>


