[R] [External] Re: help with web scraping

Rasmus Liland jr@| @end|ng |rom po@teo@no
Sun Jul 26 17:43:59 CEST 2020


Dear William Michels,

On 2020-07-25 10:58 -0700, William Michels wrote:
> 
> Dear Spencer Graves (and Rasmus Liland),
> 
> I've had some luck just using gsub() 
> to alter the offending "<br/>" 
> characters, appending a "___" tag at 
> each instance of "<br/>" (first I 
> checked the text to make sure it 
> didn't contain any pre-existing 
> instances of "___"). See the output 
> snippet below:
> 
> > library(RCurl)
> > library(XML)
> > sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> > sosChars <- getURL(sosURL)
> > sosChars2 <- gsub("<br/>", "<br/>___", sosChars)
> > MOcan <- readHTMLTable(sosChars2)
> > MOcan[[2]]
>                   Name
> 1       Raleigh Ritter
> 2          Mike Parson
> 3 James W. (Jim) Neely
> 4     Saundra McDowell
>                            Mailing Address
> 1      4476 FIVE MILE RD___SENECA MO 64865
> 2         1458 E 464 RD___BOLIVAR MO 65613
> 3            PO BOX 343___CAMERON MO 64429
> 4 3854 SOUTH AVENUE___SPRINGFIELD MO 65807
>   Random Number Date Filed
> 1           185  2/25/2020
> 2           348  2/25/2020
> 3           477  2/25/2020
> 4                3/31/2020
> >
> 
> It's true, there's one 'section' of 
> MOcan output that contains odd-looking 
> characters (see the "Total" line of 
> MOcan[[1]]). But my guess is you'll be 
> deleting this 'line' anyway--and 
> recalculating totals in R.

Perhaps it's this table you mean?

	                Offices Republican
	1              Governor          4
	2   Lieutenant Governor          4
	3    Secretary of State          1
	4       State Treasurer          1
	5      Attorney General          1
	6   U.S. Representative         24
	7         State Senator         28
	8  State Representative        187
	9         Circuit Judge         18
	10                Total 268\r\n___
	   Democratic Libertarian    Green
	1           5           1        1
	2           2           1        1
	3           1           1        1
	4           1           1        1
	5           2           1        0
	6          16           9        0
	7          22           2        1
	8         137           6        2
	9           1           0        0
	10 187\r\n___   22\r\n___ 7\r\n___
	   Constitution      Total
	1             0         11
	2             0          8
	3             1          5
	4             0          4
	5             0          4
	6             0         49
	7             0         53
	8             1        333
	9             0         19
	10     2\r\n___ 486\r\n___

Yes, somehow the Windows[1] carriage 
return character "0xD" gets converted to 
"\r\n" after your gsub, and the "___" 
tag lands right after it.  The "<br/>" 
itself is still ignored.
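
If you want to keep the "Total" line 
rather than delete it, here is a minimal 
sketch of stripping the debris in base 
R.  The vector total_row is only a 
hypothetical stand-in for that row of 
MOcan[[1]], with the values copied from 
the output above.

```r
# Hypothetical stand-in for the "Total" row of MOcan[[1]];
# the values are copied from the printed table.
total_row <- c(Republican = "268\r\n___", Democratic = "187\r\n___",
               Libertarian = "22\r\n___", Green = "7\r\n___",
               Constitution = "2\r\n___", Total = "486\r\n___")

# Drop carriage returns, newlines and the "___" tag, then
# convert what is left to integers.  as.integer() drops the
# names, so they are put back afterwards.
total_clean <- as.integer(gsub("[\r\n_]", "", total_row))
names(total_clean) <- names(total_row)
total_clean
```

The five party totals should then add up 
to the cleaned "Total" value, which is a 
cheap sanity check on the scrape.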

There is not a "0xD" inside the 
td.AddressCol cells in the tables we are 
interested in.
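
So in those cells the "___" tag should 
be the only separator, and the address 
can be split on it.  A minimal sketch in 
base R, assuming every address has 
exactly one "___"; the column names 
Street and CityStateZip are just made up 
for illustration.

```r
# Addresses copied from the MOcan[[2]] output quoted above.
addr <- c("4476 FIVE MILE RD___SENECA MO 64865",
          "1458 E 464 RD___BOLIVAR MO 65613",
          "PO BOX 343___CAMERON MO 64429",
          "3854 SOUTH AVENUE___SPRINGFIELD MO 65807")

# Split each address at the literal "___" tag.  Every address
# here has exactly two parts, so rbind() gives a two-column
# character matrix.
parts <- do.call(rbind, strsplit(addr, "___", fixed = TRUE))

# Hypothetical column names, not taken from the web page.
address <- data.frame(Street = parts[, 1],
                      CityStateZip = parts[, 2],
                      stringsAsFactors = FALSE)
address
```

Addresses with more than one "<br/>" 
would give more than two parts, so the 
rbind step would need more care there.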

> Now that you have a comprehensive list 
> object, you should be able to pull out 
> districts/races of interest. You might 
> want to take a look at the "rlist" 
> package, to see if it can make your 
> work a little easier:
> 
> https://CRAN.R-project.org/package=rlist
> https://renkun-ken.github.io/rlist-tutorial/index.html

Thank you, this package seems useful.  

Could you give a hint as to which of the 
many functions you were thinking of?  
E.g. something to use instead of a for 
loop over the index of the list of 
headers and tables that checks whether 
each element is a list or a character, 
and updates variables to write the 
political position into each table.
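
For comparison, here is a rough sketch of 
that step without the explicit for loop, 
in base R only (I do not know which 
rlist function would replace it).  The 
list x is a hypothetical stand-in: 
character elements are headers, data 
frames are the tables that follow them, 
and the candidate names are 
placeholders.  It assumes the list 
starts with a header.

```r
# Hypothetical stand-in for the scraped list: a header
# (character) followed by the table (data frame) it describes.
x <- list("Governor",
          data.frame(Name = c("Candidate A", "Candidate B"),
                     stringsAsFactors = FALSE),
          "Lieutenant Governor",
          data.frame(Name = "Candidate C",
                     stringsAsFactors = FALSE))

# TRUE for headers, FALSE for tables.
is_header <- vapply(x, is.character, logical(1))

# cumsum() counts the headers seen so far, so each table picks
# up the most recent header before it.
office <- unlist(x[is_header])[cumsum(is_header)[!is_header]]

# Write the position into each table, then stack the tables.
tables <- Map(function(tab, pos) {
  data.frame(Position = pos, tab, stringsAsFactors = FALSE)
}, x[!is_header], office)
result <- do.call(rbind, tables)
result
```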

V

r

[1] https://stackoverflow.com/questions/5843495/what-does-m-character-mean-in-vim
