[R] prevent XML::readHTMLTable from suppressing <br/>

Rasmus Liland jr@| @end|ng |rom po@teo@no
Sat Jul 25 12:20:43 CEST 2020


On 2020-07-24 22:59 -0500, Spencer Graves wrote:
> Hello, All:
> 
> Thanks to Rasmus Liland, William 
> Michels, and Luke Tierney with my 
> earlier web scraping question.  With 
> their help, I've made progress.  
> Sadly, I still have a problem:  One 
> field has "<br/>", which gets 
> suppressed by XML::readHTMLTable:
> 
> sosURL <- "https://s1.sos.mo.gov/CandidatesOnWeb/DisplayCandidatesPlacement.aspx?ElectionCode=750004975"
> sosChars <- RCurl::getURL(sosURL)
> MOcan <- XML::readHTMLTable(sosChars)
> MOcan[[2]][1, 2]
> [1] "4476 FIVE MILE RDSENECA MO 64865"
> 
> (Seneca <- regexpr('SENECA', sosChars))
> substring(sosChars, Seneca-22, Seneca+14)
> 
> [1] "4476 FIVE MILE RD<br/>SENECA MO 64865"
> 
> How can I get essentially the same 
> result but without having 
> XML::readHTMLTable suppress "<br/>"?
> 
> NOTE:  I get something very similar with xml2::read_html and
> rvest::html_table:
> 
> sosPointers <- xml2::read_html(sosChars)
> MOcan2 <- rvest::html_table(sosPointers)
> MOcan2[[2]][1, 2]
> [1] "4476 FIVE MILE RDSENECA MO 64865"
> 
> MOcan2 does not have names, and some 
> of the fields are automatically 
> converted to integers, which I think 
> is not smart in this application.

Yes, I observed this as well; see the 
challenge I posed to you in the earlier 
thread.

You could just patch the addresses 
yourself by re-inserting the line break 
in front of each city name:

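	# `tab` is assumed to be the parsed candidate table, i.e. a data
	# frame with a character column named "Mailing Address".
	cities <-
	  c("KANSAS CITY",
	    "SENECA MO")
	for (city in cities) {
	  # rows whose address contains this city
	  idx <- grepl(city, tab[,"Mailing Address"])
	  # split at the city name, then glue the pieces back together
	  # with a newline in front of the city
	  tab[idx,"Mailing Address"] <-
	    sapply(strsplit(tab[idx,"Mailing Address"], city), paste,
	      collapse=paste0("\n", city))
	}
	# report how many addresses still have no newline
	cat(sum(!grepl("\n", tab[,"Mailing Address"])),
	    "addresses left to hard-code a newline char into!", "\n")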

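The same post-hoc repair can be sketched without the loop, using a 
single regular expression that puts a newline in front of the first 
known city name.  This is only an illustration: the two-row `tab` is 
made-up sample data (the KANSAS CITY row is invented), and the column 
name "Mailing Address" is assumed to match the real table.

```r
# Made-up sample data standing in for the parsed table.
tab <- data.frame(
  `Mailing Address` = c("4476 FIVE MILE RDSENECA MO 64865",
                        "PO BOX 1KANSAS CITY MO 64101"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

cities <- c("KANSAS CITY", "SENECA MO")

# One alternation pattern covering every known city name.
pattern <- paste0("(", paste(cities, collapse = "|"), ")")

# Insert a newline in front of the first city name found in each address.
tab[, "Mailing Address"] <- sub(pattern, "\n\\1", tab[, "Mailing Address"])

# Count the addresses that still have no newline.
cat(sum(!grepl("\n", tab[, "Mailing Address"])),
    "addresses left without a newline\n")
```

Like the loop, this only helps for cities listed in `cities`; any 
other address is left untouched and shows up in the final count.
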
... I'm sure the post office can deliver 
your snail mail letters correctly even if 
you put the addresses in without the 
newline char; after all, the ZIP code is 
correct ... 
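
An alternative to patching the addresses afterwards is to rewrite the 
"<br/>" tags in the raw HTML before it is parsed, so the line break is 
already a newline character when the cell text is extracted.  The 
sketch below is an assumption-laden illustration, not a tested recipe 
for the live page: `br_to_newline` is a hypothetical helper (it is not 
part of XML or rvest), the small table is a stand-in built around the 
fragment quoted above, and whether the newline survives inside the 
cell should be checked against the real data.

```r
# Hypothetical helper: turn every <br>, <br/> or <br /> tag into a
# literal newline so the line break is not lost with the tag.
br_to_newline <- function(html) {
  gsub("<br\\s*/?>", "\n", html, ignore.case = TRUE, perl = TRUE)
}

# Stand-in table built around the fragment from the question;
# the candidate name is a placeholder.
snippet <- paste0(
  "<table><tr><th>Name</th><th>Mailing Address</th></tr>",
  "<tr><td>A CANDIDATE</td>",
  "<td>4476 FIVE MILE RD<br/>SENECA MO 64865</td></tr></table>"
)
fixed <- br_to_newline(snippet)

# Parse only if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  tab <- XML::readHTMLTable(fixed, stringsAsFactors = FALSE)[[1]]
  cat(tab[1, 2], "\n")
}
```

With the real page, the same idea would be 
`XML::readHTMLTable(br_to_newline(sosChars))`, with `sosChars` as 
fetched in the question.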
