[R] [External] Re: help with web scraping

Rasmus Liland
Sun Jul 26 17:43:49 CEST 2020


Dear GRAVES et al.,

On 2020-07-25 12:43 -0500, Spencer Graves wrote:
> Dear Rasmus Liland et al.:
> 
> On 2020-07-25 11:30, Rasmus Liland wrote:
> > On 2020-07-25 09:56 -0500, Spencer Graves wrote:
> > > Dear Rasmus et al.:
> > 
> > It is LILAND et al., is it not?  ... else it's customary to
> > put a comma in there, isn't it? ...
> 
> The APA Style recommends "Sharp et al., 2007":
> 
> https://blog.apastyle.org/apastyle/2011/11/the-proper-use-of-et-al-in-apa-style.html

If "Sharp et al., 2007" is an APA 
citation of this book[*], then Sharp is 
John A Sharp's surname, just as Liland 
is my surname.  Q.E.D.

I have not used APA before (I am not a 
psychologist); the minimalism of 
IEEE[**] always seemed more desirable.  

> Regarding Confucius, I'm confused.

Never mind, I was just fooling around, 
that's all.

> > On 2020-07-25 04:10, Rasmus Liland wrote:
> > > 
> > > However, this suppressed "<br/>"
> > > everywhere.
> > 
> > Why is that?  Please explain.
> 
> I don't know why the Missouri 
> Secretary of State's web site includes 
> "<br/>" to signal a new line, but it 
> does.

Me neither!  On top of that, the 
self-closing form <br /> is actually[***] 
XHTML-style syntax; in plain HTML the 
tag is simply <br>.

> I also don't know why 
> XML::readHTMLTable suppressed "<br/>" 
> everywhere it occurred, but it did 
> that.

Yes, I know, I also observed this.  But 
now we swiftly solved this by gsubbing it 
with the newline char, "\n", which HTML 
parsers do not treat as markup anyway. 
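
For the archive, here is a minimal sketch 
of that gsub step.  The HTML fragment and 
the names in it are made up for 
illustration and are not taken from the 
Missouri page; the second, more 
permissive pattern is my own assumption 
and was not part of the thread.

```r
# Hypothetical HTML fragment standing in for one office's table.
html <- paste0(
  "<table>",
  "<tr><th>Name</th><th>Address</th></tr>",
  "<tr><td>Jane Doe</td><td>123 Main St<br/>Springfield, MO</td></tr>",
  "</table>"
)

# Replace the literal "<br/>" with a newline before parsing.
# fixed = TRUE treats the pattern as a plain string, not a regex.
patched <- gsub("<br/>", "\n", html, fixed = TRUE)

# Assumption: a looser pattern that also matches <br>, <br /> and <BR/>.
patched_any <- gsub("<br\\s*/?>", "\n", html,
                    ignore.case = TRUE, perl = TRUE)

# Parse only if the XML package happens to be installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(patched, asText = TRUE)
  tab <- XML::readHTMLTable(doc, stringsAsFactors = FALSE)[[1]]
  print(tab)
}
```

The replacement has to happen on the raw 
string, before the parser sees it, 
because afterwards the <br/> elements 
are already gone from the cell text.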

> > > If you aren't aware of one, I can
> > > gsub("<br/>", "\n", ...) on the string
> > > for each political office before
> > > passing it to "XML::readHTMLTable".  I
> > > just tested this:  It works.
> > 
> > Such a great hack!  IMHO, this is much
> > more flexible than using
> > xml2::read_html, rvest::html_table,
> > dplyr::mutate like here[1]
> > 
> > [1] https://stackoverflow.com/questions/38707669/how-to-read-an-html-table-and-account-for-line-breaks-within-cells
> 
> And I added my solution to this 
> problem to this Stackoverflow thread.

I wish you many upvotes, although the 
political competition is obviously not 
tough there, as the other guy just got 
one downvote.

[*] https://www.amazon.co.uk/Management-Student-Research-Project/dp/0566084902 
[**] https://pitt.libguides.com/citationhelp/ieee
[***] https://stackoverflow.com/questions/1946426/html-5-is-it-br-br-or-br
