[Rd] as.Date (and strptime?) does not recognize "&nbsp;" as a blank

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Thu Jul 7 19:42:34 CEST 2022


It depends a bit on what you mean by "automatically". This seems to work for
me (note this has NOT been extensively tested on different OSes or even in
different locales/encodings):

library(XML)
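# Build the HTML in pieces so that mail line-wrapping cannot sneak a newline
# into a cell: the first cell holds a non-breaking space, the second a plain one.
myhtml <- paste0("<html><body><table id='hiya'>",
                 "<tr><th>colname</th></tr>",
                 "<tr><td>&nbsp;</td></tr>",
                 "<tr><td> </td></tr>",
                 "</table></body></html>")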
doc <- htmlParse(myhtml, asText = TRUE)
oldway <- readHTMLTable(doc, trim = FALSE)

identical(oldway$hiya$colname[1], oldway$hiya$colname[2]) # FALSE :(

decode_nbsp <- function(x) gsub(rawToChar(as.raw(c(0xc2, 0xa0))), " ", x,
fixed = TRUE, useBytes = TRUE)
fancypants <- function(node) decode_nbsp(xmlValue(node))
newandfancy <- readHTMLTable(doc, trim = FALSE, elFun = fancypants)

identical(newandfancy$hiya$colname[1], newandfancy$hiya$colname[2]) # TRUE :D
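
For the as.Date() problem itself, the same substitution can be applied
directly to the character vector before parsing. A minimal sketch in base R,
assuming the only offending character is U+00A0 and the strings are UTF-8
(nbsp_to_space is a made-up helper name, not something from a package):

```r
# Hypothetical helper: replace U+00A0 (no-break space) with an ordinary space.
nbsp_to_space <- function(x) gsub("\u00a0", " ", x, fixed = TRUE)

# Stand-in for what HTML-decoding "&nbsp;2 Mar 2018" yields.
pblmDate <- "\u00a02 Mar 2018"

# Diagnosing: the first character prints like a space but is code point 160.
utf8ToInt(substr(pblmDate, 1, 1))

fixed <- nbsp_to_space(pblmDate)
identical(fixed, " 2 Mar 2018")

# Parsing "%b" depends on the locale's month abbreviations.
as.Date(fixed, format = "%e %b %Y")
```

Checking utf8ToInt() (or charToRaw()) on a suspicious "blank" is a quick way
to tell whether it is really a plain space (code point 32).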

Best,
~G

On Fri, Jun 24, 2022 at 11:48 PM Spencer Graves <spencer.graves using prodsyse.com>
wrote:

> p.s.  Is there a way to get XML::readHTMLTable to automatically convert
> "&nbsp;" to a normal blank space?
>
>
> On 6/25/22 1:37 AM, Spencer Graves wrote:
> > Hello, All:
> >
> >
> >        When is a space not a space?
> >
> >
> >        Consider the following:
> >
> >
> >  > (pblmDate <- textutils::HTMLdecode("&nbsp;2 Mar 2018"))
> > [1] " 2 Mar 2018"
> >  > as.Date(pblmDate, format='%e %b %Y')
> > [1] NA
> >  > as.Date(' 2 Mar 2018', format='%e %b %Y')
> > [1] "2018-03-02"
> >
> >
> >        Is this a feature or a bug?
> >
> >
> >        I can work around it, now that I know what it is, but it took me
> > a few hours to diagnose.
> >
> >
> >        Thanks,
> >        Spencer Graves
> >
> >
> > p.s.  I got this from scraping a website with code that had worked for
> > me roughly 20 months ago.  I suspect that in the interim, someone
> > probably replaced ' 2 Mar 2018' with "&nbsp;2 Mar 2018".
> >
> > ______________________________________________
> > R-devel using r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-devel
