[Rd] URL checks

Viechtbauer, Wolfgang (SP) wolfgang.viechtbauer at maastrichtuniversity.nl
Fri Jan 8 14:50:14 CET 2021


Instead of a separate file to store such a list, would it be an idea to add versions of the \href{}{} and \url{} markup commands that are skipped by the URL checks?
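
A rough sketch of what such markup could look like in an Rd file -- the command names below (\urlnocheck, \hrefnocheck) are purely hypothetical and do not exist in R:

  \url{https://cran.r-project.org/}                % checked as today
  \urlnocheck{https://relational.fit.cvut.cz/}     % hypothetical: skipped by the URL checks
  \hrefnocheck{https://www.oracle.com/}{Oracle}    % hypothetical: likewise for \href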

Best,
Wolfgang

>-----Original Message-----
>From: R-devel [mailto:r-devel-bounces@r-project.org] On Behalf Of Spencer
>Graves
>Sent: Friday, 08 January, 2021 13:04
>To: r-devel@r-project.org
>Subject: Re: [Rd] URL checks
>
>	  I too would be glad to be able to provide "a list of known
>false-positive/exceptions" to the URL tests.  I've been challenged
>multiple times about URLs that worked fine when I checked them.  We
>should not be required to do a partial lobotomy to pass R CMD check ;-)
>
>	  Spencer Graves
>
>On 2021-01-07 09:53, Hugo Gruson wrote:
>>
>> I encountered the same issue today with https://astrostatistics.psu.edu/.
>>
>> This is a trust chain issue, as explained here:
>> https://whatsmychaincert.com/?astrostatistics.psu.edu.
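>>
>> A quick way to see this from R (a sketch using utils::curlGetHeaders();
>> the exact error text depends on the libcurl build):
>>
>> ## fails on machines that lack the missing intermediate certificate,
>> ## typically with a certificate-verification error from libcurl
>> try(curlGetHeaders("https://astrostatistics.psu.edu/"))
>> ## with verification disabled the request goes through, which points
>> ## at the trust chain rather than at the site being down
>> attr(curlGetHeaders("https://astrostatistics.psu.edu/", verify = FALSE), "status")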
>>
>> I've worked for a couple of years on a project to increase HTTPS
>> adoption on the web and we noticed that this type of error is very
>> common, and that website maintainers are often unresponsive to requests
>> to fix this issue.
>>
>> Therefore, I totally agree with Kirill that a list of known
>> false-positive/exceptions would be a great addition to save time for
>> both the CRAN team and package developers.
>>
>> Hugo
>>
>> On 07/01/2021 15:45, Kirill Müller via R-devel wrote:
>>> One other failure mode: SSL certificates that browsers trust but that
>>> are not installed on the check machine, e.g. the "GEANT Vereniging"
>>> certificate used by https://relational.fit.cvut.cz/ .
>>>
>>> K
>>>
>>> On 07.01.21 12:14, Kirill Müller via R-devel wrote:
>>>> Hi
>>>>
>>>> The URL checks in R CMD check test all links in the README and
>>>> vignettes and flag broken or redirected URLs. In many cases this
>>>> improves documentation, but I see problems with the approach, which I
>>>> have detailed below.
>>>>
>>>> I'm writing to this mailing list because I think the change needs to
>>>> happen in R's check routines. I propose to introduce an "allow-list"
>>>> for URLs, to reduce the burden on both CRAN and package maintainers.
>>>>
>>>> Comments are greatly appreciated.
>>>>
>>>> Best regards
>>>>
>>>> Kirill
>>>>
>>>> # Problems with the detection of broken/redirected URLs
>>>>
>>>> ## 301 should often be 307, how to change?
>>>>
>>>> Many web sites use a 301 redirection code where a 307 would probably
>>>> be more accurate. For example, https://www.oracle.com and
>>>> https://www.oracle.com/ both redirect to
>>>> https://www.oracle.com/index.html with a 301. I suspect the company
>>>> still wants oracle.com to be recognized as the primary entry point of
>>>> their web presence (reserving the right to move the redirection to a
>>>> different location later), though I haven't checked with their PR
>>>> department. If that's true, the redirect should probably be a 307, a
>>>> fix that would have to come from their IT department, which I haven't
>>>> contacted either.
>>>>
>>>> $ curl -i https://www.oracle.com
>>>> HTTP/2 301
>>>> server: AkamaiGHost
>>>> content-length: 0
>>>> location: https://www.oracle.com/index.html
>>>> ...
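>>>>
>>>> As a cross-check from R (a sketch using utils::curlGetHeaders(); the
>>>> 301 and the location header are what the server returned at the time
>>>> of writing):
>>>>
>>>> ## don't follow redirects, so we see the status of the first response
>>>> h <- curlGetHeaders("https://www.oracle.com", redirect = FALSE)
>>>> attr(h, "status")
>>>> #> [1] 301
>>>> grep("^location:", h, value = TRUE, ignore.case = TRUE)
>>>> #> [1] "location: https://www.oracle.com/index.html\r\n"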
>>>>
>>>> ## User agent detection
>>>>
>>>> twitter.com responds with a 400 error to requests whose user agent
>>>> string does not hint at an accepted browser.
>>>>
>>>> $ curl -i https://twitter.com/
>>>> HTTP/2 400
>>>> ...
>>>> <body>...<p>Please switch to a supported browser...</p>...</body>
>>>>
>>>> $ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux
>>>> x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
>>>> HTTP/2 200
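>>>>
>>>> The same difference can be reproduced from R with the curl package (a
>>>> sketch; the status codes are what the site returned at the time of
>>>> writing):
>>>>
>>>> library(curl)
>>>> ## request without a browser-like User-Agent string
>>>> curl_fetch_memory("https://twitter.com/")$status_code
>>>> #> [1] 400
>>>> ## the same request with a browser-like User-Agent string
>>>> ua <- "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0) Gecko/20100101 Firefox/84.0"
>>>> curl_fetch_memory("https://twitter.com/",
>>>>                   handle = new_handle(useragent = ua))$status_code
>>>> #> [1] 200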
>>>>
>>>> # Impact
>>>>
>>>> While the latter problem *could* be fixed by supplying a browser-like
>>>> user agent string, the former problem is virtually unfixable -- so
>>>> many web sites should use 307 instead of 301 but don't. The above
>>>> list is also incomplete -- think of unreliable links, HTTP links,
>>>> other failure modes...
>>>>
>>>> This affects me as a package maintainer: I have the choice to either
>>>> change the links to incorrect versions or remove them altogether.
>>>>
>>>> I could also explain each broken link to CRAN, but I think this puts
>>>> an undue burden on the team. Submitting a package with NOTEs delays
>>>> the release, and I must release this package very soon to avoid
>>>> having it pulled from CRAN; I'd rather not risk that -- hence I need
>>>> to remove the links now and put them back later.
>>>>
>>>> I'm aware of https://github.com/r-lib/urlchecker; it alleviates the
>>>> problem but ultimately doesn't solve it.
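>>>>
>>>> For reference, a typical invocation on a package source directory
>>>> (url_check() and url_update() are the exported helpers; install from
>>>> GitHub with remotes::install_github("r-lib/urlchecker")):
>>>>
>>>> ## list URLs that the CRAN-style checks would complain about
>>>> urlchecker::url_check(".")
>>>> ## rewrite URLs that are simple redirects to their new targets
>>>> urlchecker::url_update(".")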
>>>>
>>>> # Proposed solution
>>>>
>>>> ## Allow-list
>>>>
>>>> A file inst/URL that lists all URLs where failures are allowed --
>>>> possibly with a list of the HTTP codes accepted for that link.
>>>>
>>>> Example:
>>>>
>>>> https://oracle.com/ 301
>>>> https://twitter.com/drob/status/1224851726068527106 400
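>>>>
>>>> A minimal sketch of how a check routine could consume such a file
>>>> (the file location, format and helper names here are assumptions, not
>>>> a finished design):
>>>>
>>>> read_url_allowlist <- function(pkgdir) {
>>>>   path <- file.path(pkgdir, "inst", "URL")
>>>>   lines <- if (file.exists(path)) trimws(readLines(path)) else character()
>>>>   lines <- lines[nzchar(lines)]
>>>>   fields <- strsplit(lines, "[[:space:]]+")
>>>>   data.frame(
>>>>     url   = vapply(fields, `[[`, "", 1),
>>>>     codes = vapply(fields, function(f) paste(f[-1], collapse = ","), ""),
>>>>     stringsAsFactors = FALSE
>>>>   )
>>>> }
>>>>
>>>> ## e.g. url_is_allowed("https://oracle.com/", 301, allowlist) -> TRUE
>>>> url_is_allowed <- function(url, status, allowlist) {
>>>>   i <- match(url, allowlist$url)
>>>>   if (is.na(i)) return(FALSE)
>>>>   allowlist$codes[i] == "" ||
>>>>     as.character(status) %in% strsplit(allowlist$codes[i], ",")[[1]]
>>>> }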

