[Rd] URL checks

Spencer Graves spencer.graves at prodsyse.com
Fri Jan 8 13:04:13 CET 2021


	  I also would be pleased to be allowed to provide "a list of known 
false-positive/exceptions" to the URL tests.  I've been challenged 
multiple times regarding URLs that worked fine when I checked them.  We 
should not be required to do a partial lobotomy to pass R CMD check ;-)


	  Spencer Graves


On 2021-01-07 09:53, Hugo Gruson wrote:
> 
> I encountered the same issue today with https://astrostatistics.psu.edu/.
> 
> This is a trust chain issue, as explained here: 
> https://whatsmychaincert.com/?astrostatistics.psu.edu.
> 
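> A quick way to see this from R -- a minimal sketch, assuming the 
> 'openssl' package -- is to download the chain the server actually 
> sends and verify the leaf certificate against the bundled CAs. A 
> server that omits its intermediate certificate will typically fail 
> here even though browsers (which cache or fetch intermediates) show 
> no error:
> 
> library(openssl)
> chain <- download_ssl_cert("astrostatistics.psu.edu", 443)
> length(chain)  # 1 would mean no intermediate certificates were sent
> try(cert_verify(chain[[1]], root = ca_bundle()))
> 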
> I've worked for a couple of years on a project to increase HTTPS 
> adoption on the web and we noticed that this type of error is very 
> common, and that website maintainers are often unresponsive to requests 
> to fix this issue.
> 
> Therefore, I totally agree with Kirill that a list of known 
> false-positive/exceptions would be a great addition to save time for both 
> the CRAN team and package developers.
> 
> Hugo
> 
> On 07/01/2021 15:45, Kirill Müller via R-devel wrote:
>> One other failure mode: SSL certificates trusted by browsers that are 
>> not installed on the check machine, e.g. the "GEANT Vereniging" 
>> certificate from https://relational.fit.cvut.cz/ .
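>>
>> One way to confirm that only the local CA store is at fault -- a 
>> rough sketch, assuming the 'curl' package and a hypothetical 
>> "geant-chain.pem" exported from a browser or from 
>> whatsmychaincert.com -- is to retry the request with an explicit CA 
>> bundle:
>>
>> library(curl)
>> h <- new_handle(cainfo = "geant-chain.pem")
>> curl_fetch_memory("https://relational.fit.cvut.cz/", handle = h)$status_code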
>>
>>
>> K
>>
>>
>> On 07.01.21 12:14, Kirill Müller via R-devel wrote:
>>> Hi
>>>
>>>
>>> The URL checks in R CMD check test all links in the README and 
>>> vignettes for breakage or redirection. In many cases this improves 
>>> documentation, but I see problems with this approach, which I have 
>>> detailed below.
>>>
>>> I'm writing to this mailing list because I think the change needs to 
>>> happen in R's check routines. I propose to introduce an "allow-list" 
>>> for URLs, to reduce the burden on both CRAN and package maintainers.
>>>
>>> Comments are greatly appreciated.
>>>
>>>
>>> Best regards
>>>
>>> Kirill
>>>
>>>
>>> # Problems with the detection of broken/redirected URLs
>>>
>>> ## 301 should often be 307, how to change?
>>>
>>> Many web sites use a 301 redirection code that probably should be a 
>>> 307. For example, https://www.oracle.com and https://www.oracle.com/ 
>>> both redirect to https://www.oracle.com/index.html with a 301. I 
>>> suspect the company still wants oracle.com to be recognized as the 
>>> primary entry point of their web presence (to reserve the right to 
>>> move the redirection to a different location later), though I haven't 
>>> checked with their PR department. If that's true, the redirect 
>>> probably should be a 307, a fix that would have to come from their IT 
>>> department, which I haven't contacted yet either.
>>>
>>> $ curl -i https://www.oracle.com
>>> HTTP/2 301
>>> server: AkamaiGHost
>>> content-length: 0
>>> location: https://www.oracle.com/index.html
>>> ...
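>>>
>>> The same observation can be reproduced from R -- a small sketch, 
>>> assuming the 'curl' package -- by disabling redirect following and 
>>> reading the Location header:
>>>
>>> library(curl)
>>> h <- new_handle(followlocation = FALSE)
>>> res <- curl_fetch_memory("https://www.oracle.com", handle = h)
>>> res$status_code                           # 301 at the time of writing
>>> parse_headers_list(res$headers)$location  # "https://www.oracle.com/index.html"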
>>>
>>> ## User agent detection
>>>
>>> twitter.com responds with a 400 error to requests whose user agent 
>>> string does not hint at an accepted browser.
>>>
>>> $ curl -i https://twitter.com/
>>> HTTP/2 400
>>> ...
>>> <body>...<p>Please switch to a supported browser...</p>...</body>
>>>
>>> $ curl -s -i https://twitter.com/ -A "Mozilla/5.0 (X11; Ubuntu; Linux 
>>> x86_64; rv:84.0) Gecko/20100101 Firefox/84.0" | head -n 1
>>> HTTP/2 200
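>>>
>>> From R, the behaviour is the same -- a minimal sketch, assuming the 
>>> 'curl' package (whether the plain request fails depends on the 
>>> default user agent the client sends):
>>>
>>> library(curl)
>>> ua <- paste("Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:84.0)",
>>>             "Gecko/20100101 Firefox/84.0")
>>> h <- new_handle(useragent = ua)
>>> curl_fetch_memory("https://twitter.com/", handle = h)$status_code  # 200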
>>>
>>> # Impact
>>>
>>> While the latter problem *could* be fixed by supplying a browser-like 
>>> user agent string, the former problem is virtually unfixable -- so 
>>> many web sites should use 307 instead of 301 but don't. The above 
>>> list is also incomplete -- think of unreliable links, HTTP links, 
>>> other failure modes...
>>>
>>> This affects me as a package maintainer: I have the choice to either 
>>> change the links to incorrect versions or remove them altogether.
>>>
>>> I can also choose to explain each broken link to CRAN, but I think 
>>> this subjects the team to undue burden. Submitting a package with 
>>> NOTEs delays the release of a package which I must release very soon 
>>> to avoid having it pulled from CRAN; I'd rather not risk that -- 
>>> hence I need to remove the links and put them back later.
>>>
>>> I'm aware of https://github.com/r-lib/urlchecker, which alleviates 
>>> the problem but ultimately doesn't solve it.
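>>>
>>> For reference, the workflow it offers is roughly the following (both 
>>> functions are exported by the package):
>>>
>>> urlchecker::url_check(".")   # run the URL checks locally
>>> urlchecker::url_update(".")  # rewrite URLs that permanently redirect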
>>>
>>> # Proposed solution
>>>
>>> ## Allow-list
>>>
>>> A file inst/URL that lists all URLs where failures are allowed -- 
>>> possibly with a list of the HTTP codes accepted for that link.
>>>
>>> Example:
>>>
>>> https://oracle.com/ 301
>>> https://twitter.com/drob/status/1224851726068527106 400
>>>
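>>> A rough sketch of how a check routine might consume such a file (the 
>>> format above is only a proposal, and the parsing below is purely 
>>> illustrative):
>>>
>>> read_url_allowlist <- function(path = "inst/URL") {
>>>   lines <- trimws(readLines(path, warn = FALSE))
>>>   parts <- strsplit(lines[nzchar(lines)], "[[:space:]]+")
>>>   list(
>>>     url   = vapply(parts, `[[`, character(1), 1),
>>>     codes = lapply(parts, `[`, -1)
>>>   )
>>> }
>>>
>>> # TRUE if the URL is allow-listed and the observed status is among the
>>> # accepted codes (or no codes were given for that URL)
>>> url_failure_allowed <- function(url, status, allow) {
>>>   i <- match(url, allow$url)
>>>   if (is.na(i)) return(FALSE)
>>>   length(allow$codes[[i]]) == 0 || as.character(status) %in% allow$codes[[i]]
>>> }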