[Rd] read.table() fails with https in R 3.6 but not in R 3.5

Tomas Kalibera
Mon May 13 12:42:43 CEST 2019


On 5/6/19 2:27 PM, Stephen Berman wrote:
> On Mon, 6 May 2019 11:12:25 +0200 Ralf Stubner <ralf.stubner using daqana.com> wrote:
>
>> On 04.05.19 19:04, Stephen Berman wrote:
>>> In versions of R prior to 3.6.0 the following invocation succeeds,
>>> returning the data frame shown:
>>>
>>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>>> header=TRUE)
>>>     Dekade   Anzahl
>>> 1    1900 11467254
>>> 2    1910 13023370
>>> 3    1920 13434601
>>> 4    1930 13296355
>>> 5    1940 12121250
>>> 6    1950 13191131
>>> 7    1960 10587420
>>> 8    1970 10944129
>>> 9    1980 11279439
>>> 10   1990 12052652
>>>
>>> But in version 3.6.0 it fails:
>>>
>>>> read.table("https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
>>>> header=TRUE)
>>> Error in file(file, "rt") :
>>>    cannot open the connection to
>>> 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text'
>>> In addition: Warning message:
>>> In file(file, "rt") :
>>>    cannot open URL
>>> 'https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text':
>>> HTTP status was '403 Forbidden'
>> I can reproduce the behavior on Debian using the CRAN supplied package
>> for R 3.6.0. Trying to read the page with 'curl' also produces a 403
>> error plus some HTML text (in German) explaining that I am treated as a
>> 'robot' due to the supplied User-Agent (here: curl/7.52.1). One
>> suggested solution is to adjust that value which does solve the issue:
>>
>>   > options(HTTPUserAgent='mozilla')
> I confirm that works for me, too.  Thanks!  FWIW, the default value of
> HTTPUserAgent in R 3.6 here is "R (3.6.0 x86_64-suse-linux-gnu x86_64
> linux-gnu)", and using this (in R 3.6) fails as I reported, while the
> default value of HTTPUserAgent in R 3.5 here is "R (3.5.0
> x86_64-suse-linux-gnu x86_64 linux-gnu)" and using that (in R 3.5)
> succeeds.  However, setting HTTPUserAgent in R 3.5 to "libcurl/7.60.0"
> fails just as it does in 3.6.  It's not clear to me if this particular
> website is being too restrictive or if R 3.6 should deal with it, or at
> least mention the issue in NEWS or somewhere else.

This is because of the following change (from NEWS):

The default ‘user agent’ has been changed when accessing http://
       and https:// sites using libcurl.  (A site was found which caused
       libcurl to infinite-loop with the previous default.)

This website is OK with the default R user agent specification (including
the ones for R 3.6 and R-devel), but it is not OK with "libcurl/...".
Setting the user agent to anything starting with "R (" will not help in
R 3.6, because it will automatically be changed to "libcurl/..." when
libcurl is used (note that wget and curl on the command line also fail on
this website). I am afraid this has to be solved on the user side (e.g.
as hinted in the German text one gets when requesting the page using
curl): R should not attempt to circumvent access restrictions on external
websites.
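
For illustration only, a minimal sketch of such a user-side setting, scoped
to a single call so that the session default is restored afterwards. The
helper name read_table_with_agent and the agent string in the commented
example are hypothetical; whether a given site accepts a particular user
agent is up to that site and is not verified here.

```r
# Hypothetical helper (not part of R): read a table while a user-chosen
# HTTPUserAgent option is temporarily in effect.
read_table_with_agent <- function(url, agent, ...) {
  # options() returns the previous value(s), so they can be restored on exit
  old <- options(HTTPUserAgent = agent)
  on.exit(options(old), add = TRUE)
  read.table(url, ...)
}

# Example call (needs network access; the agent string is an assumption):
# read_table_with_agent(
#   "https://www.dwds.de/r/stat?corpus=kern&cnt=tokens&date=decade&format=text",
#   agent = "my-analysis-script/0.1",
#   header = TRUE)
```

Setting the option inside a function with on.exit() keeps the change local,
so other downloads in the same session still use the default user agent.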

Best
Tomas

>
> Steve Berman
>
> ______________________________________________
> R-devel using r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel


