[Rd] Proposal to limit Internet access during package load

Gabriel Becker <gabembecker using gmail.com>
Tue Sep 27 05:18:39 CEST 2022


Ah, that's embarrassing. That's a bug in how/where I handle lack of
connectivity, rather than me not doing it. I've just pushed a fix to the
GitHub repo that now cleanly passes check with no internet connectivity
(much more stringent).

Using a canned file is a bit odd, because in the case where there's no
connectivity, the package won't work (the canned file would just set the
repositories to URLs that R still won't be able to reach).
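
For readers following along, the recover-and-fall-back pattern Simon
suggests below could be sketched roughly as follows. This is only an
illustration under stated assumptions, not switchr's actual code: the
function name `read_bioc_config`, its `fallback` and `reader` parameters,
and the shipped `extdata/config.yaml` path are all hypothetical.

```r
# Hypothetical helper (not part of switchr): try to read the online config,
# and fall back to a copy shipped with the package if the read fails.
read_bioc_config <- function(url = "http://bioconductor.org/config.yaml",
                             # Assumed location of a shipped fallback copy;
                             # system.file() returns "" if it does not exist.
                             fallback = system.file("extdata", "config.yaml",
                                                    package = "switchr"),
                             # The reader is injectable so the failure path
                             # can be exercised without a network.
                             reader = readLines) {
  tryCatch(
    suppressWarnings(reader(url)),
    error = function(e) {
      if (nzchar(fallback) && file.exists(fallback)) {
        readLines(fallback)
      } else {
        character(0)  # nothing available; the caller must handle this case
      }
    }
  )
}

# Simulate "no network" by passing a reader that always fails.
tmp <- tempfile(fileext = ".yaml")
writeLines("key: value", tmp)
offline <- function(u) stop("cannot open the connection")
print(read_bioc_config(fallback = tmp, reader = offline))
```

The point of the sketch is only that load never errors out; whether the
fallback contents are actually useful without connectivity is the separate
question raised above.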

Anyway,
Thanks
~G

On Mon, Sep 26, 2022 at 3:11 PM Simon Urbanek <simon.urbanek using r-project.org>
wrote:

>
>
> > On 27/09/2022, at 11:02 AM, Gabriel Becker <gabembecker using gmail.com>
> wrote:
> >
> > For the record, the only things switchr (my package) is doing
> > internet-wise should be hitting the Bioconductor config file
> > (http://bioconductor.org/config.yaml) so that it knows the things it needs
> > to know about Bioc repos/versions/etc. (at load time, actually, not install
> > time, but since install does a test load, those are essentially the same).
> >
> > I have fallback behavior for when the file can't be read, so there
> shouldn't be any actual build breakages/install breakages I don't think,
> but the check does happen.
> >
>
> $ sandbox-exec -n no-network R CMD INSTALL switchr_0.14.5.tar.gz
> [...]
> ** testing if installed package can be loaded from final location
> Error in readLines(con) :
>   cannot open the connection to 'http://bioconductor.org/config.yaml'
> Calls: <Anonymous> ... getBiocDevelVr -> getBiocYaml -> inet_handlers ->
> readLines
> Execution halted
> ERROR: loading failed
>
> So, yes, it does break. You should recover from the error and use a
> fall-back file that you ship.
>
> Cheers,
> Simon
>
>
> > Advice on what to do for the above use case that is better practice is
> welcome.
> >
> > ~G
> >
> > On Mon, Sep 26, 2022 at 2:40 PM Simon Urbanek <
> simon.urbanek using r-project.org> wrote:
> >
> >
> > > On 27/09/2022, at 10:21 AM, Iñaki Ucar <iucar using fedoraproject.org>
> wrote:
> > >
> > > On Mon, 26 Sept 2022 at 23:07, Simon Urbanek
> > > <simon.urbanek using r-project.org> wrote:
> > >>
> > >> Iñaki,
> > >>
> > >> I'm not sure I understand - system dependencies are an entirely
> > >> different topic and I would argue a far more important one (very happy to
> > >> start a discussion about that), but that has nothing to do with declaring
> > >> downloads. I assumed your question was about large files which packages
> > >> avoid shipping and download instead, so declaring them would be useful.
> > >
> > > Exactly. Maybe there's a misunderstanding, because I didn't talk about
> system dependencies (alas there are packages that try to download things
> that are declared as system dependencies, as Gabe noted). :)
> > >
> >
> >
> > Ok, understood. I would like to tackle those as well, but let's start
> that conversation in a few weeks when I have a lot more time.
> >
> >
> > >> And for that, the obvious answer is they shouldn't do that - if a
> package needs a file to run, it should include it. So an easy solution is
> to disallow it.
> > >
> > > Then we completely agree. My proposal about declaring additional
> sources was because, given that so many packages do this, I thought that I
> would find a strong opposition to this. But if R Core / CRAN is ok with
> just limiting net access at install time, then that's perfect to me. :)
> > >
> >
> > Yes we do agree :). I started looking at your list, and so far those
> seem simply bugs or design deficiencies in the packages (and outright
> policy violations). I think the only reason they exist is that it doesn't
> get detected in CRAN incoming, it's certainly not intentional.
> >
> > Cheers,
> > Simon
> >
> >
> > > Iñaki
> > >
> > >> But so far all examples were just (ab)use of downloads for binary
> > >> dependencies, which is an entirely different issue that needs a different
> > >> solution (in a naive way, declaring such dependencies, but we know it's not
> > >> that simple - and download URLs don't help there).
> > >>
> > >> Cheers,
> > >> Simon
> > >>
> > >>
> > >>> On 27/09/2022, at 8:25 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
> > >>>
> > >>> On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
> > >>> <simon.urbanek using r-project.org> wrote:
> > >>>>
> > >>>> Iñaki,
> > >>>>
> > >>>> I fully agree, this is a very common issue since the vast majority of
> > >>>> server deployments I have encountered don't allow internet access. In
> > >>>> practice this means that such packages are effectively banned.
> > >>>>
> > >>>> I would argue that not even (1) or (2) are really an issue, because
> in fact the CRAN policy doesn't impose any absolute limits on size, it only
> states that the package should be "of minimum necessary size" which means
> it shouldn't waste space. If there is no way to reduce the size without
> impacting functionality, it's perfectly fine.
> > >>>
> > >>> "Packages should be of the minimum necessary size" is subject to
> > >>> interpretation. And in practice, there is an issue with e.g. packages
> > >>> that "bundle" big third-party libraries. There are also packages that
> > >>> require downloading precompiled code, JARs... at installation time.
> > >>>
> > >>>> That said, there are exceptions such as very large datasets (e.g.,
> as distributed by Bioconductor) which are orders of magnitude larger than
> what is sustainable. I agree that it would be nice to have a mechanism for
> specifying such sources. So yes, I like the idea, but I'd like to see more
> real use cases to justify the effort.
> > >>>
> > >>> "More real use cases" like in "more use cases" or like in "the
> > >>> previous ones are not real ones"? :)
> > >>>
> > >>>> The issue with any online downloads, though, is that there is no
> > >>>> guarantee of availability - which is a real issue for reproducibility.
> > >>>> So one could argue that if such external sources are required then they
> > >>>> should be on a well-defined, independent, permanent storage such as
> > >>>> Zenodo. This could be a matter of policy as opposed to the technical
> > >>>> side above, which would be adding such support to R CMD INSTALL.
> > >>>
> > >>> Not necessarily. If the package declares the additional sources in the
> > >>> DESCRIPTION (probably with hashes), that's a big improvement over the
> > >>> current state of things, in which basically we don't know what the
> > >>> package tries to download, then it may fail, and finally there's no
> > >>> guarantee that it's what the author intended in the first place.
> > >>>
> > >>> But on top of this, R could add a CMD to download those, and then
> some
> > >>> lookaside storage could be used on CRAN. This is e.g. how RPM
> > >>> packaging works: the spec declares all the sources, they are
> > >>> downloaded once, hashed and stored in a lookaside cache. Then package
> > >>> building doesn't need general Internet connectivity, just access to
> > >>> the cache.
> > >>>
> > >>> Iñaki
> > >>>
> > >>>>
> > >>>> Cheers,
> > >>>> Simon
> > >>>>
> > >>>>
> > >>>>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iucar using fedoraproject.org>
> wrote:
> > >>>>>
> > >>>>> Hi all,
> > >>>>>
> > >>>>> I'd like to open this debate here, because IMO this is a big issue.
> > >>>>> Many packages do this for various reasons, some more legitimate than
> > >>>>> others, but I think that this shouldn't be allowed, because it
> > >>>>> basically means that installation fails on a machine without Internet
> > >>>>> access (which happens e.g. in Linux distro builders for security
> > >>>>> reasons).
> > >>>>>
> > >>>>> Now, what if connection is suppressed during package load? There
> are
> > >>>>> basically three use cases out there:
> > >>>>>
> > >>>>> (1) The package requires additional files for the installation
> (e.g.
> > >>>>> the source code of an external library) that cannot be bundled into
> > >>>>> the package due to CRAN restrictions (size).
> > >>>>> (2) The package requires additional files for using it (e.g.,
> > >>>>> datasets, a JAR...) that cannot be bundled into the package due to
> > >>>>> CRAN restrictions (size).
> > >>>>> (3) Other spurious reasons (e.g. the maintainer decided that
> package
> > >>>>> load was a good place to check an online service availability,
> etc.).
> > >>>>>
> > >>>>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
> > >>>>> separate function that the user actively calls to download the files,
> > >>>>> and those files should be placed into the user dir; and (1) is the
> > >>>>> only legitimate use, but then another mechanism should be provided to
> > >>>>> avoid connections during package load.
> > >>>>>
> > >>>>> My proposal to support (1) would be to add a new field in the
> > >>>>> DESCRIPTION, "Additional_sources", which would be a comma-separated
> > >>>>> list of additional resources to download during R CMD INSTALL. Those
> > >>>>> sources would be downloaded by R CMD INSTALL if not provided via an
> > >>>>> option (to support offline installations), and would be placed in a
> > >>>>> predefined place for the package to find and configure them (via an
> > >>>>> environment variable or in a predefined subdirectory).
> > >>>>>
> > >>>>> This proposal has several advantages. Apart from the obvious one
> > >>>>> (Internet access during package load can be limited without losing
> > >>>>> current functionalities), it gives more visibility to the resources
> > >>>>> that packages are using during the installation phase, and thus
> makes
> > >>>>> those installations more reproducible and more secure.
> > >>>>>
> > >>>>> Best,
> > >>>>> --
> > >>>>> Iñaki Úcar
> > >>>>>
> > >>>>> ______________________________________________
> > >>>>> R-devel using r-project.org mailing list
> > >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> > >>>>>
> > >>>>
> > >>>
> > >>>
> > >>> --
> > >>> Iñaki Úcar
> > >>
> > >
> > >
> > > --
> > > Iñaki Úcar
> > >
> >
>
>
