[Rd] Proposal to limit Internet access during package load

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Tue Sep 27 00:02:08 CEST 2022


For the record, the only thing switchr (my package) should be doing
internet-wise is hitting the Bioconductor config file (
http://bioconductor.org/config.yaml) so that it knows what it needs to
know about Bioc repos/versions/etc. (at load time, actually, not install
time, but since install does a test load, those are essentially the same).

I have fallback behavior for when the file can't be read, so I don't think
there should be any actual build or install breakages, but the check does
happen.
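
A minimal sketch of that pattern, not switchr's actual code: try to read the
remote file with a short timeout and fall back to local defaults if the read
fails. The function name read_remote_config, the empty default fallback, and
the 5-second timeout are assumptions made for illustration.

```r
# Hypothetical helper (not switchr's real code): read a remote config file,
# returning `fallback` when it cannot be read (e.g. no network access).
read_remote_config <- function(url = "http://bioconductor.org/config.yaml",
                               fallback = character(0),
                               timeout = 5) {
  # Shorten the connection timeout so a blocked network does not stall
  # the load; restore the previous setting on exit.
  old <- options(timeout = timeout)
  on.exit(options(old), add = TRUE)
  tryCatch(
    suppressWarnings(readLines(url, warn = FALSE)),
    error = function(e) fallback  # any read failure: use the defaults
  )
}
```

With no network, read_remote_config() returns the fallback instead of
raising an error, so a test load during install would still succeed.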

Advice on better practice for the above use case is welcome.

~G

On Mon, Sep 26, 2022 at 2:40 PM Simon Urbanek <simon.urbanek using r-project.org>
wrote:

>
>
> > On 27/09/2022, at 10:21 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
> >
> > On Mon, 26 Sept 2022 at 23:07, Simon Urbanek
> > <simon.urbanek using r-project.org> wrote:
> >>
> >> Iñaki,
> >>
> >> I'm not sure I understand - system dependencies are an entirely
> >> different topic and I would argue a far more important one (very happy
> >> to start a discussion about that), but that has nothing to do with
> >> declaring downloads. I assumed your question was about large files that
> >> packages avoid shipping and download instead, so declaring them would
> >> be useful.
> >
> > Exactly. Maybe there's a misunderstanding, because I didn't talk about
> > system dependencies (alas there are packages that try to download things
> > that are declared as system dependencies, as Gabe noted). :)
> >
>
>
> Ok, understood. I would like to tackle those as well, but let's start that
> conversation in a few weeks when I have a lot more time.
>
>
> >> And for that, the obvious answer is they shouldn't do that - if a
> >> package needs a file to run, it should include it. So an easy solution
> >> is to disallow it.
> >
> > Then we completely agree. My proposal about declaring additional sources
> > was because, given that so many packages do this, I thought that I would
> > find strong opposition to this. But if R Core / CRAN is ok with just
> > limiting net access at install time, then that's perfect to me. :)
> >
>
> Yes we do agree :). I started looking at your list, and so far those seem
> simply bugs or design deficiencies in the packages (and outright policy
> violations). I think the only reason they exist is that it doesn't get
> detected in CRAN incoming, it's certainly not intentional.
>
> Cheers,
> Simon
>
>
> > Iñaki
> >
> >> But so far all examples were just (ab)use of downloads for binary
> >> dependencies, which is an entirely different issue that needs a
> >> different solution (in a naive way declaring such dependencies, but we
> >> know it's not that simple - and download URLs don't help there).
> >>
> >> Cheers,
> >> Simon
> >>
> >>
> >>> On 27/09/2022, at 8:25 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
> >>>
> >>> On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
> >>> <simon.urbanek using r-project.org> wrote:
> >>>>
> >>>> Iñaki,
> >>>>
> >>>> I fully agree, this is a very common issue since the vast majority of
> >>>> server deployments I have encountered don't allow internet access. In
> >>>> practice this means that such packages are effectively banned.
> >>>>
> >>>> I would argue that not even (1) or (2) are really an issue, because
> >>>> in fact the CRAN policy doesn't impose any absolute limits on size, it
> >>>> only states that the package should be "of minimum necessary size"
> >>>> which means it shouldn't waste space. If there is no way to reduce the
> >>>> size without impacting functionality, it's perfectly fine.
> >>>
> >>> "Packages should be of the minimum necessary size" is subject to
> >>> interpretation. And in practice, there is an issue with e.g. packages
> >>> that "bundle" big third-party libraries. There are also packages that
> >>> require downloading precompiled code, JARs... at installation time.
> >>>
> >>>> That said, there are exceptions such as very large datasets (e.g., as
> >>>> distributed by Bioconductor) which are orders of magnitude larger than
> >>>> what is sustainable. I agree that it would be nice to have a mechanism
> >>>> for specifying such sources. So yes, I like the idea, but I'd like to
> >>>> see more real use cases to justify the effort.
> >>>
> >>> "More real use cases" like in "more use cases" or like in "the
> >>> previous ones are not real ones"? :)
> >>>
> >>>> The issue with any online downloads, though, is that there is no
> >>>> guarantee of availability - which is a real issue for reproducibility.
> >>>> So one could argue that if such external sources are required then
> >>>> they should be on a well-defined, independent, permanent storage such
> >>>> as Zenodo. This could be a matter of policy as opposed to the
> >>>> technical side above which would be adding such support to R CMD
> >>>> INSTALL.
> >>>
> >>> Not necessarily. If the package declares the additional sources in the
> >>> DESCRIPTION (probably with hashes), that's a big improvement over the
> >>> current state of things, in which basically we don't know what the
> >>> package tries to download, then it may fail, and finally there's no
> >>> guarantee that it's what the author intended in the first place.
> >>>
> >>> But on top of this, R could add a CMD to download those, and then some
> >>> lookaside storage could be used on CRAN. This is e.g. how RPM
> >>> packaging works: the spec declares all the sources, they are
> >>> downloaded once, hashed and stored in a lookaside cache. Then package
> >>> building doesn't need general Internet connectivity, just access to
> >>> the cache.
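
As a rough illustration of the lookaside idea quoted above, the sketch below
looks up a declared source in a local cache directory and refuses to use it
unless its checksum matches the declared one. The function name
get_cached_source and the cache layout are hypothetical. MD5 is used only
because tools::md5sum() ships with R; a real design would presumably prefer a
stronger hash.

```r
# Hypothetical helper: return the path of a declared source from a local
# lookaside cache, but only if its MD5 matches the declared checksum.
get_cached_source <- function(name, md5, cache_dir) {
  path <- file.path(cache_dir, name)
  if (!file.exists(path))
    stop("source not in cache: ", name)
  actual <- unname(tools::md5sum(path))  # lowercase hex digest
  if (!identical(actual, tolower(md5)))
    stop("checksum mismatch for ", name)
  path
}
```

Under this scheme the build step only needs read access to cache_dir, never
general Internet connectivity.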
> >>>
> >>> Iñaki
> >>>
> >>>>
> >>>> Cheers,
> >>>> Simon
> >>>>
> >>>>
> >>>>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
> >>>>>
> >>>>> Hi all,
> >>>>>
> >>>>> I'd like to open this debate here, because IMO this is a big issue.
> >>>>> Many packages do this for various reasons, some more legitimate than
> >>>>> others, but I think that this shouldn't be allowed, because it
> >>>>> basically means that installation fails on a machine without Internet
> >>>>> access (which happens e.g. in Linux distro builders for security
> >>>>> reasons).
> >>>>>
> >>>>> Now, what if connection is suppressed during package load? There are
> >>>>> basically three use cases out there:
> >>>>>
> >>>>> (1) The package requires additional files for the installation (e.g.
> >>>>> the source code of an external library) that cannot be bundled into
> >>>>> the package due to CRAN restrictions (size).
> >>>>> (2) The package requires additional files for using it (e.g.,
> >>>>> datasets, a JAR...) that cannot be bundled into the package due to
> >>>>> CRAN restrictions (size).
> >>>>> (3) Other spurious reasons (e.g. the maintainer decided that package
> >>>>> load was a good place to check an online service availability, etc.).
> >>>>>
> >>>>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
> >>>>> separate function that the user actively calls to download the files,
> >>>>> and those files should be placed into the user dir, and (1) is the
> >>>>> only legitimate use, but then another mechanism should be provided to
> >>>>> avoid connections during package load.
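
For case (2), one possible shape of such a user-invoked function is sketched
below. The name download_pkg_data is hypothetical; tools::R_user_dir()
(R >= 4.0.0) is used as one reasonable reading of "the user dir".

```r
# Hypothetical user-facing downloader: nothing here runs at load time.
# The user calls it explicitly, and the file lands in a per-user directory.
download_pkg_data <- function(url, package,
                              dir = tools::R_user_dir(package, which = "data")) {
  dir.create(dir, recursive = TRUE, showWarnings = FALSE)
  dest <- file.path(dir, basename(url))
  # Skip the download if the file is already present.
  if (!file.exists(dest))
    utils::download.file(url, dest, mode = "wb", quiet = TRUE)
  invisible(dest)
}
```

Because the download only happens when the user asks for it, loading the
package itself never needs a connection.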
> >>>>>
> >>>>> My proposal to support (1) would be to add a new field in the
> >>>>> DESCRIPTION, "Additional_sources", which would be a comma-separated
> >>>>> list of additional resources to download during R CMD INSTALL. Those
> >>>>> sources would be downloaded by R CMD INSTALL if not provided via an
> >>>>> option (to support offline installations), and would be placed in a
> >>>>> predefined place for the package to find and configure them (via an
> >>>>> environment variable or in a predefined subdirectory).
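
To make the proposed field concrete, here is a sketch of how it might look
and be parsed. "Additional_sources" is the field proposed above and does not
exist in R today; the package name and URLs are placeholders.

```r
# A made-up DESCRIPTION using the proposed (not currently supported) field.
desc <- c(
  "Package: examplepkg",
  "Version: 0.1.0",
  "Additional_sources: https://example.org/libfoo.tar.gz,",
  "    https://example.org/reference.rds"
)
path <- tempfile()
writeLines(desc, path)

# read.dcf() already handles arbitrary fields and continuation lines,
# so splitting the value on commas is enough to recover the list.
field <- read.dcf(path, fields = "Additional_sources")[1, 1]
sources <- trimws(strsplit(field, ",")[[1]])
```

A tool such as R CMD INSTALL could then fetch each entry of sources up
front, or take them from a local directory for offline installs.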
> >>>>>
> >>>>> This proposal has several advantages. Apart from the obvious one
> >>>>> (Internet access during package load can be limited without losing
> >>>>> current functionalities), it gives more visibility to the resources
> >>>>> that packages are using during the installation phase, and thus makes
> >>>>> those installations more reproducible and more secure.
> >>>>>
> >>>>> Best,
> >>>>> --
> >>>>> Iñaki Úcar
> >>>>>
> >>>>> ______________________________________________
> >>>>> R-devel using r-project.org mailing list
> >>>>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>>>
> >>>>
> >>>
> >>>
> >>> --
> >>> Iñaki Úcar
> >>
> >
> >
> > --
> > Iñaki Úcar
> >
>
