[Rd] Proposal to limit Internet access during package load

Gabriel Becker g@bembecker @end|ng |rom gm@||@com
Mon Sep 26 22:02:33 CEST 2022


Hi Simon,

The most popular and widely used example of this "in the wild" that I'm
aware of is the stringi package (a dep of the widely used stringr pkg),
whose configure file downloads the ICU Data Library (icudt).

See https://github.com/gagolews/stringi/blob/master/configure#L5412

Note it does have some sort of workaround in place for non-internet-capable
build machines, but it is external: the build in question fails unless the
workaround has already been explicitly performed.
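
To make the "Additional_sources" idea quoted below a bit more concrete, here
is a rough sketch of the offline half of it: read the declared sources from
a DESCRIPTION-style file, look them up in a directory that was populated
ahead of time, and check their hashes. It is written in Python purely for
brevity. Everything in it is an assumption for illustration: the
"<url> sha256=<hex>" entry syntax, the function names and the example.org
URL are all made up, and none of this is how R CMD INSTALL or stringi
currently behaves. The download-if-missing branch is deliberately left out.

```python
import hashlib
import tempfile
from pathlib import Path


def parse_additional_sources(description_text):
    """Return (url, sha256) pairs from a hypothetical Additional_sources field.

    Assumed entry syntax (not an existing R convention): "<url> sha256=<hex>",
    entries separated by commas, continuation lines indented as in DCF files.
    URLs that themselves contain commas are not handled by this sketch.
    """
    fields = {}
    key = None
    for line in description_text.splitlines():
        if line[:1] in (" ", "\t") and key is not None:
            # Indented line: continuation of the previous field.
            fields[key] += " " + line.strip()
        elif ":" in line:
            key, value = line.split(":", 1)
            fields[key] = value.strip()
    entries = []
    for item in fields.get("Additional_sources", "").split(","):
        if not item.strip():
            continue
        url, digest = item.split()
        entries.append((url, digest.partition("=")[2]))
    return entries


def resolve_source(url, sha256, offline_dir):
    """Locate the file for `url` in a pre-populated directory and verify it.

    No network access is attempted: a missing or corrupted file is an error.
    """
    candidate = Path(offline_dir) / url.rsplit("/", 1)[-1]
    if not candidate.is_file():
        raise FileNotFoundError(f"{candidate.name} not provided for {url}")
    actual = hashlib.sha256(candidate.read_bytes()).hexdigest()
    if actual != sha256:
        raise ValueError(f"checksum mismatch for {candidate.name}")
    return candidate


if __name__ == "__main__":
    payload = b"stand-in for a large data file"
    digest = hashlib.sha256(payload).hexdigest()
    description = (
        "Package: examplepkg\n"
        "Version: 0.1.0\n"
        "Additional_sources:\n"
        f"    https://example.org/files/data.bin sha256={digest}\n"
    )
    with tempfile.TemporaryDirectory() as offline_dir:
        # Simulate a builder that fetched the file ahead of time.
        (Path(offline_dir) / "data.bin").write_bytes(payload)
        for url, sha in parse_additional_sources(description):
            print(resolve_source(url, sha, offline_dir).name)
```

The point of the sketch is only that, once the sources and their hashes are
declared up front, a build machine without Internet access can be handed the
files in advance and still verify that they are what the author intended.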

Best,
~G



On Mon, Sep 26, 2022 at 12:50 PM Simon Urbanek <simon.urbanek using r-project.org>
wrote:

>
>
> > On Sep 27, 2022, at 8:25 AM, Iñaki Ucar <iucar using fedoraproject.org> wrote:
> >
> > On Sat, 24 Sept 2022 at 01:55, Simon Urbanek
> > <simon.urbanek using r-project.org> wrote:
> >>
> >> Iñaki,
> >>
> >> I fully agree, this is a very common issue since the vast majority of server
> deployments I have encountered don't allow internet access. In practice
> this means that such packages are effectively banned.
> >>
> >> I would argue that not even (1) or (2) are really an issue, because in
> fact the CRAN policy doesn't impose any absolute limits on size, it only
> states that the package should be "of minimum necessary size" which means
> it shouldn't waste space. If there is no way to reduce the size without
> impacting functionality, it's perfectly fine.
> >
> > "Packages should be of the minimum necessary size" is subject to
> > interpretation. And in practice, there is an issue with e.g. packages
> > that "bundle" big third-party libraries. There are also packages that
> > require downloading precompiled code, JARs... at installation time.
> >
>
> JARs are part of the package, so that's a valid use, no question there,
> that's how Java packages do this already.
>
> Downloading pre-compiled binaries is something that shouldn't be done and
> is a whole can of worms (since those are not sources and it *is* specific to
> the platform, OS etc.) that is entirely separate, but worth a separate
> discussion. So I still don't see any use cases for actual sources. I do see
> a need for better specification of external dependencies which are not part
> of the package such that those can be satisfied automatically - but that's
> not the problem you asked about.
>
>
> >> That said, there are exceptions such as very large datasets (e.g., as
> distributed by Bioconductor) which are orders of magnitude larger than what
> is sustainable. I agree that it would be nice to have a mechanism for
> specifying such sources. So yes, I like the idea, but I'd like to see more
> real use cases to justify the effort.
> >
> > "More real use cases" like in "more use cases" or like in "the
> > previous ones are not real ones"? :)
> >
> >> The issue with any online downloads, though, is that there is no
> guarantee of availability - which is a real issue for reproducibility. So one
> could argue that if such external sources are required then they should be
> on a well-defined, independent, permanent storage such as Zenodo. This
> could be a matter of policy as opposed to the technical side above which
> would be adding such support to R CMD INSTALL.
> >
> > Not necessarily. If the package declares the additional sources in the
> > DESCRIPTION (probably with hashes), that's a big improvement over the
> > current state of things, in which basically we don't know what the
> > package tries to download, then it may fail, and finally there's no
> > guarantee that it's what the author intended in the first place.
> >
> > But on top of this, R could add a CMD to download those, and then some
> > lookaside storage could be used on CRAN. This is e.g. how RPM
> > packaging works: the spec declares all the sources, they are
> > downloaded once, hashed and stored in a lookaside cache. Then package
> > building doesn't need general Internet connectivity, just access to
> > the cache.
> >
>
> Sure, I fully agree that it would be a good first step, but I'm still
> waiting for examples ;).
>
> Cheers,
> Simon
>
>
> > Iñaki
> >
> >>
> >> Cheers,
> >> Simon
> >>
> >>
> >>> On Sep 24, 2022, at 3:22 AM, Iñaki Ucar <iucar using fedoraproject.org>
> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> I'd like to open this debate here, because IMO this is a big issue.
> >>> Many packages do this for various reasons, some more legitimate than
> >>> others, but I think that this shouldn't be allowed, because it
> >>> basically means that installation fails on a machine without Internet
> >>> access (which happens e.g. in Linux distro builders for security
> >>> reasons).
> >>>
> >>> Now, what if connections are suppressed during package load? There are
> >>> basically three use cases out there:
> >>>
> >>> (1) The package requires additional files for the installation (e.g.
> >>> the source code of an external library) that cannot be bundled into
> >>> the package due to CRAN restrictions (size).
> >>> (2) The package requires additional files for using it (e.g.,
> >>> datasets, a JAR...) that cannot be bundled into the package due to
> >>> CRAN restrictions (size).
> >>> (3) Other spurious reasons (e.g. the maintainer decided that package
> >>> load was a good place to check an online service availability, etc.).
> >>>
> >>> Again IMO, (3) shouldn't be allowed in any case; (2) should be a
> >>> separate function that the user actively calls to download the files,
> >>> and those files should be placed into the user dir, and (1) is the
> >>> only legitimate use, but then another mechanism should be provided to
> >>> avoid connections during package load.
> >>>
> >>> My proposal to support (1) would be to add a new field in the
> >>> DESCRIPTION, "Additional_sources", which would be a comma separated
> >>> list of additional resources to download during R CMD INSTALL. Those
> >>> sources would be downloaded by R CMD INSTALL if not provided via an
> >>> option (to support offline installations), and would be placed in a
> >>> predefined place for the package to find and configure them (via an
> >>> environment variable or in a predefined subdirectory).
> >>>
> >>> This proposal has several advantages. Apart from the obvious one
> >>> (Internet access during package load can be limited without losing
> >>> current functionalities), it gives more visibility to the resources
> >>> that packages are using during the installation phase, and thus makes
> >>> those installations more reproducible and more secure.
> >>>
> >>> Best,
> >>> --
> >>> Iñaki Úcar
> >>>
> >>> ______________________________________________
> >>> R-devel using r-project.org mailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >>>
> >>
> >
> >
> > --
> > Iñaki Úcar
> >
>
