[Rd] I have corrected a dead link in the treering documentation

Sat Sep 2 23:37:48 CEST 2017

>>>>> Thomas Levine <_ at thomaslevine.com>
>>>>>     on Fri, 1 Sep 2017 13:23:47 +0000 writes:

    > Martin Maechler writes:
    >> There may be one small problem: IIUC, the wayback machine
    >> is a +- private endeavor and really great and fantastic
    >> but it does need (US? tax deductible) donations,
    >> https://archive.org/donate/, to continue thriving.  This
    >> makes me hesitate a bit to link to it within the "base R"
    >> documentation.  But that may be wrong -- and I should
    >> really use it to *help* the project ?

    > I agree that the Wayback Machine is a private
    > endeavor. After reviewing other base library
    > documentation, I have concluded that it would regardless
    > be consistent with current practice to reference it in the
    > base documentation.

    > I share your concern regarding the support of other
    > institutions, and I have found some references that are
    > more problematic to me than the one of present interest. I
    > would thus support an initiative to consider the social
    > implications of the different references and to adjust the
    > references accordingly.

    > Below I start by making a distinction between two types of
    > references that I think should be treated differently in
    > terms of your concern.  Next, I assess whether there is a
    > precedent for inclusion of references to private
    > publishers, as in the present patch; I conclude that there
    > is such a precedent. Then I present my opinion regarding
    > the present patch. Finally, I present some other
    > considerations that I find relevant to the discussion.

    > Distinguishing between two link types
    > -------------------------------------
    > For discussion of this issue, I think it is helpful to
    > distinguish between references to sources and references
    > to other materials.

    > In the case of references to sources, there is little
    > choice but to reference the publisher, even though the
    > overwhelming majority of referenced publishers are private
    > companies that impose restrictive licenses on their
    > journals and books and cannot be reasonably trusted to
    > maintain access to the materials nor availability of
    > webpages.

    > With other references, it is possible to replace the
    > reference with a different document that contains similar
    > information.

    > For example, if a function implements a method based on a
    > particular journal article, that article's citation needs
    > to stay, even if the journal is published by a private
    > institution. On the other hand, if the reference just
    > provides context or suggestions related to usage, then the
    > reference is provided just as information and can be
    > replaced.

    > Precedent for inclusion of private non-source materials
    > -------------------------------------------------------
    > The dead link of interest is only informational, not a
    > citation of a source, and so it could be replaced. So I
    > assessed whether it would match current practice to
    > include it, and I concluded that there is substantial
    > precedent for inclusion of private reference materials
    > other than strict sources. Not having access to a good
    > library at the moment, I have limited my research on this
    > matter to website references.

    > In SVN revision 73164, \url calls are distributed among
    > 148 files, from 1 call to 13 calls per file, with mean of
    > 1.75 and median of 1.

    >   grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n
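
A minimal sketch, not part of the original message, of how the summary figures quoted above (number of files, range, mean, median) could be derived from that pipeline's output. The helper name url_summary is hypothetical, and it assumes "count filename" lines that are already sorted by count.

```shell
# Hypothetical helper: summarise per-file \url counts read from stdin.
# Expects "count filename" lines sorted by count, as written by
# `uniq -c | sort -n`.
#
# Usage, from the top of an R source checkout:
#   grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -n | url_summary
url_summary() {
  awk '{ c[NR] = $1; sum += $1 }
       END {
         if (NR == 0) exit 1    # nothing to summarise
         # Median of the sorted counts: middle value, or mean of the two middle values.
         median = (NR % 2) ? c[(NR + 1) / 2] : (c[NR / 2] + c[NR / 2 + 1]) / 2
         printf "files=%d min=%d max=%d mean=%.2f median=%g\n", NR, c[1], c[NR], sum / NR, median
       }'
}
```

The counts are kept in an array only so that the median can be read off after all lines have been seen; the mean needs just the running sum.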

    > Total number of library documentation files is 1419.

    >   find src/library/ -name \*.Rd | wc -l

    > I randomly selected 20 matching files for further study.

    >   % grep '\\url' src/library/*/*/*.Rd | cut -d: -f1 | uniq -c | sort -R | head -n 20 | tee /tmp/rd
    >   2 src/library/grDevices/man/pdf.Rd
    >   1 src/library/base/man/taskCallbackNames.Rd
    >   1 src/library/stats/man/shapiro.test.Rd
    >   1 src/library/tcltk/man/TkWidgets.Rd
    >   2 src/library/graphics/man/assocplot.Rd
    >   1 src/library/base/man/sprintf.Rd
    >   6 src/library/base/man/regex.Rd
    >   3 src/library/datasets/man/HairEyeColor.Rd
    >   1 src/library/stats/man/optimize.Rd
    >   1 src/library/datasets/man/UKDriverDeaths.Rd
    >   1 src/library/utils/man/object.size.Rd
    >   1 src/library/utils/man/unzip.Rd
    >   1 src/library/base/man/dcf.Rd
    >   1 src/library/base/man/DateTimeClasses.Rd
    >   3 src/library/stats/man/GammaDist.Rd
    >   2 src/library/utils/man/maintainer.Rd
    >   2 src/library/base/man/libcurlVersion.Rd
    >   2 src/library/base/man/eigen.Rd
    >   2 src/library/base/man/chol2inv.Rd
    >   1 src/library/tools/man/update_pkg_po.Rd

    > From these 20 I composed a table with statistical unit of
    > \url call and with variables filename, url, type of
    > reference, and type of publisher.  The following commands
    > were helpful.

    >   sed -e 's/^[ 0-9]*//' /tmp/rd | xargs grep \\\\url | sed -e 's/$/::/' -e 's/:.*\\url./:/' > urls.csv
    >   sed 's/^[ 0-9]*//' /tmp/rd | xargs grep -A5 -B5 \\\\url | less

    > I realized that I need to be a bit more precise about what
    > I mean by a "source". I wound up grouping the type of
    > reference for \url calls into the following categories.

    > 1 Necessary sources, such as the specific file from which
    >   an algorithm or dataset was copied (as in
    >   stats/man/optimize.Rd)
    > 2 Upstream documentation for bound libraries
    >   (tcltk/man/TkWidgets.Rd)
    > 3 Extra information, such as tutorials on portable
    >   programming referenced in base/man/sprintf.Rd
    > 4 Ambiguous, such as a general introduction on the topic
    >   that may have been used during the development of the
    >   function or may have been added just as further
    >   documentation (as in grDevices/man/pdf.Rd).  These
    >   references did not include the date on which the webpage
    >   was accessed, so they aren't clear enough to count as
    >   source references even if they were in fact used during
    >   the development of the function.
    > 5 Comments (stats/man/shapiro.test.Rd) and duplicates
    >   (stats/man/GammaDist.Rd)

    > Earlier, I distinguished between references to sources and
    > references to other materials. I think that the first and
    > second categories should be considered the source type
    > references and the third and fourth should be considered
    > the non-source type references.

    > I separated publisher types into the following

    > * academic (I think that they were all public
    >   universities, but I did not check very thoroughly.)
    > * government
    > * private
    > * R project

    > Resulting categorization was as follows (attached urls.csv
    > and urls.r).

    >              publisher
    > source        academic government private r-project
    >   1 necessary        8          0       5         3
    >   2 upstream         3          0       6         0
    >   3 extra            0          0       1         1
    >   4 ambiguous        0          1       4         0
    >   5 ignore           0          1       1         1
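
As a check on the arithmetic, here is a small sketch (not from the original message) that sums the table and counts the type 3 and type 4 references with private publishers. The function name is hypothetical; the numbers are copied from the table above, one row per reference type, with columns academic, government, private and r-project.

```shell
# Hypothetical helper: recompute totals from the categorization table.
# Rows: reference type, then counts for academic, government, private, r-project.
replaceable_private_share() {
  awk '
    { total += $2 + $3 + $4 + $5 }                        # every \url call in the sample
    $1 == "extra" || $1 == "ambiguous" { hits += $4 }     # types 3 and 4, private column
    END { printf "%d of %d\n", hits, total }
  ' <<'EOF'
necessary 8 0 5 3
upstream 3 0 6 0
extra 0 0 1 1
ambiguous 0 1 4 0
ignore 0 1 1 1
EOF
}

replaceable_private_share
```

This prints "5 of 35": five replaceable references to private publishers among the 35 sampled \url calls.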

    > The references of concern are the replaceable sources
    > (types 3 and 4) to private publishers, which account for 5
    > out of the 35 \url calls and 5 of the 20 files. To fit the
    > table in an email, I have truncated the URLs to their
    > domains.

    >   filename                              domain
    >   src/library/grDevices/man/pdf.Rd      en.wikipedia.org
    >   src/library/base/man/sprintf.Rd       developer.r-project.org
    >   src/library/base/man/regex.Rd         perldoc.perl.org
    >   src/library/utils/man/object.size.Rd  en.wikipedia.org
    >   src/library/stats/man/GammaDist.Rd    en.wikipedia.org

    > Note that the sprintf documentation is in fact a link to
    > an r-project page
    > (https://developer.r-project.org/Portability.html) that
    > has lots of other links on it, including links to
    > fortran.com, en.wikipedia.org, pubs.opengroup.org, and
    > people.redhat.com.

    > So we see that several \url calls reference private
    > publishers even though the links could be replaced with
    > alternatives. By my categorization, 5 out of 20 sampled
    > files (95% confidence interval of 9 files to 65 files of
    > the population of 148 matching files, based on a bespoke
    > t-test with finite population correction because I had
    > trouble compiling the sampling package, see urls.r)
    > include a replaceable reference to a private publisher.
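
The attached urls.r is not reproduced here, so the following is only a sketch of one standard way to obtain such an interval: a t interval for a sample proportion, with a finite population correction, scaled up to the population size. The helper name fpc_interval and the use of n - 1 in the variance are assumptions, and the t quantile for 19 degrees of freedom (about 2.093) is passed in by hand from a table.

```shell
# Hypothetical helper: confidence interval for a population count,
# estimated from a simple random sample drawn without replacement.
# Arguments: hits, sample size n, population size N, t quantile.
fpc_interval() {
  awk -v x="$1" -v n="$2" -v N="$3" -v t="$4" 'BEGIN {
    p  = x / n                          # sample proportion
    se = sqrt(p * (1 - p) / (n - 1))    # standard error of the proportion (assumed form)
    se = se * sqrt(1 - n / N)           # finite population correction
    # Scale the interval for the proportion up to a number of files.
    printf "%.1f %.1f\n", (p - t * se) * N, (p + t * se) * N
  }'
}

# 5 of 20 sampled files, 148 matching files, t quantile for 19 df (assumed 2.093)
fpc_interval 5 20 148 2.093
```

With these inputs it prints 8.4 and 65.6 files, in line with the interval quoted above. Dividing by n instead of n - 1 gives a slightly narrower interval that rounds to the same figures.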

    > I briefly looked through the full population of Rd files,
    > and I got the impression that this sort of private
    > reference may be restricted to just a few publishers,
    > with Wikimedia possibly being the most prominent.

    > To summarize, several other documentation files already
    > reference private publishers, and the set of publishers is
    > small enough that it would be feasible to review each
    > publisher in order to consider whether the references to it
    > should be replaced with alternatives.

    > Opinion regarding the present patch
    > -----------------------------------
    > I think that linking to the Wayback Machine, by the
    > Internet Archive, is consistent with the practice in many
    > other base libraries and that it is thus acceptable.

    > At present, base makes no references to the Wayback
    > Machine but makes several references to English
    > Wikipedia. An even more consistent option is thus to link
    > to the English Wikipedia article for Great Basin
    > bristlecone pine
    > (https://en.wikipedia.org/wiki/Pinus_longaeva) or for
    > Methuselah
    > (https://en.wikipedia.org/wiki/Methuselah_(tree)) instead
    > of the Wayback Machine page that I reference in the
    > patch. (Note that the treering data are from a tree in the
    > Methuselah Walk but not from Methuselah itself.)

    > On the other hand, if we are to avoid referencing private
    > institutions unnecessarily, we should create a broader
    > initiative to replace private non-source references in
    > base documentation. For me, more worrisome than references
    > to the Open Group or to Wikimedia are the references to
    > the private company GitHub, as in utils/man/tar.Rd; aside
    > from the social implications of supporting a private
    > company whose repository hosting service has been assessed
    > by the Free Software Foundation as unethical
    > (https://www.gnu.org/software/repo-criteria-evaluation.html),
    > I do not even trust in the long-term availability of its
    > webpages.

    > And of course, if we do not correct the dead link in
    > treering, I think we should remove the dead link. We can
    > optionally replace it with a very short description of
    > Great Basin bristlecone pines.

    > Further discussion
    > ------------------

    > RESTRICTION CRITERIA

    > If R is to have formal restrictions as to what sorts of
    > references may be included in the base documentation, I
    > think that private versus public is not an appropriate
    > criterion.

I was not at all aiming for formal restrictions... I'm sorry if
what I said looked like that.

One reason for not using internet archive links would be that
they are "somewhat costly" for the archive itself, because often
"serving the URL" may be a somewhat expensive data base access
.. and as I said they are not a big fat company with lots of
resources.

Anyway you've made your very good points,
and also for reason of efficiency on this list, I've committed
your proposed change ... with thanks to you.

Martin

    > To start, private universities may be similarly
    > acceptable to public universities, and certain government
    > institutions may be problematic. Also, most of the present
    > references to software specifications refer to private
    > institutions.  Considering the goals of the R project and
    > its status as a component of the GNU project, I think that
    > it would make more sense for the criteria to be based on
    > the license of the referenced work, rather than on
    > characteristics of the legal entity that has published it.

    > AVOIDING LINKS

    > For practical reasons, I think it would be nice to avoid
    > the sort of link that we are presently discussing and
    > instead to distribute the contents of that link. If the
    > contents are incorporated into R, then dead links are not
    > an issue, we are free to edit the extra documentation that
    > otherwise would have been linked, and users can view the
    > documentation without an internet connection. I think that
    > the datasets documentation, in particular, could benefit
    > substantially from a few sentences of context being added
    > to each documentation file.

    > That said, it is possible that this would be enough work
    > that it would not be worthwhile; this extra documentation
    > could easily become much larger than the rest of the R
    > source code, especially if images are included as in the
    > case of the Methuselah Walk photographs, so implementing
    > this would be more involved than simply obtaining
    > acceptable licenses on the extra documentation and copying
    > passages to Rd files.