[R] Potential Issue with lm.influence

Wed Apr 3 16:20:45 CEST 2019

Hey John,

Seems fair, and, I agree a more explicit or clear (ie, giving users
indications as to why/when the lm.influence is going to misfit the data)
warning makes sense in context.

Sincerely,
Eric

On Wed, Apr 3, 2019 at 10:18 AM Fox, John <jfox using mcmaster.ca> wrote:

> Dear Eric,
>
> I'm afraid that your argument doesn't make sense to me. As you saw when
> you tried
>
>         fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn ")))
>
> glm.nb() effectively wasn't able to estimate the theta parameter of the
> negative binomial model. So why would it be better to base deletion
> diagnostics on actually refitting the model?
>
> The lesson to me here is that if you fit a sufficiently unreasonable model
> to data, the computations may break down. Other than drawing attention to
> the NaN with an explicit warning, I don't see what more could usefully be
> done.
>
> Best,
>  John
>
> > On Apr 2, 2019, at 9:08 PM, Eric Bridgeford <ericwb95 using gmail.com> wrote:
> >
> > Hey John,
> >
> > I am aware they are high leverage points, and that the model is not the
> > best for them. The purpose of this dataset was to explore high leverage
> > points, and diagnostic statistics through which one would identify them.
> >
> > What I am saying is that the current behavior of the function seems a
> > little non-specific to me; the influence for this problem is
> > finite/computable manually by fitting n models to n-1 points (manually
> > holding out each point individually to obtain the loo-variance, and
> > computing the influence in the non-approximate way).
> >
> > I am just suggesting that it seems the function could be improved by,
> say,
> > throwing specific warnings when NaNs may arise. Ie, "Your have points
> that
> > are very high leverage. The approximation technique is not numerically
> > stable for these points and the results should be used with caution"
> > etc...; I am sure there are other also pre-hoc approaches to diagnose
> other
> > ways in which this function could fail). The approximation technique not
> > behaving well for points that are ultra high leverage just seems peculiar
> > that that would return an NaN with no other
> recommendations/advice/specific
> > warnings, especially since the influence is frequently used to diagnosing
> > this specific issue.
> >
> > Alternatively, one could afford an optional argument type="manual" that
> > computes the held-out variance manually rather than the approximate
> > fashion, and add a comment to use this in the help menu when you have
> high
> > leverage points (this is what I ended up doing to obtain the true
> influence
> > and the externally studentized residual).
> >
> > I just think some more specificity could be of use for future users, to
> > make the R:stats community even better :) Does that make sense?
> >
> > Sincerely,
> > Eric
> >
> > On Tue, Apr 2, 2019 at 7:53 PM Fox, John <jfox using mcmaster.ca> wrote:
> >
> >> Dear Eric,
> >>
> >> Have you looked at your data? -- for example:
> >>
> >>        plot(log(Moons) ~ Volume, data = moon_data)
> >>        text(log(Moons) ~ Volume, data = moon_data, labels=Name, adj=1,
> >> subset = Volume > 400)
> >>
> >> The negative-binomial model doesn't look reasonable, does it?
> >>
> >> After you eliminate Jupiter there's one very high leverage point left,
> >> Saturn. Computing studentized residuals entails an approximation to
> >> deleting that as well from the model, so try fitting
> >>
> >>        fit3 <- update(fit, subset = !(Name %in% c("Jupiter ", "Saturn
> ")))
> >>        summary(fit3)
> >>
> >> which runs into numeric difficulties.
> >>
> >> Then look at:
> >>
> >>        plot(log(Moons) ~ Volume, data = moon_data, subset = Volume <
> 400)
> >>
> >> Finally, try
> >>
> >>        plot(log(Moons) ~ log(Volume), data = moon_data)
> >>        fit4 <- update(fit2, . ~ log(Volume))
> >>        rstudent(fit4)
> >>
> >> I hope this helps,
> >> John
> >>
> >> -----------------------------------------------------------------
> >> John Fox
> >> Professor Emeritus
> >> McMaster University
> >> Hamilton, Ontario, Canada
> >> Web: https://socialsciences.mcmaster.ca/jfox/
> >>
> >>
> >>
> >>
> >>> -----Original Message-----
> >>> From: R-help [mailto:r-help-bounces using r-project.org] On Behalf Of Eric
> >>> Bridgeford
> >>> Sent: Tuesday, April 2, 2019 5:01 PM
> >>> To: Bert Gunter <bgunter.4567 using gmail.com>
> >>> Cc: R-help <r-help using r-project.org>
> >>> Subject: Re: [R] Fwd: Potential Issue with lm.influence
> >>>
> >>> I agree the influence documentation suggests NaNs may result; however,
> as
> >>> these can be manually computed and are, indeed, finite/existing (ie,
> >>> computing the held-out influence by manually training n models for n
> >> points
> >>> to obtain n leave one out influence measures), I don't possibly see how
> >> the
> >>> function SHOULD return NaN, and given that it is returning NaN, that
> >>> suggests to me that there should be either a) Providing an alternative
> >>> method to compute them that (may be slower) that returns the correct
> >>> results in the even that lm.influence does not return a good
> >> approximation
> >>> (ie, a command line argument for type="approx" that does the
> >>> approximation strategy employed currently, or an alternative
> >> type="direct"
> >>> or something like that that computes them manually), or b) a heuristic
> to
> >>> suggest why NaNs might result from one's particular inputs/what can be
> >>> done to fix it (if the approximation strategy is the source of the
> >> problem) or
> >>> what the issue is with the data that will cause NaNs. Hence I was
> >> looking to
> >>> start a discussion around the specific strategy employed to compute the
> >>> elements.
> >>>
> >>> Below is the code:
> >>> moon_data <- structure(list(Name = structure(c(8L, 13L, 2L, 7L, 1L, 5L,
> >> 11L,
> >>>                                               12L, 9L, 10L, 4L, 6L,
> >> 3L), .Label = c("Ceres ", "Earth",
> >>> "Eris ",
> >>>
> >>>         "Haumea ", "Jupiter ", "Makemake ", "Mars ", "Mercury ",
> >> "Neptune ",
> >>>
> >>>         "Pluto ", "Saturn ", "Uranus ", "Venus "), class = "factor"),
> >>>                            Distance = c(0.39, 0.72, 1, 1.52, 2.75, 5.2,
> >> 9.54, 19.22,
> >>>                                         30.06, 39.5, 43.35, 45.8,
> >> 67.7), Diameter = c(0.382, 0.949,
> >>>
> >>>           1, 0.532, 0.08, 11.209, 9.449, 4.007, 3.883, 0.18, 0.15,
> >>>
> >>>           0.12, 0.19), Mass = c(0.06, 0.82, 1, 0.11, 2e-04, 317.8,
> >>>
> >>>                                 95.2, 14.6, 17.2, 0.0022, 7e-04, 7e-04,
> >> 0.0025), Moons = c(0L,
> >>>
> >>>
> >>>                0L, 1L, 2L, 0L, 64L, 62L, 27L, 13L, 4L, 2L, 0L, 1L),
> >> Volume =
> >>> c(0.0291869497930152,
> >>>
> >>>
> >>>
> >>>    0.447504348276571, 0.523598775598299, 0.0788376225681443,
> >>>
> >>>
> >>>
> >>>    0.000268082573106329, 737.393372232996, 441.729261571372,
> >>>
> >>>
> >>>
> >>>    33.6865588825666, 30.6549628355953, 0.00305362805928928,
> >>>
> >>>
> >>>
> >>>    0.00176714586764426, 0.00090477868423386, 0.00359136400182873
> >>>
> >>>
> >>>                )), row.names = c(NA, -13L), class = "data.frame")
> >>>
> >>> fit <- glm.nb(Moons ~ Volume, data = moon_data)
> >>> rstudent(fit)
> >>>
> >>> fit2 <- update(fit, subset = Name != "Jupiter ")
> >>> rstudent(fit2)
> >>>
> >>> influence(fit2)$sigma
> >>>
> >>> #        1        2        3        4        5        7        8
>   9
> >>>     10       11       12       13
> >>> # 1.077945 1.077813 1.165025 1.181685 1.077954      NaN 1.044454
> 1.152110
> >>> 1.187586 1.181696 1.077954 1.165147
> >>>
> >>> Sincerely,
> >>> Eric
> >>>
> >>> On Tue, Apr 2, 2019 at 4:38 PM Bert Gunter <bgunter.4567 using gmail.com>
> >>> wrote:
> >>>
> >>>> Also, I suggest you read ?influence which may explain the source of
> >>>> your NaN's .
> >>>>
> >>>> Bert Gunter
> >>>>
> >>>> "The trouble with having an open mind is that people keep coming along
> >>>> and sticking things into it."
> >>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>
> >>>>
> >>>> On Tue, Apr 2, 2019 at 1:29 PM Bert Gunter <bgunter.4567 using gmail.com>
> >>> wrote:
> >>>>
> >>>>> I told you already: **Include code inline **
> >>>>>
> >>>>> See ?dput for how to include a text version of objects, such as data
> >>>>> frames, inline.
> >>>>>
> >>>>> Otherwise, I believe .txt text files are not stripped if you insist
> >>>>> on
> >>>>> *attaching* data or code. Others may have better advice.
> >>>>>
> >>>>>
> >>>>> Bert Gunter
> >>>>>
> >>>>> "The trouble with having an open mind is that people keep coming
> >>>>> along and sticking things into it."
> >>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>
> >>>>>
> >>>>> On Tue, Apr 2, 2019 at 1:21 PM Eric Bridgeford <ericwb95 using gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> How can I add attachments? The following two files were attached in
> >>>>>> the initial message
> >>>>>>
> >>>>>> On Tue, Apr 2, 2019 at 3:34 PM Bert Gunter <bgunter.4567 using gmail.com>
> >>>>>> wrote:
> >>>>>>
> >>>>>>> Nothing was attached. The r-help server strips most attachments.
> >>>>>>> Include your code inline.
> >>>>>>>
> >>>>>>> Also note that
> >>>>>>>
> >>>>>>>> 0/0
> >>>>>>> [1] NaN
> >>>>>>>
> >>>>>>> so maybe something like that occurs in the course of your
> >> calculations.
> >>>>>>> But that's just a guess, so feel free to disregard.
> >>>>>>>
> >>>>>>>
> >>>>>>> Bert Gunter
> >>>>>>>
> >>>>>>> "The trouble with having an open mind is that people keep coming
> >>>>>>> along and sticking things into it."
> >>>>>>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
> >>>>>>>
> >>>>>>>
> >>>>>>> On Tue, Apr 2, 2019 at 11:32 AM Eric Bridgeford
> >>>>>>> <ericwb95 using gmail.com>
> >>>>>>> wrote:
> >>>>>>>
> >>>>>>>> Hi R core team,
> >>>>>>>>
> >>>>>>>> I experienced the following issue with the attached data/code
> >>>>>>>> snippet, where the studentized residual for a single observation
> >>>>>>>> appears to be NaN given finite predictors/responses, which appears
> >>>>>>>> to be driven by the glm.influence method in the stats package. I
> >>>>>>>> am curious to whether this is a consequence of the specific
> >>>>>>>> implementation used for computing the influence, which it would
> >>>>>>>> appear is the driving force for the NaN influence for the point,
> >>>>>>>> that I was ultimately able to trace back through the lm.influence
> >>>>>>>> method to this specific line <
> >>>>>>>> https://github.com/SurajGupta/r-
> >>> source/blob/a28e609e72ed7c47f6ddfb
> >>>>>>>> b86c85279a0750f0b7/src/library/stats/R/lm.influence.R#L67
> >>>>>>>>>
> >>>>>>>> which
> >>>>>>>> calls C code which calls iminfl.f
> >>>>>>>> <
> >>>>>>>>
> https://github.com/SurajGupta/r-source/blob/master/src/library/sta
> >>>>>>>> ts/src/lminfl.f
> >>>>>>>>>
> >>>>>>>> (I
> >>>>>>>> don't know fortran so I can't debug further). My understanding is
> >>>>>>>> that the specific issue would have to do with the leave-one-out
> >>>>>>>> variance estimate associated with this particular point, which it
> >>>>>>>> seems based on my understanding should be finite given finite
> >>>>>>>> predictors/responses. Let me know. Thanks!
> >>>>>>>>
> >>>>>>>> Sincerely,
> >>>>>>>>
> >>>>>>>> --
> >>>>>>>> Eric Bridgeford
> >>>>>>>> ericwb.me
> >>>>>>>> ______________________________________________
> >>>>>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>>>>>> PLEASE do read the posting guide
> >>>>>>>> http://www.R-project.org/posting-guide.html
> >>>>>>>> and provide commented, minimal, self-contained, reproducible code.
> >>>>>>>>
> >>>>>>>
> >>>>>>
> >>>>>> --
> >>>>>> Eric Bridgeford
> >>>>>> ericwb.me
> >>>>>>
> >>>>>
> >>>
> >>> --
> >>> Eric Bridgeford
> >>> ericwb.me
> >>>
> >>>      [[alternative HTML version deleted]]
> >>>
> >>> ______________________________________________
> >>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> >>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>> PLEASE do read the posting guide http://www.R-project.org/posting-
> >>> guide.html
> >>> and provide commented, minimal, self-contained, reproducible code.
> >>
> >
> >
> > --
> > Eric Bridgeford
> > ericwb.me
> >
> >       [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
>
>

-- 
Eric Bridgeford
ericwb.me

	[[alternative HTML version deleted]]