[R] {car} outlierTest looses p/q values

John Fox jfox at mcmaster.ca
Tue Oct 21 15:53:11 CEST 2014


Dear Phil,

I'll bypass questions about why one would want to do this, why you're using glm() rather than lm(), etc., and just point out that the *studentized* residual for the 11th observation is undefined for your example. Simplifying your code:

-------- snip -------

> y <- c(1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,13.98) 
> x1 <- c(4.6, 35.8, 4.1, 5.4, 5.2, 27.9, 48.2, 4.5, 5.7, 4.2, 100.5)
> x2 <- c(0.2990466, 1.0497156, 0.2805028, 0.3517993, 0.3543678, 1.0178604, 
+         0.4933182, 0.2810865, 0.4349550, 0.3269192, 3.0889747)

> mod.glm <- glm(y ~ x1 + x2)

> rstudent(mod.glm)
           1            2            3            4            5            6            7            8            9 
 0.459064418 -2.342404171  0.520003419  0.284592752  0.274181980 -2.283387024  1.236226216  0.521360367  0.007045656 
          10           11 
 0.359359609          NaN 

> residuals(mod.glm)
          1           2           3           4           5           6           7           8           9          10 
 0.66006969 -2.58163338  0.74389676  0.41298738  0.39764981 -2.53508626  0.31282392  0.74657662  0.01020209  0.51813375 
         11 
 1.31437963 

------------- snip ----------

Best,
 John

On Tue, 21 Oct 2014 09:34:07 +0200
 Phil <chobophil at gmail.com> wrote:
> Hi all,
> 
> as John pointed out, there is a way to create settings where the studentized residuals are undefined. However, after cross-checking it seems that the residuals are getting calculated without any error. The problem comes up when I use outlierTest to assign a p,q value,respectively.
> 
> Below is the code and some printouts:
> 
> 
> non.zero.orga.vector <- c(1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,1.00,13.98) #targervector
> 
> print(meta.Matrix[,3])
>   x1         x28          x2      x3      x4         x51     x52     x5 
>      x6      x7     x82
>       4.6     35.8      4.1      5.4      5.2     27.9 48.2      4.5    
> 5.7    4.2    100.5
> 
> print(meta.Matrix[,10])
>       x1                      x28               x2     x3
> 0.2990466     1.0497156     0.2805028 0.3517993
> 
>        x4          x51                x52                  x5
> 0.3543678 1.0178604 0.4933182 0.2810865
> 
>      x6                   x7              x82
> 0.4349550 0.3269192 3.0889747
> 
> glModel <- glm(non.zero.orga.vector ~ meta.Matrix[,3]+meta.Matrix[,10]) #create glm for testing
> 
> print(glModel$residuals) #check if residuals are calculated for every entry in the target vector
>    x1                   x28                x2                x3         
>      x4              x51           x52
>   0.6600696 -2.5816334  0.7438969  0.4129873  0.3976498 -2.5350861 
> 0.3128240
>         x5            x6                    x7           x82
>   0.7465766  0.0102020  0.5181337  1.3143796
> 
> 
> 
>   test.Res <- outlierTest(glModel,digits=4,cutoff=Inf,n.max=Inf) #test 
> the glm for outlier, cutoff=Inf/n.max=Inf == report everything
> 
> print(test.Res$bonf.p) #check if q-value exists
>       x28              x51              x52           x5     x2       
>          x1               x7           x3
> 0.1915995 0.2240759 2.1637438 6.0211575 6.0306110 6.4618792 7.1932613 7.7595619
>        x4               x6
> 7.8394477 9.9437848
> 
> 
> 
> As you see the glm calculates residuals for x82 (which is in fact 1.314) but the outlierTest does not assign a p/q value to it. Does anyone know why?
> 
> Thanks in advance,
> 
> Phil
> 
> On 10/17/2014 08:54 PM, John Fox wrote:
> > Dear Phil,
> >
> > Yes, that's a bit clearer. One can invent data configurations where certain studentized residuals are undefined. For example, try the following:
> >
> > y <- c(0, 0, 0, 0, 0, 1)
> > x <- 1:6
> > xx <- (1:6 - 3.5)^2
> > rstudent(lm(y ~ x))
> > rstudent(lm(y ~ xx))
> > plot(x, y)
> > plot(xx, y)
> >
> > The plots should clarify what's going on.
> >
> > I'm copying to r-help since the discussion began there.
> >
> > I hope this helps,
> >   John
> >
> > On Fri, 17 Oct 2014 18:35:59 +0200
> >   Philip Stevens <chobophil at gmail.com> wrote:
> >> Dear John,
> >>
> >> Thank you for your fast reply. Unfortunately I am out of office right now
> >> but I will get back to you on Monday asap with a toy example and some code.
> >> Meanwhile let me try to explain further.
> >>
> >> Basically not the glm but the outlierTest afterwards fails. And it only
> >> fails if all values, used to set up the glm, are exactly 1 except for one
> >> value which can be arbitrary large. I construct the glm from a vector of
> >> doubles (the target values) and a vector of other numeric values(metadata).
> >> (So this should be fit <- glm(targetVector ~ metadata) ) And want to check
> >> if one of those doubles(in the target vector) is an outlier using
> >> outlierTest(fit). The residuals are calculated by the glm but the
> >> outlierTest does not report a p nor a q value for the one value in the
> >> target vector which is not 1. And I can't figure out why...
> >>
> >> I hope this makes it a bit clearer.
> >> Anyways I will come back to you on Monday.
> >>
> >> Best,
> >>
> >> Phil
> >> Am 17.10.2014 17:43 schrieb "John Fox" <jfox at mcmaster.ca>:
> >>
> >>> Dear Phil,
> >>>
> >>> After reading your posting several times, I still don't understand what
> >>> you did. As usual, having a reproducible example illustrating the error
> >>> would be a great help. I do have a guess about the source of the error:
> >>> glm() failed in some way for the problematic case.
> >>>
> >>> Best,
> >>>   John
> >>>
> >>> ------------------------------------------------
> >>> John Fox, Professor
> >>> McMaster University
> >>> Hamilton, Ontario, Canada
> >>> http://socserv.mcmaster.ca/jfox/
> >>>
> >>>
> >>> On Fri, 17 Oct 2014 12:53:30 +0200
> >>>   Phil <chobophil at gmail.com> wrote:
> >>>> Hi guys,
> >>>>
> >>>> I came across a strange phenomena and can't figure out why it happens by
> >>> myself so here we go.
> >>>> I got a dataframe which consists of double numbers which I want to
> >>> check, row-wise if there are outliers in the rows.
> >>>> So I iterate over the rows and create a glm using the numbers of that
> >>> particular row. Which might look like this:
> >>>> case1)
> >>>>           x1        x2        x3        x4        x5        x6 x7
> >>>> x8        x9        x10        x11
> >>>>       0.00     3.91     0.00     0.00     0.00   68.03   40.39 0.00
> >>>> 0.00      0.00       4.11
> >>>>
> >>>> or like this:
> >>>> case2)
> >>>>           x1        x2        x3        x4        x5        x6 x7
> >>>> x8        x9        x10        x11
> >>>>        1.00     1.00    1.00     1.00     1.00     1.00     1.00 1.00
> >>>> 1.00     1.00      5.34
> >>>>
> >>>> or any other combination of double numbers...
> >>>>
> >>>> however, using a glm like this:
> >>>>
> >>>> glModel <- glm(vector ~ some_other_meta_data_which_is_double_numbers)
> >>>>
> >>>> and testing it with:
> >>>>
> >>>>    test.Res <- outlierTest(glModel,digits=4,cutoff=Inf,n.max=Inf)
> >>>>
> >>>> I always get a result consisting of the desired p and q values but not
> >>> if the vector I use looks like case2. There is no error message and the
> >>> computation does not stop either.
> >>>> However, all p and q values are produced except for the last value x11.
> >>>>
> >>>> Any idea why this particular value gets dropped from the output of the
> >>> outlierTest Method in the car package.
> >>>> Here is the sessioninfo:
> >>>>
> >>>>    sessionInfo()
> >>>> R version 3.1.1 (2014-07-10)
> >>>> Platform: x86_64-redhat-linux-gnu (64-bit)
> >>>>
> >>>> locale:
> >>>>    [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
> >>>>    [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
> >>>>    [5] LC_MONETARY=en_US.utf8    LC_MESSAGES=en_US.utf8
> >>>>    [7] LC_PAPER=en_US.utf8       LC_NAME=C
> >>>>    [9] LC_ADDRESS=C              LC_TELEPHONE=C
> >>>> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
> >>>>
> >>>> attached base packages:
> >>>> [1] stats     graphics  grDevices utils     datasets  methods base
> >>>>
> >>>> other attached packages:
> >>>> [1] ggplot2_1.0.0      car_2.0-21         RColorBrewer_1.0-5 iNEXT_1.0
> >>>> [5] vegan_2.0-10       lattice_0.20-29    permute_0.8-3
> >>>>
> >>>> loaded via a namespace (and not attached):
> >>>>    [1] colorspace_1.2-4 compiler_3.1.1   digest_0.6.4 grid_3.1.1
> >>>>    [5] gtable_0.1.2     labeling_0.3     MASS_7.3-33 munsell_0.4.2
> >>>>    [9] nnet_7.3-8       plyr_1.8.1       proto_0.3-10 Rcpp_0.11.2
> >>>> [13] reshape2_1.4     scales_0.2.4     stringr_0.6.2    tools_3.1.1
> >>>>
> >>>> Any help is highly appreciated.
> >>>>
> >>>> Thanks
> >>>>
> >>>> Phil
> >>>>
> >>>> ______________________________________________
> >>>> R-help at r-project.org mailing list
> >>>> https://stat.ethz.ch/mailman/listinfo/r-help
> >>>> PLEASE do read the posting guide
> >>> http://www.R-project.org/posting-guide.html
> >>>> and provide commented, minimal, self-contained, reproducible code.
> >>>
> >>>
> >>>
> >>>
> > ------------------------------------------------
> > John Fox, Professor
> > McMaster University
> > Hamilton, Ontario, Canada
> > http://socserv.mcmaster.ca/jfox/
> > 	
> > 	
> >
> 
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

------------------------------------------------
John Fox, Professor
McMaster University
Hamilton, Ontario, Canada
http://socserv.mcmaster.ca/jfox/



More information about the R-help mailing list