[R] grubbs test to detect all outliers

Sat Apr 29 15:18:13 CEST 2023

At 14:01 on 29/04/2023, AbouEl-Makarim Aboueissa wrote:
> Hi Rui:
> 
> 
> How about this dataset? I included a few outliers in each column, as you
> can see in the printed dataset below.
> 
> 
> Once again, thank you very much, and sorry if I bothered you all.
> 
> abou
> 
> 
> 
>> dput(datafortest)
> structure(list(factor1 = structure(c(1L, 1L, 1L, 1L, 1L, 1L,
> 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L,
> 3L, 3L, NA, NA, NA, NA), levels = c("1", "2", "3"), class = "factor"),
>      X = c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
>      876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
>      2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
>      NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA), Y = c(76888L,
>      333L, 618L, 10L, 344L, NA, 3L, 86999L, 265L, 557L, 77777L,
>      383L, NA, NA, 87777L, 287L, 352L, 308L, 999526L, 489L, 2L,
>      444L, 9L, 333L, NA, NA, NA, NA), factor2 = structure(c(1L,
>      1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L,
>      2L, 2L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), levels = c("1",
>      "2", "3"), class = "factor"), Z = c(54999L, 475L, 15L, 603L,
>      442L, 79486L, 927L, 971L, 388L, 888L, 514L, 409L, 546L, 523L,
>      313L, 296L, 320L, 388L, 79999L, 677L, 555L, NA, 479L, 257L,
>      313L, 21L, 320L, 4L), U = c(NA, NA, 1.5, 332, 216, 217, 1000,
>      10, 9999, 444, NA, 5, 327, 58888, 456, 412, 251, 6, 398,
>      438, 428, 15, NA, 406, 334, 465, 180, 88999), V = c(12, 240,
>      9000, 265, NA, 99999, 1, 562, 13, 777, 322, NA, 99988, 653,
>      450, 576, NA, 396.5, 91888, 5, 219, NA, 321, 417, 409, 999999,
>      523, 10)), row.names = c(NA, -28L), class = "data.frame")
>>
> 
> 
> 
>> datafortest
>     factor1          X      Y factor2     Z       U        V
> 1        1 994455.077  76888       1 54999      NA     12.0
> 2        1   4348.031    333       1   475      NA    240.0
> 3        1   9999.789    618       1    15     1.5   9000.0
> 4        1   3813.139     10       1   603   332.0    265.0
> 5        1     12.650    344       1   442   216.0       NA
> 6        1   5642.667     NA       1 79486   217.0  99999.0
> 7        1 876684.386      3       1   927  1000.0      1.0
> 8        2   5165.731  86999       1   971    10.0    562.0
> 9        2         NA    265       1   388  9999.0     13.0
> 10       2   3259.241    557       2   888   444.0    777.0
> 11       2      8.383  77777       2   514      NA    322.0
> 12       2   1997.878    383       2   409     5.0       NA
> 13       2  99990.608     NA       2   546   327.0  99988.0
> 14       2   2655.977     NA       2   523 58888.0    653.0
> 15       3      9.490  87777       2   313   456.0    450.0
> 16       3   1826.851    287       2   296   412.0    576.0
> 17       3   4386.002    352       2   320   251.0       NA
> 18       3 883295.091    308       2   388     6.0    396.5
> 19       3   2120.902 999526       3 79999   398.0  91888.0
> 20       3         NA    489       3   677   438.0      5.0
> 21       3   2056.123      2       3   555   428.0    219.0
> 22       3      5.088    444       3    NA    15.0       NA
> 23       3         NA      9       3   479      NA    321.0
> 24       3  92539.873    333       3   257   406.0    417.0
> 25    <NA>         NA     NA       3   313   334.0    409.0
> 26    <NA>         NA     NA       3    21   465.0 999999.0
> 27    <NA>         NA     NA       3   320   180.0    523.0
> 28    <NA>         NA     NA       3     4 88999.0     10.0
>>
> 
> 
> 
> with many thanks
> abou
> 
> ______________________
> 
> 
> *AbouEl-Makarim Aboueissa, PhD*
> 
> *Professor, Mathematics and Statistics*
> *Graduate Coordinator*
> 
> *Department of Mathematics and Statistics*
> *University of Southern Maine*
> 
> 
> 
> On Sat, Apr 29, 2023 at 8:05 AM Rui Barradas <ruipbarradas using sapo.pt> wrote:
> 
>> At 14:09 on 28/04/2023, AbouEl-Makarim Aboueissa wrote:
>>> *R:* Grubbs test to detect all outliers per group for all columns in a
>>> data frame
>>>
>>>
>>>
>>> Dear All: good morning
>>>
>>> I have a dataset (as an example) with two factor columns (factor1 and
>>> factor2) and 5 numerical columns (X, Y, Z, U, V). The X and Y columns have
>>> the same length as factor1; Z, U, and V have the same length as factor2.
>>> The dataset is copied below. Please note that all dataset columns contain
>>> NA values.
>>>
>>> *Need help on this:*
>>>
>>>
>>> Can we use the grubbs.test() function to detect all outliers and replace
>>> them with NA, in X and Y per group of factor1, and in Z, U, and V per
>>> group of factor2? The columns in the data frame have different lengths,
>>> but when I read the .csv file, R added NA values to the shorter columns.
>>>
>>> If you need the .csv data file, please let me know.
>>>
>>>
>>> Thank you very much for your help in advance.
>>>
>>>
>>>
>>>
>>> install.packages("outliers")
>>> library(outliers)
>>>
>>> datafortest<-read.csv("G:/data_for_test.csv", header=TRUE)
>>> datafortest
>>>
>>> datafortest<-data.frame(datafortest)
>>>
>>> datafortest$factor1<-as.factor(datafortest$factor1)
>>> datafortest$factor2<-as.factor(datafortest$factor2)
>>>
>>> str(datafortest)
>>>
>>> ##### tried to use grubbs.test() on a single column of the dataframe, but
>>> ##### still not working
>>> tests.for.outliers.X<- grubbs.test(datafortest$X, na.rm = TRUE, type=11)
>>>
>>>
>>> ####################################
>>>
>>> *grubbs.test() on a single dataset: but this can only detect whether the
>>> min and the max are outliers.*
>>>
>>>
>>> xx999<-c(0.088,1,2,3,4,5,6,7,8,9,88,98,99)
>>> grubbs.test(xx999, type=11)
>>>
>>>
>>>
>>>
>>> With many thanks
>>>
>>> Abou
>>>
>>>
>>>
>>> factor1         X     Y  factor2     Z    U      V
>>>       1  4455.077   888        1   999   NA    999
>>>       1  4348.031   333        1   475   NA    240
>>>       1  9999.789   618        1   507  252    394
>>>       1  3813.139   417        1   603  332    265
>>>       1   7512.65   344        1   442  216     NA
>>>       1  5642.667    NA        1   486  217    275
>>>       1  6684.386   341        1   927  698    479
>>>       2  5165.731   999        1   971  311    562
>>>       2        NA   265        1   388  999    512
>>>       2  3259.241   557        2   888  444    777
>>>       2  3288.383   234        2   514   NA    322
>>>       2  1997.878   383        2   409  311     NA
>>>       2  99990.61    NA        2   546  327    728
>>>       2  2655.977    NA        2   523  228    653
>>>       3   3189.49  7777        2   313  456    450
>>>       3  1826.851   287        2   296  412    576
>>>       3  4386.002   352        2   320  251     NA
>>>       3  3295.091   308        2   388  888  396.5
>>>       3  2120.902   526        3  9999  398    888
>>>       3        NA   489        3   677  438    307
>>>       3  2056.123   291        3   555  428    219
>>>       3  1995.088   444        3    NA  319     NA
>>>       3        NA   349        3   479   NA    321
>>>       3  2539.873   333        3   257  406    417
>>>                                3   313  334    409
>>>                                3   296  465    546
>>>                                3   320  180    523
>>>                                3   388  999    313
>>>
>>>
>>>
>> Hello,
>>
>> With the data file you attached I cannot reproduce any errors; everything
>> ran fine on the first try.
>>
>>
>> library(outliers)
>>
>> fl <- "~/data_for_test.csv"
>> datafortest <- read.csv(fl)
>>
>> # these are not needed to run the test
>> datafortest$factor1 <- as.factor(datafortest$factor1)
>> datafortest$factor2 <- as.factor(datafortest$factor2)
>> str(datafortest)
>> #> 'data.frame':    28 obs. of  7 variables:
>> #>  $ factor1: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 2 2 2 ...
>> #>  $ X      : num  4455 4348 10000 3813 7513 ...
>> #>  $ Y      : int  888 333 618 417 344 NA 341 999 265 557 ...
>> #>  $ factor2: Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 2 ...
>> #>  $ Z      : int  999 475 507 603 442 486 927 971 388 888 ...
>> #>  $ U      : int  NA NA 252 332 216 217 698 311 999 444 ...
>> #>  $ V      : num  999 240 394 265 NA 275 479 562 512 777 ...
>> head(datafortest)
>> #>   factor1        X   Y factor2   Z   U   V
>> #> 1       1 4455.077 888       1 999  NA 999
>> #> 2       1 4348.031 333       1 475  NA 240
>> #> 3       1 9999.789 618       1 507 252 394
>> #> 4       1 3813.139 417       1 603 332 265
>> #> 5       1 7512.650 344       1 442 216  NA
>> #> 6       1 5642.667  NA       1 486 217 275
>>
>> ##### tried to use grubbs.test() on a single column of the dataframe, but
>> ##### still not working
>> grubbs.test(datafortest$X, type = 11)
>> #>
>> #>  Grubbs test for two opposite outliers
>> #>
>> #> data:  datafortest$X
>> #> G = 4.6640014, U = 0.0091756, p-value = 0.02867
>> #> alternative hypothesis: 1826.851 and 99990.608 are outliers
>>
>>
>>
>> Hope this helps,
>>
>> Rui Barradas
>>
>>
> 
Hello,

With this data set the problem seems to be what you want to consider an 
outlier. Types 10 and 11 give radically different results.
From the help page, section Details:

First test (10) is used to detect if the sample dataset contains one 
outlier, statistically different than the other values. Test is based by 
calculating score of this outlier G (outlier minus mean and divided by 
sd) and comparing it to appropriate critical values. Alternative method 
is calculating ratio of variances of two datasets - full dataset and 
dataset without outlier. The obtained value called U is bound with G by 
simple formula.

Second test (11) is used to check if lowest and highest value are two 
outliers on opposite tails of sample. It is based on calculation of 
ratio of range to standard deviation of the sample.

Third test (20) calculates ratio of variance of full sample and sample 
without two extreme observations. It is used to detect if dataset 
contains two outliers on the same tail.
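
As a check on the help-page description above, the type 10 and type 11
statistics can be computed by hand. This is only a sketch in base R: the X
values are copied from the dput() output in this thread, and the p-value
line assumes the usual t-distribution approximation for the one-outlier
test rather than anything taken from the package source.

```r
# X column from the dput() output in this thread (NAs included).
x <- c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
       876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
       2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
       NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA)
x <- x[!is.na(x)]   # keep the complete cases only
n <- length(x)

# Type 10: score of the value farthest from the mean,
# i.e. |outlier - mean| / sd.
g10 <- max(abs(x - mean(x))) / sd(x)

# Type 11: ratio of the range to the standard deviation.
g11 <- diff(range(x)) / sd(x)

# One-sided p-value for g10 via the t-distribution approximation
# (assumption: this mirrors what the package does for type 10).
t_stat <- sqrt(n * (n - 2) * g10^2 / ((n - 1)^2 - n * g10^2))
p10 <- min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))

print(c(G10 = g10, G11 = g11, p10 = p10))
```

The printed values can be compared with the G statistics and the type 10
p-value that grubbs.test() reports for datafortest$X further down.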

The results below seem to show that there are two outliers on the right
tail. Do you have reasons to believe this is true? That is a statistics
question, though; the code itself runs fine.

library(outliers)

datafortest$factor1 <- as.factor(datafortest$factor1)
datafortest$factor2 <- as.factor(datafortest$factor2)

grubbs.test(datafortest$X, type = 10)
#>
#>  Grubbs test for one outlier
#>
#> data:  datafortest$X
#> G = 2.6106, U = 0.6422, p-value = 0.04389
#> alternative hypothesis: highest value 994455.077 is an outlier

grubbs.test(datafortest$X, type = 11)
#>
#>  Grubbs test for two opposite outliers
#>
#> data:  datafortest$X
#> G = 3.04754, U = 0.63726, p-value = 1
#> alternative hypothesis: 5.088 and 994455.077 are outliers

grubbs.test(datafortest$X, type = 20)
#>
#>  Grubbs test for two outliers
#>
#> data:  datafortest$X
#> U = 0.33892, p-value < 2.2e-16
#> alternative hypothesis: highest values 883295.091 , 994455.077 are outliers
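
On the original question (flag outliers per group and replace them with
NA), one possible approach is to apply a one-outlier test repeatedly within
each group. The following is a rough sketch, not a recommendation: the
helpers grubbs_p() and na_outliers() are hypothetical names, not part of
the outliers package, the p-value uses the usual t-distribution
approximation in base R, and alpha = 0.05 and min_n = 4 are arbitrary
choices. Only factor1 with X and Y is shown; Z, U and V with factor2 would
be handled the same way.

```r
# Hypothetical helper: p-value of a one-outlier Grubbs test for the value
# farthest from the mean, via the t-distribution approximation.
grubbs_p <- function(x) {
  n <- length(x)
  g <- max(abs(x - mean(x))) / sd(x)
  denom <- max((n - 1)^2 - n * g^2, 0)   # guard against rounding below zero
  t_stat <- sqrt(n * (n - 2) * g^2 / denom)
  min(1, n * pt(t_stat, df = n - 2, lower.tail = FALSE))
}

# Hypothetical helper: set the most extreme value to NA for as long as the
# test stays significant. alpha and min_n are illustrative assumptions.
na_outliers <- function(x, alpha = 0.05, min_n = 4) {
  repeat {
    ok <- which(!is.na(x))
    if (length(ok) < min_n || sd(x[ok]) == 0) break
    if (grubbs_p(x[ok]) >= alpha) break
    worst <- ok[which.max(abs(x[ok] - mean(x[ok])))]
    x[worst] <- NA
  }
  x
}

# factor1, X and Y copied from the dput() output in this thread.
d <- data.frame(
  factor1 = factor(c(rep(1:3, times = c(7, 7, 10)), rep(NA, 4))),
  X = c(994455.077, 4348.031, 9999.789, 3813.139, 12.65, 5642.667,
        876684.386, 5165.731, NA, 3259.241, 8.383, 1997.878, 99990.608,
        2655.977, 9.49, 1826.851, 4386.002, 883295.091, 2120.902,
        NA, 2056.123, 5.088, NA, 92539.873, NA, NA, NA, NA),
  Y = c(76888L, 333L, 618L, 10L, 344L, NA, 3L, 86999L, 265L, 557L, 77777L,
        383L, NA, NA, 87777L, 287L, 352L, 308L, 999526L, 489L, 2L,
        444L, 9L, 333L, NA, NA, NA, NA)
)

# Apply the helper per level of factor1; rows with an NA factor are skipped.
cleaned <- d
for (lev in levels(d$factor1)) {
  rows <- which(d$factor1 == lev)
  for (col in c("X", "Y")) {
    cleaned[rows, col] <- na_outliers(d[rows, col])
  }
}
print(cleaned)
```

Keep in mind that a single-outlier test applied repeatedly may flag nothing
when a group contains several extreme values at once, because they inflate
the standard deviation and mask each other. Whether this procedure is
appropriate at all is, again, a statistics question.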

Hope this helps,

Rui Barradas