[R] detect and replace outliers by the average

Thu Apr 20 21:16:54 CEST 2023

At 19:58 on 20/04/2023, Rui Barradas wrote:
> At 19:46 on 20/04/2023, AbouEl-Makarim Aboueissa wrote:
>> Hi Rui:
>>
>>
>> here is the dataset
>>
>> factor x1 x2
>> 0 700 700
>> 0 700 500
>> 0 470 470
>> 0 710 560
>> 0 5555 520
>> 0 610 720
>> 0 710 670
>> 0 610 9999
>> 1 690 620
>> 1 580 540
>> 1 690 690
>> 1 NA 401
>> 1 450 580
>> 1 700 700
>> 1 400 8888
>> 1 6666 600
>> 1 500 400
>> 1 680 650
>> 2 117 63
>> 2 120 68
>> 2 130 73
>> 2 120 69
>> 2 125 54
>> 2 999 70
>> 2 165 62
>> 2 130 987
>> 2 123 70
>> 2 78
>> 2 98
>> 2 5
>> 2 321 NA
>>
>> with many thanks
>> abou
>> ______________________
>>
>>
>> *AbouEl-Makarim Aboueissa, PhD*
>>
>> *Professor, Mathematics and Statistics*
>> *Graduate Coordinator*
>>
>> *Department of Mathematics and Statistics*
>> *University of Southern Maine*
>>
>>
>>
>> On Thu, Apr 20, 2023 at 2:44 PM Rui Barradas <ruipbarradas using sapo.pt> wrote:
>>
>>> At 19:36 on 20/04/2023, AbouEl-Makarim Aboueissa wrote:
>>>> Dear All:
>>>>
>>>>
>>>>
>>>> *Re:* detect and replace outliers by the average
>>>>
>>>>
>>>>
>>>> The dataset, please see attached, contains a grouping factor column
>>>> “*factor*” and two columns of data “x1” and “x2” with some NA values. I
>>>> need some help to detect the outliers and replace them and the NAs with
>>>> the average within each level (0,1,2) for each variable “x1” and “x2”.
>>>>
>>>>
>>>>
>>>> I tried the below code, but it did not accomplish what I want to do.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> data<-read.csv("G:/20-Spring_2023/Outliers/data.csv", header=TRUE)
>>>>
>>>> data
>>>>
>>>> replace_outlier_with_mean <- function(x) {
>>>>
>>>>     replace(x, x %in% boxplot.stats(x)$out, mean(x, na.rm=TRUE))  #### , na.rm=TRUE NOT working
>>>>
>>>> }
>>>>
>>>> data[] <- lapply(data, replace_outlier_with_mean)
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> Thank you all very much for your help in advance.
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> with many thanks
>>>>
>>>> abou
>>>>
>>>> ______________________________________________
>>>> R-help using r-project.org mailing list -- To UNSUBSCRIBE and more, see
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>> Hello,
>>>
>>> There is no data set attached, see the posting guide on what file
>>> extensions are allowed as attachments.
>>>
>>> As for the question, try to compute mean(x, na.rm = TRUE) first, then
>>> use this value in the replace instruction. Without data I'm just guessing.
>>>
>>> Hope this helps,
>>>
>>> Rui Barradas
>>>
>>>
>>
> Hello,
> 
> Here is a way. It uses ave() inside the function so that the replacement is done separately within each level of the factor.
> 
> 
> df1 <- "factor x1 x2
> 0 700 700
> 0 700 500
> 0 470 470
> 0 710 560
> 0 5555 520
> 0 610 720
> 0 710 670
> 0 610 9999
> 1 690 620
> 1 580 540
> 1 690 690
> 1 NA 401
> 1 450 580
> 1 700 700
> 1 400 8888
> 1 6666 600
> 1 500 400
> 1 680 650
> 2 117 63
> 2 120 68
> 2 130 73
> 2 120 69
> 2 125 54
> 2 999 70
> 2 165 62
> 2 130 987
> 2 123 70
> 2 78 NA
> 2 98 NA
> 2 5 NA
> 2 321 NA"
> df1 <- read.table(text = df1, header = TRUE,
>                    colClasses = c("factor", "numeric", "numeric"))
> 
> 
> replace_outlier_with_mean <- function(x, f) {
>    ave(x, f, FUN = \(y) {
>      i <- is.na(y) | y %in% boxplot.stats(y, do.conf = FALSE)$out
>      y[i] <- mean(y, na.rm = TRUE)
>      y
>    })
> }
> 
> lapply(df1[-1], replace_outlier_with_mean, f = df1$factor)
> #> $x1
> #>  [1]  700.0000  700.0000  470.0000  710.0000 1258.1250  610.0000  710.0000
> #>  [8]  610.0000  690.0000  580.0000  690.0000 1261.7778  450.0000  700.0000
> #> [15]  400.0000 1261.7778  500.0000  680.0000  117.0000  120.0000  130.0000
> #> [22]  120.0000  125.0000  194.6923  194.6923  130.0000  123.0000  194.6923
> #> [29]   98.0000  194.6923  194.6923
> #>
> #> $x2
> #>  [1]  700.0000  500.0000  470.0000  560.0000  520.0000  720.0000  670.0000
> #>  [8] 1767.3750  620.0000  540.0000  690.0000  401.0000  580.0000  700.0000
> #> [15] 1406.9000  600.0000  400.0000  650.0000   63.0000   68.0000   73.0000
> #> [22]   69.0000   54.0000   70.0000   62.0000  168.4444   70.0000  168.4444
> #> [29]  168.4444  168.4444  168.4444
> 
> 
> Hope this helps,
> 
> Rui Barradas
> 
Hello,

Here is a simpler version of the same function, this time using replace(),
as in the OP's code. The results of the two versions are identical().

replace_outlier_with_mean <- function(x, f) {
   ave(x, f, FUN = \(y) {
     i <- is.na(y) | y %in% boxplot.stats(y, do.conf = FALSE)$out
     replace(y, i, mean(y, na.rm = TRUE))
   })
}
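
One detail that may matter: in the function above the replacement value is mean(y, na.rm = TRUE), so the flagged outliers themselves still contribute to the group mean. If the intent is the mean of the remaining values only, a possible variant is sketched below. The name replace_outlier_with_clean_mean and the toy data are made up for illustration; they are not from the OP's file.

```r
# The function from this mail, repeated so the example runs on its own.
replace_outlier_with_mean <- function(x, f) {
  ave(x, f, FUN = \(y) {
    i <- is.na(y) | y %in% boxplot.stats(y, do.conf = FALSE)$out
    replace(y, i, mean(y, na.rm = TRUE))
  })
}

# Hypothetical variant: the mean is taken only over the values that are
# neither NA nor flagged by boxplot.stats().
replace_outlier_with_clean_mean <- function(x, f) {
  ave(x, f, FUN = \(y) {
    i <- is.na(y) | y %in% boxplot.stats(y, do.conf = FALSE)$out
    replace(y, i, mean(y[!i]))
  })
}

# Made-up toy data: two groups, one extreme value (500) and one NA.
toy <- data.frame(
  g = factor(rep(c("a", "b"), each = 6)),
  x = c(10, 11, 12, 13, 14, 500, 1, 2, NA, 3, 4, 5)
)

toy$with_mean  <- replace_outlier_with_mean(toy$x, toy$g)
toy$clean_mean <- replace_outlier_with_clean_mean(toy$x, toy$g)
toy
```

In group "a" the first version replaces 500 with 560/6 (about 93.3), because 500 is part of the mean, while the variant replaces it with 12. In group "b" nothing is flagged and both versions replace the NA with 3. Which behaviour is wanted depends on what "the average" is meant to be.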

Also, the data I copied and pasted in my previous mail is wrong: three
NAs are in the wrong column. Reading the data directly from the
tab-separated file, as follows, is better.

df1 <- read.table("data.txt", header = TRUE, sep = "\t",
                   colClasses = c("factor", "numeric", "numeric"))
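
As a small check of why the tab separator helps: with sep = "\t" an empty cell is still a field of its own, and read.table() turns blank fields in numeric columns into NA. The sketch below uses a made-up two-row text in place of data.txt, which is assumed to be tab-delimited.

```r
# Made-up tab-separated text standing in for data.txt; the second data
# row has an empty x1 field (two consecutive tabs).
txt <- "factor\tx1\tx2\n2\t123\t70\n2\t\t78\n"

d <- read.table(text = txt, header = TRUE, sep = "\t",
                colClasses = c("factor", "numeric", "numeric"))
d

# The blank x1 cell is read as NA and 78 stays in column x2.
is.na(d$x1)
```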

Hope this helps,

Rui Barradas