[R] Removing Outliers Function

David Winsemius dwinsemius at comcast.net
Wed Feb 9 04:05:02 CET 2011


On Feb 8, 2011, at 9:11 PM, kirtau wrote:

>
> I am working on a function that will remove outliers for regression
> analysis. I am stating that a data point is an outlier if its
> studentized residual is above 3 or below -3. The code below is what
> I have thus far for the function:
>
> x = c(1:20)
> y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
> data1 = data.frame(x,y)
>
>
> rm.outliers = function(dataset,dependent,independent){
>    dataset$predicted = predict(lm(dependent~independent))
>    dataset$stdres = rstudent(lm(dependent~independent))
>    m = 1
>    for(i in 1:length(dataset$stdres)){
>      dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
> dataset$stdres[i] <= -3) {m} else{0}
>    }
>    j = length(which(dataset$outlier_counter >= 1))
>    while(j>=1){
>      print(dataset[which(dataset$outlier_counter >= 1),])
>      dataset = dataset[which(dataset$outlier_counter == 0),]
>      dataset$predicted = predict(lm(dependent~independent))
>      dataset$stdres = rstudent(lm(dependent~independent))
>        m = m+1
>        for(k in 1:length(dataset$stdres)){
>          dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
> dataset$stdres[k] <= -3) {m} else{0}
>        }
>      j = length(which(dataset$outlier_counter >= 1))
>    }
>    return(dataset)
> }
>
> The problem that I run into is that I receive this error when I type
>
> rm.outliers(data1,data1$y,data1$x)
>
> "    x  y predicted   stdres outlier_counter
> 16 16 85  22.98647 24.04862               1
> Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714, :
>  replacement has 20 rows, data has 19"
>
> Note: the outlier_counter variable records the "round" of the loop
> in which the data point was marked as an outlier.
>
> This would be a HUGE help to me and a few buddies who run a lot of
> different regression tests.

The solution is about 3 or 4 lines of code to make the function, but  
removing outliers like this is simply statistical malpractice. Maybe  
it's a good thing that R has a shallow learning curve.
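
On the mechanics only: the error arises because `dependent` and
`independent` are the original 20-element vectors, so after a row is
dropped from `dataset` the refit still returns 20 values for a 19-row
data frame. The following is a minimal sketch of one way around that,
not a recommendation to delete points. It assumes the model is passed
as a formula and refit with `data = dataset` on each pass; the formula
argument, the `cutoff` argument, and the variable `pass` are
illustrative choices that are not in the original function, and the
data are assumed to contain no missing values.

```r
# Hypothetical rewrite: the formula interface and `cutoff` are assumptions.
rm.outliers <- function(dataset, formula, cutoff = 3) {
  pass <- 1
  repeat {
    # Refit on the rows that remain, so the residuals always have
    # exactly as many elements as `dataset` has rows.
    fit <- lm(formula, data = dataset)
    # which() returns row positions and silently drops NA/NaN residuals.
    flagged <- which(abs(rstudent(fit)) >= cutoff)
    if (length(flagged) == 0) break
    cat("Pass", pass, "flagged:\n")
    print(dataset[flagged, ])
    dataset <- dataset[-flagged, ]
    pass <- pass + 1
  }
  dataset
}

# Data from the original post.
x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

cleaned <- rm.outliers(data1, y ~ x)
```

Because each pass changes the fit, points that looked unremarkable
next to the first outlier can be flagged later, which is part of why
this kind of automatic deletion deserves the caution above.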

-- 

David Winsemius, MD
West Hartford, CT


