[R] Removing Outliers Function

Wed Feb 9 04:36:15 CET 2011

David,

Please allow me to digress a lot here. You are one of the few (including yours truly!) who use the phrase "shallow learning curve" to indicate that something is difficult to learn (I assume this is what you meant). I have always felt that "steep learning curve" was incorrect. If you plot the amount learned on the Y-axis and time on the X-axis, a steep learning curve means that one learns very quickly, which is just the opposite of what is usually meant.

Best,
Ravi.
____________________________________________________________________

Ravi Varadhan, Ph.D.
Assistant Professor,
Division of Geriatric Medicine and Gerontology
School of Medicine
Johns Hopkins University

Ph. (410) 502-2619
email: rvaradhan at jhmi.edu

----- Original Message -----
From: David Winsemius <dwinsemius at comcast.net>
Date: Tuesday, February 8, 2011 10:09 pm
Subject: Re: [R] Removing Outliers Function
To: kirtau <kirtau at live.com>
Cc: r-help at r-project.org

>  On Feb 8, 2011, at 9:11 PM, kirtau wrote:
>  
>  >
>  >I am working on a function that will remove outliers for regression analysis.
>  >I am stating that a data point is an outlier if its studentized residual is
>  >above 3 or below -3. The code below is what I have thus far for the function:
>  >
>  >x = c(1:20)
>  >y = c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
>  >data1 = data.frame(x,y)
>  >
>  >
>  >rm.outliers = function(dataset,dependent,independent){
>  >   dataset$predicted = predict(lm(dependent~independent))
>  >   dataset$stdres = rstudent(lm(dependent~independent))
>  >   m = 1
>  >   for(i in 1:length(dataset$stdres)){
>  >     dataset$outlier_counter[i] = if(dataset$stdres[i] >= 3 |
>  >dataset$stdres[i] <= -3) {m} else{0}
>  >   }
>  >   j = length(which(dataset$outlier_counter >= 1))
>  >   while(j>=1){
>  >     print(dataset[which(dataset$outlier_counter >= 1),])
>  >     dataset = dataset[which(dataset$outlier_counter == 0),]
>  >     dataset$predicted = predict(lm(dependent~independent))
>  >     dataset$stdres = rstudent(lm(dependent~independent))
>  >       m = m+1
>  >       for(k in 1:length(dataset$stdres)){
>  >         dataset$outlier_counter[k] = if(dataset$stdres[k] >= 3 |
>  >dataset$stdres[k] <= -3) {m} else{0}
>  >       }
>  >     j = length(which(dataset$outlier_counter >= 1))
>  >   }
>  >   return(dataset)
>  >}
>  >
>  >The problem that I run into is that I receive this error when I type
>  >
>  >rm.outliers(data1,data1$y,data1$x)
>  >
>  >"    x  y predicted   stdres outlier_counter
>  >16 16 85  22.98647 24.04862               1
>  >Error in `$<-.data.frame`(`*tmp*`, "predicted", value = c(0.114285714285714, :
>  > replacement has 20 rows, data has 19"
>  >
>  >Note: the outlier_counter variable is used to state which "round" of the
>  >loop the datapoint was marked as an outlier.
>  >
>  >This would be a HUGE help to me and a few buddies who run a lot of different
>  >regression tests.
>  
>  The solution is about 3 or 4 lines of code to make the function, but
>  removing outliers like this is simply statistical malpractice. Maybe
>  it's a good thing that R has a shallow learning curve.
>  
>  -- 
>  
>  David Winsemius, MD
>  West Hartford, CT
>  
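
For readers wondering where the "replacement has 20 rows, data has 19" error comes from: inside the loop the function refits lm(dependent~independent), but dependent and independent are the vectors data1$y and data1$x that were passed in, and those still have 20 elements after dataset has been cut down to 19 rows. The fit therefore returns 20 predicted values, which cannot be assigned to a 19-row data frame. A minimal sketch of one way around this is to pass a formula and refit against the current rows of the data frame on every pass. The function name rm_outliers_sketch, its arguments, and the pass column are hypothetical, not code from the thread, and this shows only the mechanics; David's caution about removing outliers this way still stands.

```r
# Sketch only: rm_outliers_sketch() and its arguments are hypothetical names.
rm_outliers_sketch <- function(dataset, formula, cutoff = 3) {
  pass <- 0        # which round of the loop we are in
  removed <- NULL  # rows dropped so far, tagged with the round they were dropped in
  repeat {
    # Refit on the *current* rows so the residuals match nrow(dataset).
    fit <- lm(formula, data = dataset)
    rs <- rstudent(fit)
    # Treat undefined residuals as "not flagged" so indexing stays safe.
    flagged <- !is.na(rs) & abs(rs) >= cutoff
    if (!any(flagged)) break
    pass <- pass + 1
    removed <- rbind(removed,
                     cbind(dataset[flagged, , drop = FALSE], pass = pass))
    dataset <- dataset[!flagged, , drop = FALSE]
  }
  list(kept = dataset, removed = removed)
}

# Data from the original post.
x <- 1:20
y <- c(1,3,4,2,5,6,18,8,10,8,11,13,14,14,15,85,17,19,19,20)
data1 <- data.frame(x, y)

result <- rm_outliers_sketch(data1, y ~ x)
print(result$removed)
```

Returning the removed rows alongside the kept ones, instead of printing them inside the loop, makes it easier to inspect what was dropped and in which round before deciding whether dropping it was defensible.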