[R] outliers/interval data extraction

Jason Turner jasont at indigoindustrial.co.nz
Thu Feb 20 19:10:03 CET 2003


On Thu, Feb 20, 2003 at 06:37:48PM -0500, Rado Bonk wrote:
> Dear R-users,
> 
> I have two outliers related questions.
> 
> I.
> I have a vector consisting of 69 values.
> 
> mean = 0.00086
> SD = 0.02152
> 
> The shape of EDA graphics (boxplots, density plots) is heavily distorted
> due to outliers. How to define the interval for outliers exception? Is
> <2SD - mean + 2SD> interval a correct approach?

Yikes.  

There's been a lot of discussion of this over the years; these
discussions usually generate more heat than light.

<personal bias>
Throwing away outliers without further investigation is often
considered a bad idea.  The argument is that you get into a situation
where you are rejecting data because it doesn't fit the model, which
is a strange approach.  The most famous case of this was satellite
data on ozone thickness over Antarctica - the ozone hole was missed
for years because of an automatic outlier-rejection routine in the
data analysis.  If those outliers hadn't been rejected, corrective
steps could've been taken sooner, avoiding a lot of damage.

My own work is in industrial process control - if I ignored outliers,
I'd make an awful lot of very bad mistakes, and wouldn't have a job
for long. 

Outliers aren't necessarily wrong - sometimes the data is trying to
tell you something.
</personal bias>

Robust summaries are another way.  Check out the help pages for mad(),
IQR(), fivenum().  
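
To show what these robust summaries buy you, here is a minimal sketch.
The data are made up (67 roughly normal values plus two gross outliers,
69 in total); they are an assumption for illustration, not the
poster's actual vector.

```r
# Hypothetical data: 67 well-behaved values plus two gross outliers
set.seed(1)
myvec <- c(rnorm(67, mean = 0, sd = 0.01), 0.12, -0.10)

# Classical summaries are pulled around by the two extreme values
classical <- c(mean = mean(myvec), sd = sd(myvec))
print(classical)

# Robust counterparts are far less affected by them
robust <- c(median = median(myvec), mad = mad(myvec), iqr = IQR(myvec))
print(robust)

# Tukey's five-number summary: min, lower hinge, median, upper hinge, max
five <- fivenum(myvec)
print(five)
```

With data like this, sd() comes out noticeably larger than mad(),
because two points out of 69 dominate the sum of squares.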

Having said that, if you want to compare outlier-free data with your
raw data to help enlighten you about where those outliers might be
coming from, something like this might help...

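# robust centre and spread; mad() is scaled to estimate the SD for normal data
ss <- mad(myvec)
mm <- median(myvec)
# keep points within 3 MADs of the median
ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)
# or keep the central 95% (this always drops about 5% of the points)
ind2 <- (myvec > quantile(myvec, 0.025) & myvec < quantile(myvec, 0.975))

boxplot(myvec[ind])
boxplot(myvec[ind2])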
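
In the spirit of investigating rather than discarding, it may also help
to list the points the rule sets aside and to draw both versions on one
scale.  A rough sketch follows, again on made-up data; the names
flagged and report are only illustrative.

```r
# Hypothetical data standing in for myvec
set.seed(1)
myvec <- c(rnorm(67, mean = 0, sd = 0.01), 0.12, -0.10)

ss <- mad(myvec)
mm <- median(myvec)
ind <- (myvec > mm - 3*ss & myvec < mm + 3*ss)

# Positions and values of the points the 3-MAD rule would set aside
flagged <- which(!ind)
report <- data.frame(position = flagged, value = myvec[flagged])
print(report)

# Raw and trimmed data side by side on a common scale
boxplot(list(raw = myvec, trimmed = myvec[ind]))
```

The positions in the report are the places to start asking what
happened in the process when those values were recorded.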

Cheers

Jason
-- 
Indigo Industrial Controls Ltd.
64-21-343-545
jasont at indigoindustrial.co.nz



