[R] Identifying outliers in non-normally distributed data

Wed Dec 30 18:47:03 CET 2009

Greetings:

I could also use guidance on this topic. I provide manure sample proficiency
sets to agricultural labs in the United States and Canada. There are about
65 labs in the program.

My data sets are much smaller and typically non-symmetrical with obvious
outliers. Usually, there are 30 to 60 sets of data, each with triple
replicates (90 to 180 observations).

There are definitely outliers caused by the following: reporting in the
wrong units, sending in the wrong spreadsheet, entering data in the wrong
row, misplacing decimal points, calculation errors, etc. For each analysis,
it is common that two to three labs make these types of errors. 

Since there are replicates, errors like misplaced decimal points are more
obvious. However, most of the outlier errors are repeated for all three
replicates. 

I use the median and the Median Absolute Deviation (MAD, constant = 1) to
flag labs for accuracy. A lab is flagged when the average of its three reps
deviates from the median by more than 2.5 MAD. With this method, it is not
necessary to identify and remove the outliers first.
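
A minimal sketch of this flagging rule in R. The data are simulated, the
function name flag_labs is made up for illustration, and I am assuming the
median and MAD are computed over all individual observations (they could
just as well be computed over the lab averages):

```r
# Hypothetical data: 30 labs (rows), three replicates each (columns).
set.seed(1)
reps <- matrix(rnorm(30 * 3, mean = 5, sd = 0.2), ncol = 3)
reps[4, ]  <- reps[4, ] * 10     # lab 4: misplaced decimal in all three reps
reps[17, ] <- reps[17, ] / 1000  # lab 17: reported in the wrong units

# Flag labs whose replicate average lies more than k MADs from the median.
# Assumption: median and MAD are taken over all individual observations.
flag_labs <- function(reps, k = 2.5) {
  obs      <- as.vector(reps)
  center   <- median(obs)
  spread   <- mad(obs, constant = 1)  # raw MAD, no normal-consistency scaling
  lab_mean <- rowMeans(reps)
  abs(lab_mean - center) > k * spread
}

flagged <- flag_labs(reps)
which(flagged)  # row numbers of the flagged labs
```

Because the median and MAD are hardly moved by a few gross errors, the two
bad labs do not need to be removed before the thresholds are computed.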

A colleague suggested running the data twice. On the first run, values more
than 4.0 MAD units from the median are removed as outliers. On the second
run, values deviating from the median by more than 2.9 times the MAD are
flagged for accuracy. I tried this in R with a normally distributed data set
of 100,000 values, and the values beyond 4.0 MAD were nearly identical to
the outliers identified with boxplot.
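
A rough sketch of the two-run procedure and of the boxplot comparison, again
in R. The function name two_pass_flags is hypothetical, and I am assuming
that values removed on the first run also count as flagged. The agreement
with boxplot is expected for normal data: with constant = 1 the MAD is about
0.6745 sigma, so 4.0 MAD is about 2.70 sigma, and the boxplot fence at the
upper quartile plus 1.5 times the IQR is 0.6745 + 1.5 * 1.349 = 2.70 sigma
as well.

```r
# Two-run flagging: drop gross outliers first, then flag against the
# median and MAD recomputed from what is left.
two_pass_flags <- function(x, k_remove = 4.0, k_flag = 2.9) {
  # Run 1: remove values more than k_remove MADs from the median.
  keep <- abs(x - median(x)) <= k_remove * mad(x, constant = 1)
  # Run 2: recompute median and MAD on the retained values.
  center2 <- median(x[keep])
  spread2 <- mad(x[keep], constant = 1)
  # Assumption: values removed in run 1 are reported as flagged too.
  flagged <- !keep | abs(x - center2) > k_flag * spread2
  list(removed = !keep, flagged = flagged)
}

# Comparison with boxplot on simulated normal data.
set.seed(42)
z <- rnorm(100000)
mad_out <- abs(z - median(z)) > 4 * mad(z, constant = 1)
box_out <- z %in% grDevices::boxplot.stats(z)$out  # default coef = 1.5
agreement <- mean(mad_out == box_out)  # share of values classified alike
agreement
```

For skewed data the two rules need not agree this closely, since the boxplot
fences are asymmetric around the median while the MAD rule is symmetric.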

With my data sets, the flags do not change very much whether the data are
run once with the flags set at 2.5 MAD units or run twice, removing the
4.0 MAD outliers on the first run and flagging the remaining data at 2.9 MAD
units on the second. Either of these methods might work for you, but I am
not sure of their statistical value.

Yours,

Jerry Floren

Brian G. Peterson wrote:
> 
> John wrote:
>> Hello,
>> 
>> I've been searching for a method to identify outliers for quite some
>> time now. The complication is that I cannot assume that my data is
>> normally distributed nor symmetrical (i.e. some distributions might
>> have one longer tail) so I have not been able to find any good tests.
>> The Walsh's Test (http://www.statistics4u.info/
>> fundsta...liertest.html#), as I understand assumes that the data is
>> symmetrical for example.
>> 
>> Also, while I've found some interesting articles:
>> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
>> Statistics & Non-parametric Methods")
>> I don't really know what to use.
>> 
>> Any ideas? Any R packages available for this? Thanks!
>> 
>> PS. My data has 1000's of observations..
> 
> Take a look at package 'robustbase', it provides most of the standard
> robust measures and calculations.
> 
> While you didn't say what kind of data you're trying to identify outliers
> in, if it is time series data the function Return.clean in
> PerformanceAnalytics may be useful.
> 
> Regards,
> 
>    - Brian
> 
> 
> -- 
> Brian G. Peterson
> http://braverock.com/brian/
> Ph: 773-459-4973
> IM: bgpbraverock
> 

-- 
View this message in context: http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p991062.html
Sent from the R help mailing list archive at Nabble.com.