[R] Identifying outliers in non-normally distributed data

Jerry Floren jerry.floren at state.mn.us
Fri Jan 8 21:06:45 CET 2010


Thank you, Kevin. I'm looking forward to trying your function when I get back
to the office.

Jerry Floren
Minnesota Department of Agriculture


Kevin Wright-5 wrote:
> 
> Here is a simple function I use.  It uses Median +/- 5.2 * MAD.  If I
> recall, this flags about 1/2000 of values from a true Normal distribution.
> 
> is.outlier <- function(x) {
>     # See: Davies, P.L. and Gather, U. (1993).
>     # "The identification of multiple outliers" (with discussion)
>     # J. Amer. Statist. Assoc., 88, 782-801.
> 
>     # Use na.rm = TRUE instead of na.omit(x) so that the result has the
>     # same length as x; NA inputs give NA in the output.
>     lims <- median(x, na.rm = TRUE) +
>         c(-1, 1) * 5.2 * mad(x, constant = 1, na.rm = TRUE)
>     x < lims[1] | x > lims[2]
> }
> 
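The "about 1/2000" figure can be checked numerically. The sketch below is an illustration and not part of Kevin's post: the helper name is_outlier_mad, the sample size and the seed are assumptions chosen for the example. It compares the empirical flag rate on simulated Normal data with the rate implied by the fact that the unscaled MAD of a Normal distribution equals qnorm(0.75) standard deviations.

```r
# Hypothetical helper: TRUE where x lies more than k unscaled MADs from the median.
is_outlier_mad <- function(x, k = 5.2) {
  centre <- median(x, na.rm = TRUE)
  spread <- mad(x, constant = 1, na.rm = TRUE)
  abs(x - centre) > k * spread
}

set.seed(1)          # arbitrary seed, only for reproducibility
z <- rnorm(1e6)      # simulated standard Normal data
empirical <- mean(is_outlier_mad(z))

# For a Normal, unscaled MAD = qnorm(0.75) * sigma, so 5.2 MADs is
# about 3.5 standard deviations; the two-sided tail area follows.
theoretical <- 2 * pnorm(-5.2 * qnorm(0.75))

c(empirical = empirical, theoretical = theoretical)
```

Both rates should come out near 4.5 per 10,000, i.e. roughly 1 in 2,200, which is in line with the recollection above.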
> Maybe the function should be called "is.patentable".  I definitely agree
> with Bert's comments.
> 
> Kevin Wright
> 
> 
> 
> On Wed, Dec 30, 2009 at 11:47 AM, Jerry Floren
> <jerry.floren at state.mn.us> wrote:
> 
>>
>> Greetings:
>>
>> I could also use guidance on this topic. I provide manure sample
>> proficiency sets to agricultural labs in the United States and Canada.
>> There are about 65 labs in the program.
>>
>> My data sets are much smaller and typically non-symmetrical with obvious
>> outliers. Usually, there are 30 to 60 sets of data, each with triple
>> replicates (90 to 180 observations).
>>
>> There are definitely outliers caused by the following: reporting in the
>> wrong units, sending in the wrong spreadsheet, entering data in the wrong
>> row, misplacing decimal points, calculation errors, etc. For each
>> analysis, it is common that two to three labs make these types of errors.
>>
>> Since there are replicates, errors like misplaced decimal points are more
>> obvious. However, most of the outlier errors are repeated for all three
>> replicates.
>>
>> I use the median and Median Absolute Deviation (MAD, constant = 1) to
>> flag labs for accuracy. Labs where the average of their three reps
>> deviates more than 2.5 MAD values from the median are flagged for
>> accuracy. With this method, it is not necessary to identify the outliers.
>>
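A minimal sketch of the single-pass flagging rule described above, on simulated data. It assumes the median and MAD are computed across the lab means (the post does not say whether they are taken over lab means or over all individual replicates). The data frame, its column names and the decimal-point error injected for lab 7 are made up for illustration.

```r
set.seed(42)                       # arbitrary seed
n_labs <- 40

# Simulated proficiency data: three replicates per lab.
labs <- data.frame(
  lab   = rep(seq_len(n_labs), each = 3),
  value = rnorm(3 * n_labs, mean = 10, sd = 0.5)
)

# Lab 7 misplaces the decimal point in all three replicates.
labs$value[labs$lab == 7] <- labs$value[labs$lab == 7] * 10

# Average the replicates per lab, then compare each lab mean with the
# median of the lab means, in units of the unscaled MAD.
lab_means <- tapply(labs$value, labs$lab, mean)
centre    <- median(lab_means)
spread    <- mad(lab_means, constant = 1)
flagged   <- abs(lab_means - centre) > 2.5 * spread

names(lab_means)[flagged]          # labs flagged for accuracy
```

For Normal data, 2.5 unscaled MADs is about 1.7 standard deviations, so this rule would be expected to flag roughly 9% of labs even when no gross errors are present.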
>> A colleague suggested running the data twice. On the first run, outliers
>> more than 4.0 MAD units from the median are removed. On the second run,
>> values exceeding 2.9 times the MAD are flagged for accuracy. I tried this
>> in R with a normally distributed data set of 100,000 values, and the 4.0
>> MAD values were nearly identical to the outliers identified with boxplot.
>>
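The close agreement with boxplot is expected for symmetric data. With q = qnorm(0.75), the Normal quartiles sit at +/- q and the IQR is 2q, so the boxplot fences at 1.5 IQR beyond the quartiles fall at +/- 4q, and the unscaled MAD is also q. The sketch below checks this on simulated data; the seed and variable names are arbitrary choices.

```r
set.seed(2010)                     # arbitrary seed
x <- rnorm(1e5)                    # simulated Normal data, 100,000 values

# Rule 1: more than 4 unscaled MADs from the median.
mad_out <- abs(x - median(x)) > 4 * mad(x, constant = 1)

# Rule 2: points beyond the boxplot whiskers (1.5 * IQR past the hinges).
box_out <- x %in% boxplot.stats(x)$out

# Cross-tabulate the two rules and compute the share of points on which
# they agree.
table(mad_rule = mad_out, boxplot = box_out)
agreement <- mean(mad_out == box_out)
agreement

# Population-level check: upper boxplot fence versus 4 MADs for a Normal.
q <- qnorm(0.75)
c(boxplot_fence = q + 1.5 * (2 * q), four_mad = 4 * q)
```

For a skewed distribution the two rules no longer coincide, because the boxplot fences are measured from the quartiles while the MAD rule is symmetric about the median.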
>> With my data set, the flags do not change very much if the data is run
>> one time with the flags set at 2.5 MAD units compared to running the data
>> twice, removing the 4.0 MAD outliers and flagging the second set at 2.9
>> MAD units. Using either one of these methods might work for you, but I am
>> not sure of the statistical value of these methods.
>>
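The two procedures can be compared side by side with a short sketch. The function names one_pass and two_pass, the simulated right-skewed sample and the three planted gross errors are assumptions for illustration. The sketch also assumes that values removed in the first pass stay flagged in the second, which the post does not state explicitly.

```r
# Single pass: flag values more than k unscaled MADs from the median.
one_pass <- function(x, k = 2.5) {
  abs(x - median(x)) > k * mad(x, constant = 1)
}

# Two passes: drop values beyond k_remove MADs, recompute the median and
# MAD on what is left, then flag at k_flag MADs (removed values stay flagged).
two_pass <- function(x, k_remove = 4.0, k_flag = 2.9) {
  keep   <- abs(x - median(x)) <= k_remove * mad(x, constant = 1)
  centre <- median(x[keep])
  spread <- mad(x[keep], constant = 1)
  !keep | abs(x - centre) > k_flag * spread
}

set.seed(7)                                   # arbitrary seed
# 57 simulated right-skewed results plus three gross errors at the end.
x <- c(rlnorm(57, meanlog = 2, sdlog = 0.3), 95, 110, 0.2)

flags <- data.frame(x = x, one = one_pass(x), two = two_pass(x))
table(one_pass = flags$one, two_pass = flags$two)
```

The off-diagonal cells of the table count the values on which the two procedures disagree.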
>> Yours,
>>
>> Jerry Floren
>>
>>
>>
>> Brian G. Peterson wrote:
>> >
>> > John wrote:
>> >> Hello,
>> >>
>> >> I've been searching for a method for identifying outliers for quite
>> >> some time now. The complication is that I cannot assume that my data is
>> >> normally distributed or symmetrical (i.e. some distributions might
>> >> have one longer tail), so I have not been able to find any good tests.
>> >> Walsh's Test (http://www.statistics4u.info/
>> >> fundsta...liertest.html#), for example, assumes symmetry as far as I
>> >> understand.
>> >>
>> >> Also, while I've found some interesting articles:
>> >> http://tinyurl.com/yc7w4oq ("Missing Values, Outliers, Robust
>> >> Statistics & Non-parametric Methods")
>> >> I don't really know what to use.
>> >>
>> >> Any ideas? Any R packages available for this? Thanks!
>> >>
>> >> PS. My data has thousands of observations.
>> >
>> > Take a look at package 'robustbase', it provides most of the standard
>> > robust measures and calculations.
>> >
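As one illustration of what robustbase offers for skewed data, the sketch below uses its medcouple (mc) and skewness-adjusted boxplot statistics (adjboxStats) and compares the result with the standard boxplot rule. The simulated log-normal sample and the two planted extreme values are assumptions for the example, and the code is skipped if robustbase is not installed.

```r
set.seed(123)                                  # arbitrary seed
# Simulated right-skewed data with two planted extreme values.
x <- c(rlnorm(2000, meanlog = 0, sdlog = 0.6), 60, 75)

if (requireNamespace("robustbase", quietly = TRUE)) {
  mc_x <- robustbase::mc(x)            # medcouple: robust skewness measure
  adj  <- robustbase::adjboxStats(x)   # skewness-adjusted boxplot statistics
  std  <- boxplot.stats(x)             # standard boxplot rule, for comparison
  c(medcouple  = mc_x,
    n_adjusted = length(adj$out),
    n_standard = length(std$out))
} else {
  message("robustbase is not installed; try install.packages('robustbase')")
}
```

On right-skewed data the adjusted rule widens the upper fence, so it would typically flag fewer points in the long tail than the standard boxplot while still catching gross errors.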
>> > While you didn't say what kind of data you're trying to identify
>> > outliers in, if it is time series data the function Return.clean in
>> > PerformanceAnalytics may be useful.
>> >
>> > Regards,
>> >
>> >    - Brian
>> >
>> >
>> > --
>> > Brian G. Peterson
>> > http://braverock.com/brian/
>> > Ph: 773-459-4973
>> > IM: bgpbraverock
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>> >
>> >
>>
>> --
>> View this message in context:
>> http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p991062.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
> 
> 
> 
> -- 
> Kevin Wright
> 
> 
> 

-- 
View this message in context: http://n4.nabble.com/Identifying-outliers-in-non-normally-distributed-data-tp987921p1009958.html
Sent from the R help mailing list archive at Nabble.com.


