[R] Outlier statistics question

Wed Dec 1 05:19:26 CET 2010

- - - - - -

> From: rvaradhan at jhmi.edu
> To: gunter.berton at gene.com
> Date: Tue, 30 Nov 2010 22:53:43 -0500
> CC: r-help at r-project.org; jahan.mohiuddin at gmail.com
> Subject: Re: [R] Outlier statistics question
>
>
> It is, perhaps, more apt to call tests of outliers "tests of outright liars".
>
> "Lies, damned lies, and tests of outliers"

I was going to jump on this but thought I should wait, since I don't
have any idea what it is :) Here are several offhand comments for
discussion, based on a quick read of the Wikipedia page for this test.

First, of course, if you are worried about outliers then try some
non-parametric tests (wow, I can't believe I really wrote that LOL, as
I usually dismiss these as concessions to a lack of computing power).

Second, you could try some scoring or partitioning of the data if you
really think some points are better than others, but this seems to be
passing judgement retrospectively: some points must be bad because the
others just have to be right, since there are so many of them. I would
not dismiss this approach, though. If you have, say, an obviously
bimodal or otherwise structured distribution, that could be quite
informative, and you may reasonably ask what happens when you use only
the points associated with one "mode."

Third, the real problem here is probably some logical fallacy that I
can't recall right now. You are using the data to clean the data and
inflicting an assumption on it, so you'd at least want an analysis of
what that assumption can do. For example, take the absurd case of
repeatedly calculating the same estimate of a difference in means
between samples. Your p-value won't legitimately decrease if you keep
doing the same calculation on the same data (the first i in iid): you
can't expect the noise to decrease as 1/sqrt(N) with N becoming m*N
for m passes on the same data.
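
To make the "m passes on the same data" point concrete, here is a rough
sketch. The simulated samples, the seed, and m = 4 are arbitrary choices
made only for illustration. Stacking m copies of the same samples leaves
the estimated difference in means unchanged, but a naive t-test treats
the copies as independent and reports a smaller p-value that means
nothing.

```r
# Illustration only: simulated data, arbitrary seed and number of passes.
set.seed(1)
x <- rnorm(10, mean = 0)
y <- rnorm(10, mean = 0.5)
m <- 4   # hypothetical number of passes over the same data

# One honest pass over the data
p_once   <- t.test(x, y)$p.value
est_once <- mean(y) - mean(x)

# "m passes": the same points repeated m times, wrongly treated as independent
p_naive   <- t.test(rep(x, m), rep(y, m))$p.value
est_naive <- mean(rep(y, m)) - mean(rep(x, m))

cat("estimate, one pass:", est_once, " m passes:", est_naive, "\n")
cat("p-value,  one pass:", p_once,   " m passes:", p_naive,   "\n")
```

The estimate is identical in both cases; only the claimed precision
changes, and that change comes entirely from violating independence.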

The right analysis depends on the question and a lot of
other things but offhand I'm not sure when outlier rejection
of this type would be good as the only analysis. It is always 
interesting to see what happens when you discard points but
that is more a sensitivity issue than anything.
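
As a rough sketch of that kind of sensitivity check, using the six
values from the original post: compare a few summaries with and without
the largest point, and note that rank-based quantities are unchanged by
a monotone transform such as log10. The helper name summ_stats is made
up for this example, and none of this is a recommended analysis.

```r
fgf2p50 <- c(1.563, 2.161, 2.529, 2.726, 2.442, 5.047)  # data from the original post

# Hypothetical helper: a few location and scale summaries
summ_stats <- function(x) {
  c(mean = mean(x), median = median(x), sd = sd(x), mad = mad(x))
}

all_pts <- summ_stats(fgf2p50)
trimmed <- summ_stats(fgf2p50[-which.max(fgf2p50)])  # drop the largest point
print(rbind(all = all_pts, without_max = trimmed))

# Ranks are the same on the raw and the log10 scale, so rank-based
# (non-parametric) methods do not depend on the choice of transform
same_ranks <- identical(rank(fgf2p50), rank(log10(fgf2p50)))
print(same_ranks)
```

The mean and sd move much more than the median and MAD when the largest
point is dropped, which is the sensitivity question in miniature.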

>
> Ravi.
> ____________________________________________________________________
>
> Ravi Varadhan, Ph.D.
> Assistant Professor,
> Division of Geriatric Medicine and Gerontology
> School of Medicine
> Johns Hopkins University
>
> Ph. (410) 502-2619
> email: rvaradhan at jhmi.edu
>
>
> ----- Original Message -----
> From: Bert Gunter 
> Date: Tuesday, November 30, 2010 4:22 pm
> Subject: Re: [R] Outlier statistics question
> To: Jahan 
> Cc: r-help at r-project.org
>
>
> > (Apologies to all. I am weak and could not resist)
> >
> > On Tue, Nov 30, 2010 at 12:15 PM, Jahan wrote:
> > > I have a statistical question.
> > > The data sets I am working with are right-skewed so I have been
> > > plotting the log transformations of my data.  I am using a Grubbs Test
> > > to detect outliers in the data, but I get different outcomes depending
> > > on whether I run the test on the original data or the log(data).
> >
> > Of course!
> >
> > > Here
> > > is one of the problematic sets:
> > >
> > > fgf2p50 <- c(1.563, 2.161, 2.529, 2.726, 2.442, 5.047)
> > > stripchart(fgf2p50, vertical = TRUE)
> > > # This next step requires the 'outliers' package
> > > library(outliers)
> > > grubbs.test(fgf2p50)
> > > # The output says p < 0.05, so 5.047 is flagged as an outlier
> > > # Next, I run the test on the log10 of the data (values rounded)
> > > log_fgf2p50 <- c(0.194, 0.335, 0.403, 0.436, 0.388, 0.703)
> > > grubbs.test(log_fgf2p50)
> > > # The output is p > 0.05, so the test does not flag an outlier
> > >
> > > The question is, which outlier test do I accept?
> >
> > Neither.
> >
> > (IMHO) Outlier tests are one of statistics' _bad ideas_. The Grubbs
> > test is ca. 1970. There are many better approaches these days --
> > consult your local statistician -- all of which will depend on
> > answering the question, "What is the question you are trying to
> > answer?"
> >
> > -- Bert
> >
> > >
> > > ______________________________________________
> > > R-help at r-project.org mailing list
> > >
> > > PLEASE do read the posting guide
> > > and provide commented, minimal, self-contained, reproducible code.
> > >
> >
> >
> >
> > --
> > Bert Gunter
> > Genentech Nonclinical Biostatistics
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.