[R] Finding suspicious data points?

Carl Witthoft carl at witthoft.com
Thu Jan 26 15:00:13 CET 2012


According to the help file for 'outlier' (quoting):

x a data sample, vector in most cases. If argument is a dataframe, then
outlier is calculated for each column by sapply. The same behavior is
applied by apply when the matrix is given. (endquote)

It looks like you could create an "upper triangular" matrix such as

 1   1   1
NA   2   2
NA  NA   3

and see the results. However, since 'outlier' just returns the single
value furthest from the mean, this doesn't really provide much
information. If I were to write a function to find "genuine" outliers,
I would do something like

x[abs(x - mean(x)) > 3*sd(x)]

thus returning all values more than 3 sigma from the mean.
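
To make that concrete, here is a minimal sketch of the 3-sigma idea
wrapped in a function and run on the example from the question. The
function name find_suspects and its arguments k and na.rm are made up
for illustration; they are not part of the outliers package.

```r
# Hypothetical helper (name and arguments are illustrative only):
# return every value lying more than k standard deviations from the mean.
find_suspects <- function(x, k = 3, na.rm = TRUE) {
  if (na.rm) x <- x[!is.na(x)]        # drop NAs so mean() and sd() are defined
  x[abs(x - mean(x)) > k * sd(x)]     # keep only the values beyond k sigma
}

set.seed(42)  # arbitrary seed, only so the sketch is reproducible
x <- c(1:3, rnorm(1000, mean = 100000, sd = 300))
suspects <- find_suspects(x)
print(suspects)
```

On this data the three small values sit far outside the 3-sigma band,
so all of them come back rather than just one. Bear in mind that the
mean and sd are themselves pulled around by the suspicious points, so
with many or extreme outliers the threshold may need adjusting.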



<quote>

I would like to find data points that at least should be checked one 
more time before I process them further.
I've had a look at the outliers package for this, and the outlier
function in that package, but it appears to only return one value.

An example:

 > outlier(c(1:3,rnorm(1000,mean=100000,sd=300)))
[1] 1

I think at least 1,2 and 3 should be checked in this case.

Any ideas on how to achieve this in R?

Actually, the real data I will be investigating consist of vector norms
and angles (in an attempt to identify either very short or very long
vectors, or vectors pointing at an odd angle for the category to which
they have been assigned), so a 2D method would be even better.

I would much appreciate any help I can get on this,


-- 

Sent from my Cray XK6
"Pendeo-navem mei anguillae plena est."


