[R] Outliers Help

Fri Aug 30 14:43:23 CEST 2013

Hi,

Also, the same matrix can be built without the loops:

dd1<-matrix(cbind(D[,1],(D[-c(1:2)]/D[,2]>4)*1),dimnames=NULL,ncol=7)
identical(dd,dd1)
#[1] TRUE
A.K.
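
For anyone without the full data, here is a small self-contained sketch of the same vectorised idea. It uses rows 1, 2 and 6 of the posted table, retyped by hand; the names `years` and `flags` are made up for this illustration and are not in the original code.

```r
# Rows 1, 2 and 6 of the posted data, rebuilt by hand
D <- data.frame(
  X.      = c(1108, 1479, 3872),
  media   = c(22, 110, 103.25),
  IE.2005 = c(60, NA, 288.6),
  IE.2006 = c(39, NA, 113),
  IE.2007 = c(4, 53, 116),
  IE.2008 = c(8, 1166, 90),
  IE.2009 = c(16, 344.8, 94),
  IE.2010 = c(5, 110, 12036.6)
)

years <- as.matrix(D[, -(1:2)])       # yearly values only (hypothetical name)
flags <- (years / D$media > 4) * 1    # 1 = above 4 x row mean, 0 = not, NA stays NA
dd1   <- unname(cbind(D$X., flags))   # identifier first, then the flags
dd1
```

Dividing a matrix by a vector whose length equals the number of rows recycles the vector down each column, so every value is compared with the mean of its own row.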

----- Original Message -----
From: Jose Iparraguirre <Jose.Iparraguirre at ageuk.org.uk>
To: Mª Teresa Martinez Soriano <teresamarso at hotmail.com>; "r-help at r-project.org" <r-help at r-project.org>
Cc: 
Sent: Friday, August 30, 2013 5:39 AM
Subject: Re: [R] Outliers Help

Hi Ma Teresa,

Sorry, but I can't understand what you're trying to achieve.
On a statistical note, I'd tend to think more in terms of medians and would think hard before replacing any outliers, but that's another matter.
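
As a rough illustration of the median-based view, here is a sketch that keeps the "more than 4 times" threshold but compares each value with the row median instead of the row mean. The threshold and the names `row_med` and `flag_med` are assumptions for the example only; rows 6 and 13 are retyped from the posted data.

```r
# Yearly values of rows 6 and 13 of the posted data
vals <- rbind(
  c(288.6, 113, 116, 90, 94, 12036.6),
  c(30, 61, 373.6, 42, 19, 220.8)
)

# Row medians are barely affected by a single extreme value
row_med  <- apply(vals, 1, median, na.rm = TRUE)

# TRUE where a value exceeds 4 times the median of its own row
flag_med <- vals / row_med > 4
flag_med
```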

Here I created the matrix dd with the identifier column (X.) of D in its first column, and then set each remaining cell to 1 whenever the value of D for that cell was greater than 4 times the mean for that row (your definition of 'outlier') and to 0 otherwise. Cells where D is NA stay NA.
> dd <- rep(0,15*7)
> dim(dd) <- c(15,7)
> dd[,1]<- D[,1]
> for (i in 1:15){
+ for (j in 2:7){
+ dd[i,j] <- D[i,(j+1)]/D[i,2]>4
+ }
+ }
> dd
       [,1] [,2] [,3] [,4] [,5] [,6] [,7]
 [1,]  1108    0    0    0    0    0    0
 [2,]  1479   NA   NA    0    1    0    0
 [3,]  1591    0    0    0    0    0    0
 [4,]  3408    0    0    0    0    0    0
 [5,]  3423   NA   NA   NA    1    0    0
 [6,]  3872    0    0    0    0    0    1
 [7,]  5823    0    0    0    0    0    0
 [8,]  6051   NA   NA   NA   NA    0    0
 [9,]  8099    0    0    0    0    0    0
[10,]  8100   NA   NA   NA   NA    0    0
[11,] 10640    1    1    1    0    0    0
[12,] 12600    0    0    0    0    0    0
[13,] 14680    0    0    1    0    0    1
[14,] 14698    0    0    0    0    0    0
[15,] 17143    0    0    0    0    0    0

So, you encounter four situations:

a) as in row 2, you have an outlier preceded and followed by values
b) as in row 5, you have an outlier preceded by an NA
c) as in row 6, there is an outlier in the last column
d) as in row 11, there are two or more consecutive outliers

The replacement rule you described only covers situation a) (i.e. replacing the outlier by the mean of the preceding and subsequent values) and situation b) (replacing it by the mean for the row).
But what of situations c) and d)?

And, because this is just a chunk of a bigger dataset, you can also get an outlier in the first column, followed by a number. Your rule does not account for this situation either.
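
One possible way to make the rule cover every situation is sketched below. It rests on an assumption that is not in the original request: whenever either neighbour is missing, is NA, or is itself flagged as an outlier, the value falls back to the row mean. The helper `replace_outliers` and its arguments are hypothetical names; the test rows are rows 2, 5, 6 and 11 of the posted data.

```r
# Hypothetical helper: replace flagged values in a single row.
#   x  : numeric vector of yearly values for the row
#   mu : row mean (the 'media' column)
#   k  : threshold multiplier (4 in this thread)
replace_outliers <- function(x, mu, k = 4) {
  flag <- !is.na(x) & x / mu > k      # TRUE where the value counts as an outlier
  out  <- x
  n    <- length(x)
  for (j in which(flag)) {
    # A neighbour is usable only if it exists, is not NA and is not flagged
    left_ok  <- j > 1 && !is.na(x[j - 1]) && !flag[j - 1]
    right_ok <- j < n && !is.na(x[j + 1]) && !flag[j + 1]
    # Assumption: fall back to the row mean when a neighbour is unusable
    out[j] <- if (left_ok && right_ok) (x[j - 1] + x[j + 1]) / 2 else mu
  }
  out
}

# Rows 2, 5, 6 and 11 of the posted data (situations a to d)
media <- c(110, 9, 103.25, 67.33333)
vals <- rbind(
  c(NA, NA, 53, 1166, 344.8, 110),
  c(NA, NA, NA, 410.8, 7, 11),
  c(288.6, 113, 116, 90, 94, 12036.6),
  c(1256.6, 1152, 664.2, 74, 77, 51)
)
fixed <- t(sapply(seq_len(nrow(vals)),
                  function(i) replace_outliers(vals[i, ], media[i])))
fixed
```

Flags are computed once from the original values, so a replacement made earlier in the row never feeds into a later one. Whether the row mean is a sensible fallback for c) and d) is a separate statistical question.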

Hope this helps,

José

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Mª Teresa Martinez Soriano
Sent: 30 August 2013 09:13
To: r-help at r-project.org
Subject: [R] Outliers Help

This is a part of my data set:

> D[1:15,c(1,5:10)]

       X.      media IE.2005 IE.2006 IE.2007 IE.2008 IE.2009 IE.2010
1   1108   22.00000    60.0      39     4.0     8.0    16.0     5.0
2   1479  110.00000      NA      NA    53.0  1166.0   344.8   110.0
3   1591   86.60000   247.0      87    95.0    94.0    81.0    76.0
4   3408  807.00000   302.0     322   621.0  1071.0  1301.0  1225.0
5   3423    9.00000      NA      NA      NA   410.8     7.0    11.0
6   3872  103.25000   288.6     113   116.0    90.0    94.0 12036.6
7   5823   73.00000   117.0      70    80.0    74.0    69.0    72.0
8   6051   73.00000      NA      NA      NA      NA    60.0    86.0
9   8099  125.16667   196.0     161   150.0    94.0    72.0    78.0
10  8100   70.00000      NA      NA      NA      NA    48.0    92.0
11 10640   67.33333  1256.6    1152   664.2    74.0    77.0    51.0
12 12600 2417.00000  1960.0    2383  2453.0  2506.0  2758.0  2442.0
13 14680   38.00000    30.0      61   373.6    42.0    19.0   220.8
14 14698  698.16667   553.0     664   847.0   800.0   679.0   646.0
15 17143  392.16667   323.0     322   434.0   383.0   459.0   432.0

I have done multiple imputation and now I have some outliers, which I would like to replace with the mean of the row or, if possible, with the mean of the previous and the next value in that row. For instance:

Value 1 - Outlier - Value 2

I would like to replace the outlier with the mean of value 1 and value 2. The problem is that these values could be NA (still NA after the imputation because they don't exist); in that case I would like to replace the outlier with the mean of the row.

Another problem I have is detecting outlier values correctly. For instance, in this example data set, for X.=3872 and IE.2010 we can see an outlier. I have thought of comparing the values with the row mean (column media).

I have tried this code:

D <- datos[, c(1, 16:24)]
m <- as.matrix(D)
for (i in 1:nrow(D)) {
  # I would change this range in the new data set, because I will have more years than 2010
  for (j in 5:(ncol(D) - 1)) {
    # Here I would like to find the values that are much bigger than the mean of this row,
    # and replace them by the mean of the previous and the next values of the same row.
    if (!is.na(m[i, j]) && !is.na(m[i, j + 1]) && !is.na(m[i, j - 1]) &&
        !is.na(m[i, 2]) && (m[i, j] / m[i, 2]) > 4) {
      m[m[i, j]] <- (m[i, j - 1] + m[i, j + 1]) / 2
    }
  }
}
D <- as.data.frame(m)

But I get the same data.frame that I had previously; nothing changes.
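
A possible explanation, assuming the aim was to overwrite the cell in row i, column j: the assignment target in the loop is `m[m[i, j]]`, which uses the value stored in the cell as a single (linear) index, so the write goes to some other position instead of to `m[i, j]`. The toy matrix below uses made-up numbers only to show the difference.

```r
# Toy matrix with made-up values, filled column by column
m <- matrix(c(10, 20, 30, 2, 50, 60), nrow = 2)
i <- 2
j <- 2

m[i, j]       # the cell in row 2, column 2
m[m[i, j]]    # the element at linear position m[i, j]: a different cell

m1 <- m
m1[m1[i, j]] <- 0   # overwrites the element at that linear position
m2 <- m
m2[i, j] <- 0       # overwrites row i, column j
```

In `m1` the cell in row 2, column 2 is left untouched, while in `m2` it is replaced.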

Any ideas are welcome.
Thanks a lot, Teresa

______________________________________________
R-help at r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
