[R] remove extreme values or winsorize – loop - dataframe

Tue Aug 3 00:42:27 CEST 2010

This is just following up with the example data you sent.  This will
create a list 'result' that will have the subset of data between the
10% & 90%-tiles of the data:

> #My reproducible example:
> firm<-sort(rep(1:1000,10),decreasing=F)
> year<-rep(1998:2007,1000)
> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
+ rep(10,10)),1000)
> X1<-rnorm(10000)
> data<-data.frame(firm, industry,year,X1)
> # split the data by industry/year
> d.s <- split(data, list(data$industry, data$year), drop=TRUE)
> result <- lapply(d.s, function(.id){
+    # get 10/90% values
+    .limit <- quantile(.id$X1, prob=c(.1, .9))
+    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
+ })
> str(result)
List of 100
 $ 1.1998 :'data.frame':        800 obs. of  4 variables:
  ..$ firm    : int [1:800] 1 21 31 41 51 61 71 81 91 111 ...
  ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998
1998 1998 ...
  ..$ X1      : num [1:800] 0.659 -0.105 -0.617 0.342 -1.077 ...
 $ 2.1998 :'data.frame':        800 obs. of  4 variables:
  ..$ firm    : int [1:800] 2 32 42 52 62 72 102 112 132 162 ...
  ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ...
  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998
1998 1998 ...
  ..$ X1      : num [1:800] -1.1044 -0.0666 -0.9184 0.3469 -0.2348 ...

You can see that the 'name' of the list element is the industry.year
combination; this can also be seen in the data.

On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo <cecilia.carmo at ua.pt> wrote:
> Thank you for your help but I don't understand how can I have a dataframe
> with the columns: firm, year, industry, X1 and X2. Could you help me
> (again)?
>
>
> Cecília Carmo
>
>
> Em Sat, 31 Jul 2010 22:10:38 -0400
>  jim holtman <jholtman at gmail.com> escreveu:
>>
>> This will split the data by industry & year and then return the values
>> that include the 80%-tile (>=10% & <= 90%)
>>
>> # split the data by industry/year
>> d.s <- split(data, list(data$industry, data$year), drop=TRUE)
>> result <- lapply(d.s, function(.id){
>>   # get 10/90% values
>>   .limit <- quantile(.id$X1, prob=c(.1, .9))
>>   subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
>> })
>>
>> This returns a list of 100 elements for each combination.
>>
>> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo <cecilia.carmo at ua.pt>
>> wrote:
>>>
>>> Hi everyone!
>>>
>>> #I need a loop or a function that creates a X2 variable that is X1
>>> without
>>> the extreme values (or X1 winsorized) by industry and year.
>>>
>>> #My reproducible example:
>>> firm<-sort(rep(1:1000,10),decreasing=F)
>>> year<-rep(1998:2007,1000)
>>>
>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>>> rep(10,10)),1000)
>>> X1<-rnorm(10000)
>>> data<-data.frame(firm, industry,year,X1)
>>> data
>>>
>>> The way I’m doing this is very hard. I split my sample by industry and
>>> year,
>>> for each industry and year I calculate the 10% and 90% quantiles, then I
>>> create a X2 variable like this:
>>>
>>> industry1<-subset(data,data$industry==1)
>>>
>>> ind1year1999<-subset(industry1,industry1$year==1999)
>>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
>>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
>>>
>>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>
>>> ind1year2000<-subset(industry1,industry1$year==2000)
>>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
>>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
>>>
>>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>
>>> I repeat this for all years and industries, and then I merge/bind all
>>> again
>>> to have a new dataframe with all the columns of the dataframe «data» plus
>>> X2.
>>>
>>> Could anyone help me doing this in a easier way?
>>>
>>> Thanks
>>> Cecília Carmo
>>> Universidade de Aveiro - Portugal
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
>
>

-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?