[R] remove extreme values or winsorize – loop - dataframe

Cecilia Carmo cecilia.carmo at ua.pt
Tue Aug 3 00:54:12 CEST 2010


Thank you again, but I think I need to do some homework 
about the split function, because I'm not understanding it 
very well.
Besides, I think I still have a problem. I also need X2 = 
X1 winsorized: X2 is equal to X1 between 10%-90%, and is 
equal to the 10% value when < 10% and equal to the 90% 
value when it is >.
Could you help me?

Thank you
Cecília

Em Mon, 2 Aug 2010 18:42:27 -0400
  jim holtman <jholtman at gmail.com> escreveu:
> This is just following up with the example data you 
>sent.  This will
> create a list 'result' that will have the subset of data 
>between the
> 10% & 90%-tiles of the data:
> 
>> #My reproducible example:
>> firm<-sort(rep(1:1000,10),decreasing=F)
>> year<-rep(1998:2007,1000)
>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
> + rep(10,10)),1000)
>> X1<-rnorm(10000)
>> data<-data.frame(firm, industry,year,X1)
>> # split the data by industry/year
>> d.s <- split(data, list(data$industry, data$year), 
>>drop=TRUE)
>> result <- lapply(d.s, function(.id){
> +    # get 10/90% values
> +    .limit <- quantile(.id$X1, prob=c(.1, .9))
> +    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
> + })
>> str(result)
> List of 100
> $ 1.1998 :'data.frame':        800 obs. of  4 variables:
>  ..$ firm    : int [1:800] 1 21 31 41 51 61 71 81 91 111 
>...
>  ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ...
>  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 
>1998 1998
> 1998 1998 ...
>  ..$ X1      : num [1:800] 0.659 -0.105 -0.617 0.342 
>-1.077 ...
> $ 2.1998 :'data.frame':        800 obs. of  4 variables:
>  ..$ firm    : int [1:800] 2 32 42 52 62 72 102 112 132 
>162 ...
>  ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ...
>  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 
>1998 1998
> 1998 1998 ...
>  ..$ X1      : num [1:800] -1.1044 -0.0666 -0.9184 
>0.3469 -0.2348 ...
> 
> You can see that the 'name' of the list element is the 
>industry.year
> combination; this can also be seen in the data.
> 
> On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo 
><cecilia.carmo at ua.pt> wrote:
>> Thank you for your help but I don't understand how can I 
>>have a dataframe
>> with the columns: firm, year, industry, X1 and X2. Could 
>>you help me
>> (again)?
>>
>>
>> Cecília Carmo
>>
>>
>> Em Sat, 31 Jul 2010 22:10:38 -0400
>>  jim holtman <jholtman at gmail.com> escreveu:
>>>
>>> This will split the data by industry & year and then 
>>>return the values
>>> that include the 80%-tile (>=10% & <= 90%)
>>>
>>> # split the data by industry/year
>>> d.s <- split(data, list(data$industry, data$year), 
>>>drop=TRUE)
>>> result <- lapply(d.s, function(.id){
>>>   # get 10/90% values
>>>   .limit <- quantile(.id$X1, prob=c(.1, .9))
>>>   subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
>>> })
>>>
>>> This returns a list of 100 elements for each 
>>>combination.
>>>
>>> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo 
>>><cecilia.carmo at ua.pt>
>>> wrote:
>>>>
>>>> Hi everyone!
>>>>
>>>> #I need a loop or a function that creates a X2 variable 
>>>>that is X1
>>>> without
>>>> the extreme values (or X1 winsorized) by industry and 
>>>>year.
>>>>
>>>> #My reproducible example:
>>>> firm<-sort(rep(1:1000,10),decreasing=F)
>>>> year<-rep(1998:2007,1000)
>>>>
>>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>>>> rep(10,10)),1000)
>>>> X1<-rnorm(10000)
>>>> data<-data.frame(firm, industry,year,X1)
>>>> data
>>>>
>>>> The way I’m doing this is very hard. I split my sample 
>>>>by industry and
>>>> year,
>>>> for each industry and year I calculate the 10% and 90% 
>>>>quantiles, then I
>>>> create a X2 variable like this:
>>>>
>>>> industry1<-subset(data,data$industry==1)
>>>>
>>>> ind1year1999<-subset(industry1,industry1$year==1999)
>>>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
>>>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
>>>>
>>>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>
>>>> ind1year2000<-subset(industry1,industry1$year==2000)
>>>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
>>>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
>>>>
>>>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>
>>>> I repeat this for all years and industries, and then I 
>>>>merge/bind all
>>>> again
>>>> to have a new dataframe with all the columns of the 
>>>>dataframe «data» plus
>>>> X2.
>>>>
>>>> Could anyone help me doing this in a easier way?
>>>>
>>>> Thanks
>>>> Cecília Carmo
>>>> Universidade de Aveiro - Portugal
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide
>>>> http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, 
>>>>reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>
>>
>>
> 
> 
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?



More information about the R-help mailing list