[R] remove extreme values or winsorize – loop - dataframe

Cecilia Carmo cecilia.carmo at ua.pt
Tue Aug 3 00:20:17 CEST 2010


Thank you for your help but I don't understand how can I 
have a dataframe with the columns: firm, year, industry, 
X1 and X2. Could you help me (again)?


Cecília Carmo


Em Sat, 31 Jul 2010 22:10:38 -0400
  jim holtman <jholtman at gmail.com> escreveu:
> This will split the data by industry & year and then 
>return the values 
> that include the 80%-tile (>=10% & <= 90%)
> 
> # split the data by industry/year
> d.s <- split(data, list(data$industry, data$year), 
>drop=TRUE)
> result <- lapply(d.s, function(.id){
>    # get 10/90% values
>    .limit <- quantile(.id$X1, prob=c(.1, .9))
>    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
> })
> 
> This returns a list of 100 elements for each 
>combination.
> 
> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo 
><cecilia.carmo at ua.pt> wrote:
>> Hi everyone!
>>
>> #I need a loop or a function that creates a X2 variable 
>>that is X1 without
>> the extreme values (or X1 winsorized) by industry and 
>>year.
>>
>> #My reproducible example:
>> firm<-sort(rep(1:1000,10),decreasing=F)
>> year<-rep(1998:2007,1000)
>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>> rep(10,10)),1000)
>> X1<-rnorm(10000)
>> data<-data.frame(firm, industry,year,X1)
>> data
>>
>> The way I’m doing this is very hard. I split my sample 
>>by industry and year,
>> for each industry and year I calculate the 10% and 90% 
>>quantiles, then I
>> create a X2 variable like this:
>>
>> industry1<-subset(data,data$industry==1)
>>
>> ind1year1999<-subset(industry1,industry1$year==1999)
>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>
>> ind1year2000<-subset(industry1,industry1$year==2000)
>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>
>> I repeat this for all years and industries, and then I 
>>merge/bind all again
>> to have a new dataframe with all the columns of the 
>>dataframe «data» plus
>> X2.
>>
>> Could anyone help me doing this in a easier way?
>>
>> Thanks
>> Cecília Carmo
>> Universidade de Aveiro - Portugal
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>>http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, 
>>reproducible code.
>>
> 
> 
> 
> -- 
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
> 
> What is the problem that you are trying to solve?



More information about the R-help mailing list