[R] remove extreme values or winsorize – loop - dataframe

jim holtman jholtman at gmail.com
Tue Aug 3 01:13:25 CEST 2010


I had to look up winsorized; this should do it:

> #My reproducible example:
> firm<-sort(rep(1:1000,10),decreasing=F)
> year<-rep(1998:2007,1000)
> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
+ rep(10,10)),1000)
> X1<-rnorm(10000)
> data<-data.frame(firm, industry,year,X1)
> # split the data by industry/year
> d.s <- split(data, list(data$industry, data$year), drop=TRUE)
> result <- lapply(d.s, function(.id){
+    # get 10/90% values
+    .limit <- quantile(.id$X1, prob=c(.1, .9))
+    .id$X1[.id$X1 < .limit[1]] <- .limit[1]
+    .id$X1[.id$X1 > .limit[2]] <- .limit[2]
+    .id
+ })
>
>
> str(result)
List of 100
 $ 1.1998 :'data.frame':        1000 obs. of  4 variables:
  ..$ firm    : int [1:1000] 1 11 21 31 41 51 61 71 81 91 ...
  ..$ industry: num [1:1000] 1 1 1 1 1 1 1 1 1 1 ...
  ..$ year    : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998
1998 1998 ...
  ..$ X1      : num [1:1000] -0.9807 1.0728 1.1759 0.0238 0.976 ...
 $ 2.1998 :'data.frame':        1000 obs. of  4 variables:
  ..$ firm    : int [1:1000] 2 12 22 32 42 52 62 72 82 92 ...
  ..$ industry: num [1:1000] 2 2 2 2 2 2 2 2 2 2 ...
  ..$ year    : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998
1998 1998 ...
  ..$ X1      : num [1:1000] 0.874 -0.213 -0.971 1.062 -0.099 ...
 $ 3.1998 :'data.frame':        1000 obs. of  4 variables:
  ..$ firm    : int [1:1000] 3 13 23 33 43 53 63 73 83 93 ...
  ..$ industry: num [1:1000] 3 3 3 3 3 3 3 3 3 3 ...
  ..$ year    : int [1:1000] 1998 1998 1998 1998 1998 1998 1998 1998
1998 1998 ...
  ..$ X1      : num [1:1000] -0.3051 -0.0775 -1.2062 -0.8826 0.6671 ...


On Mon, Aug 2, 2010 at 6:54 PM, Cecilia Carmo <cecilia.carmo at ua.pt> wrote:
> Thank you again, but I think I need to do some homework about the split
> function, because I'm not understanding it very well.
> Besides, I think I still have a problem. I also need X2 = X1 winsorized: X2
> is equal to X1 between 10%-90%, and is equal to the 10% value when < 10% and
> equal to the 90% value when it is >.
> Could you help me?
>
> Thank you
> Cecília
>
> Em Mon, 2 Aug 2010 18:42:27 -0400
>  jim holtman <jholtman at gmail.com> escreveu:
>>
>> This is just following up with the example data you sent.  This will
>> create a list 'result' that will have the subset of data between the
>> 10% & 90%-tiles of the data:
>>
>>> #My reproducible example:
>>> firm<-sort(rep(1:1000,10),decreasing=F)
>>> year<-rep(1998:2007,1000)
>>>
>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>>
>> + rep(10,10)),1000)
>>>
>>> X1<-rnorm(10000)
>>> data<-data.frame(firm, industry,year,X1)
>>> # split the data by industry/year
>>> d.s <- split(data, list(data$industry, data$year), drop=TRUE)
>>> result <- lapply(d.s, function(.id){
>>
>> +    # get 10/90% values
>> +    .limit <- quantile(.id$X1, prob=c(.1, .9))
>> +    subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
>> + })
>>>
>>> str(result)
>>
>> List of 100
>> $ 1.1998 :'data.frame':        800 obs. of  4 variables:
>>  ..$ firm    : int [1:800] 1 21 31 41 51 61 71 81 91 111 ...
>>  ..$ industry: num [1:800] 1 1 1 1 1 1 1 1 1 1 ...
>>  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998
>> 1998 1998 ...
>>  ..$ X1      : num [1:800] 0.659 -0.105 -0.617 0.342 -1.077 ...
>> $ 2.1998 :'data.frame':        800 obs. of  4 variables:
>>  ..$ firm    : int [1:800] 2 32 42 52 62 72 102 112 132 162 ...
>>  ..$ industry: num [1:800] 2 2 2 2 2 2 2 2 2 2 ...
>>  ..$ year    : int [1:800] 1998 1998 1998 1998 1998 1998 1998 1998
>> 1998 1998 ...
>>  ..$ X1      : num [1:800] -1.1044 -0.0666 -0.9184 0.3469 -0.2348 ...
>>
>> You can see that the 'name' of the list element is the industry.year
>> combination; this can also be seen in the data.
>>
>> On Mon, Aug 2, 2010 at 6:20 PM, Cecilia Carmo <cecilia.carmo at ua.pt> wrote:
>>>
>>> Thank you for your help but I don't understand how can I have a dataframe
>>> with the columns: firm, year, industry, X1 and X2. Could you help me
>>> (again)?
>>>
>>>
>>> Cecília Carmo
>>>
>>>
>>> Em Sat, 31 Jul 2010 22:10:38 -0400
>>>  jim holtman <jholtman at gmail.com> escreveu:
>>>>
>>>> This will split the data by industry & year and then return the values
>>>> that include the 80%-tile (>=10% & <= 90%)
>>>>
>>>> # split the data by industry/year
>>>> d.s <- split(data, list(data$industry, data$year), drop=TRUE)
>>>> result <- lapply(d.s, function(.id){
>>>>   # get 10/90% values
>>>>   .limit <- quantile(.id$X1, prob=c(.1, .9))
>>>>   subset(.id, X1 >= .limit[1] & X1 <= .limit[2])
>>>> })
>>>>
>>>> This returns a list of 100 elements for each combination.
>>>>
>>>> On Sat, Jul 31, 2010 at 9:39 PM, Cecilia Carmo <cecilia.carmo at ua.pt>
>>>> wrote:
>>>>>
>>>>> Hi everyone!
>>>>>
>>>>> #I need a loop or a function that creates a X2 variable that is X1
>>>>> without
>>>>> the extreme values (or X1 winsorized) by industry and year.
>>>>>
>>>>> #My reproducible example:
>>>>> firm<-sort(rep(1:1000,10),decreasing=F)
>>>>> year<-rep(1998:2007,1000)
>>>>>
>>>>>
>>>>> industry<-rep(c(rep(1,10),rep(2,10),rep(3,10),rep(4,10),rep(5,10),rep(6,10),rep(7,10),rep(8,10),rep(9,10),
>>>>> rep(10,10)),1000)
>>>>> X1<-rnorm(10000)
>>>>> data<-data.frame(firm, industry,year,X1)
>>>>> data
>>>>>
>>>>> The way I’m doing this is very hard. I split my sample by industry and
>>>>> year,
>>>>> for each industry and year I calculate the 10% and 90% quantiles, then
>>>>> I
>>>>> create a X2 variable like this:
>>>>>
>>>>> industry1<-subset(data,data$industry==1)
>>>>>
>>>>> ind1year1999<-subset(industry1,industry1$year==1999)
>>>>> q1<-quantile(ind1year1999$X1,probs=0.1,na.rm=TRUE)
>>>>> q99<-quantile(ind1year1999$X1,probs=0.90,na.rm=TRUE)
>>>>>
>>>>>
>>>>> ind1year1999winsorized<-transform(ind1year1999,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>>
>>>>> ind1year2000<-subset(industry1,industry1$year==2000)
>>>>> q1<-quantile(ind1year2000$X1,probs=0.1,na.rm=TRUE)
>>>>> q99<-quantile(ind1year2000$X1,probs=0.90,na.rm=TRUE)
>>>>>
>>>>>
>>>>> ind1year2000winsorized<-transform(ind1year2000,X2=ifelse(X1<q1,q1,ifelse(X1>q99,q99,X1)))
>>>>>
>>>>> I repeat this for all years and industries, and then I merge/bind all
>>>>> again
>>>>> to have a new dataframe with all the columns of the dataframe «data»
>>>>> plus
>>>>> X2.
>>>>>
>>>>> Could anyone help me doing this in a easier way?
>>>>>
>>>>> Thanks
>>>>> Cecília Carmo
>>>>> Universidade de Aveiro - Portugal
>>>>>
>>>>> ______________________________________________
>>>>> R-help at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>> PLEASE do read the posting guide
>>>>> http://www.R-project.org/posting-guide.html
>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jim Holtman
>>>> Cincinnati, OH
>>>> +1 513 646 9390
>>>>
>>>> What is the problem that you are trying to solve?
>>>
>>>
>>>
>>
>>
>>
>> --
>> Jim Holtman
>> Cincinnati, OH
>> +1 513 646 9390
>>
>> What is the problem that you are trying to solve?
>
>
>



-- 
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?



More information about the R-help mailing list