[R] matched samples, dataframe, panel data

Fri Jun 7 17:25:41 CEST 2013

Hi,
Maybe this helps:
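# Split into year x industry groups and drop the empty combinations
lst1 <- split(final3, list(final3$year, final3$industry))
lst2 <- lst1[sapply(lst1, nrow) > 0]
# For each dimension y in a group, collect the rows whose +/-10% band contains y
lst3 <- lapply(lst2, function(x) lapply(x$dimension, function(y)
  x[(y < (x$dimension + x$dimension*0.1)) & (y > (x$dimension - x$dimension*0.1)), ]))
# Keep only candidate sets with exactly two rows (dummy is not checked here)
lst4 <- lapply(lst3, function(x) x[sapply(x, nrow) == 2])
# Each pair is found twice (once per member): drop repeats, then empty groups
lst5 <- lapply(lst4, function(x) x[!duplicated(x)])
lst6 <- lst5[sapply(lst5, length) > 0]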

names(lst6)
# [1] "2000.20" "2001.20" "2002.20" "2003.20" "2004.20" "2001.30" "2002.30"
# [8] "2001.40" "2002.40" "2003.40" "2004.40"

lst6["2000.20"]
#$`2000.20`
#$`2000.20`[[1]]
#   firm year industry dummy dimension
#1     1 2000       20     0      2120
#21    5 2000       20     1      2189
#
#$`2000.20`[[2]]
#   firm year industry dummy dimension
#16    4 2000       20     0      3178
#31    7 2000       20     1      3245
#
#$`2000.20`[[3]]
#   firm year industry dummy dimension
#11    3 2000       20     1      4532
#6     2 2000       20     0      4890
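
The two-row filter above only counts rows; it does not look at dummy, so a pair of two dummy=1 firms (or two dummy=0 firms) would also pass. A minimal sketch of an extra check follows. The helper name is_mixed_pair and the toy list are hypothetical, and the second toy pair is made up purely to show a rejected case.

```r
# Hypothetical helper (not part of the code above): TRUE when a two-row
# data frame holds one dummy == 0 firm and one dummy == 1 firm.
is_mixed_pair <- function(p) nrow(p) == 2 && all(sort(p$dummy) == c(0, 1))

# Toy input shaped like lst6: a named list of lists of two-row data frames.
toy <- list(
  "2000.20" = list(
    data.frame(firm = c(1, 5), year = 2000, industry = 20,
               dummy = c(0, 1), dimension = c(2120, 2189)),
    # made-up pair with the same dummy on both rows, to show a rejection
    data.frame(firm = c(8, 9), year = 2000, industry = 20,
               dummy = c(1, 1), dimension = c(2765, 2800))
  )
)

# Keep only mixed pairs, then drop groups left without any pair.
toy_mixed <- lapply(toy, function(g) Filter(is_mixed_pair, g))
toy_mixed <- toy_mixed[lengths(toy_mixed) > 0]
```

Applied to the real result, the same two lines with lst6 in place of toy should do the filtering, assuming lst6 has the structure printed above.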
A.K.

________________________________
From: Cecilia Carmo <cecilia.carmo at ua.pt>
To: "r-help at r-project.org" <r-help at r-project.org> 
Cc: "smartpink111 at yahoo.com" <smartpink111 at yahoo.com> 
Sent: Friday, June 7, 2013 9:56 AM
Subject: Re: [R] matched samples, dataframe, panel data

Here is my problem again, better explained.

#I have a panel data set of thousands of firms, by year and industry, with
#one dummy variable that identifies one kind of firm (1 if the firm has an auditor; 0 if not)
#and another variable that represents the firm's dimension (total assets in thousands of euros).
#I need to create two separate samples with the same number of firms, where
#each firm in the first has a corresponding firm in the second with the same
#year, industry and dimension (the dimension doesn't need to be exactly the
#same; it can vary within an interval of +/- 10%, for example).

#My reproducible example
firm1<-sort(rep(1:10,5),decreasing=FALSE)
year1<-rep(2000:2004,10)
industry1<-rep(20,50)
dummy1<-c(0,0,1,1,0,0,1,1,0,1,1,1,0,0,0,0,0,0,1,1,1,1,0,0,0,0,0,0,0,0,1,0,1,0,1,1,1,1,1,0,0,1,0,0,0,0,0,1,1,1)
dimension1<-c(2120,345,2341,5678,10900,4890,2789,3412,9500,8765,4532,6593,12900,123,2345,3178,2678,6666,647,23789,
2189,4289,8543,637,23456,781,35489,2345,5754,8976,3245,1234,25,1200,2345,2765,389,23456,2367,3892,5438,37824,
23,2897,3456,7690,6022,3678,9431,2890)
data1<-data.frame(firm1,year1,industry1,dummy1,dimension1)
data1
colnames(data1)<-c("firm","year","industry","dummy","dimension")

firm2<-sort(rep(11:15,3),decreasing=FALSE)
year2<-rep(2001:2003,5)
industry2<-rep(30,15)
dummy2<-c(0,0,0,0,0,0,1,1,1,1,1,1,1,0,1)
dimension2<-c(12456,781,32489,2345,5754,8976,3245,2120,345,2341,5678,10900,12900,123,2345)
data2<-data.frame(firm2,year2,industry2,dummy2,dimension2)
data2
colnames(data2)<-c("firm","year","industry","dummy","dimension")
firm3<-sort(rep(16:20,4),decreasing=FALSE)
year3<-rep(2001:2004,5)
industry3<-rep(40,20)
dummy3<-c(0,0,1,0,1,0,1,0,1,1,1,1,1,0,0,0,0,1,0,0)
dimension3<-c(23456,1181,32489,2345,6754,8976,3245,1234,1288,1200,2345,2765,389,23456,2367,3892,6438,24824,
23,2897)
data3<-data.frame(firm3,year3,industry3,dummy3,dimension3)
data3
colnames(data3)<-c("firm","year","industry","dummy","dimension")

final1<-rbind(data1,data2)
final2<-rbind(final1,data3)
final2
final3<-final2[order(final2$year,final2$industry,final2$dimension),]
final3

#So my data, final3, looks like this:
   firm year industry dummy dimension
26    6 2000       20     0       781
1     1 2000       20     0      2120
21    5 2000       20     1      2189
36    8 2000       20     1      2765
16    4 2000       20     0      3178
31    7 2000       20     1      3245
11    3 2000       20     1      4532
6     2 2000       20     0      4890
41    9 2000       20     0      5438
46   10 2000       20     0      7690
2     1 2001       20     0       345
37    8 2001       20     1       389
32    7 2001       20     0      1234
17    4 2001       20     0      2678
7     2 2001       20     1      2789
22    5 2001       20     1      4289
47   10 2001       20     0      6022
12    3 2001       20     1      6593
27    6 2001       20     0     35489
42    9 2001       20     1     37824
60   14 2001       30     1      2341
54   12 2001       30     0      2345
57   13 2001       30     1      3245
51   11 2001       30     0     12456
63   15 2001       30     1     12900
78   19 2001       40     1       389
74   18 2001       40     1      1288
82   20 2001       40     0      6438
70   17 2001       40     1      6754
66   16 2001       40     0     23456
43    9 2002       20     0        23
33    7 2002       20     1        25
3     1 2002       20     1      2341
28    6 2002       20     0      2345
8     2 2002       20     1      3412
48   10 2002       20     1      3678
18    4 2002       20     0      6666
23    5 2002       20     0      8543
13    3 2002       20     0     12900
38    8 2002       20     1     23456
64   15 2002       30     0       123
52   11 2002       30     0       781
58   13 2002       30     1      2120
61   14 2002       30     1      5678
55   12 2002       30     0      5754
67   16 2002       40     0      1181
75   18 2002       40     1      1200
71   17 2002       40     0      8976
79   19 2002       40     0     23456
83   20 2002       40     1     24824
14    3 2003       20     0       123
24    5 2003       20     0       637
19    4 2003       20     1       647
34    7 2003       20     0      1200
39    8 2003       20     1      2367
44    9 2003       20     0      2897
4     1 2003       20     1      5678
29    6 2003       20     0      5754
49   10 2003       20     1      9431
9     2 2003       20     0      9500
59   13 2003       30     1       345
65   15 2003       30     1      2345
56   12 2003       30     0      8976
62   14 2003       30     1     10900
53   11 2003       30     0     32489
84   20 2003       40     0        23
76   18 2003       40     1      2345
80   19 2003       40     0      2367
72   17 2003       40     1      3245
68   16 2003       40     1     32489
15    3 2004       20     0      2345
35    7 2004       20     1      2345
50   10 2004       20     1      2890
45    9 2004       20     0      3456
40    8 2004       20     0      3892
10    2 2004       20     1      8765
30    6 2004       20     0      8976
5     1 2004       20     0     10900
25    5 2004       20     0     23456
20    4 2004       20     1     23789
73   17 2004       40     0      1234
69   16 2004       40     0      2345
77   18 2004       40     1      2765
85   20 2004       40     0      2897
81   19 2004       40     0      3892

I want to keep pairs of firms, one with dummy=1 and the other with dummy=0, that match in year, industry and dimension.

But the dimension doesn't need to be exactly the same, which is why I mention an interval of + or - 10%.

For example, firm 1 matches firm 5 because they have the same year and industry and a similar dimension (10% x 2120 = 212 and 2189-2120 = 69 < 212),
and firm 1 is dummy=0 and firm 5 is dummy=1.
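
The arithmetic in this example can be written as a small check. This is only a sketch: within_tol is a hypothetical helper name, and it assumes the 10% is taken relative to the first firm's dimension, as in the 10% x 2120 calculation above.

```r
# Hypothetical helper: TRUE when b lies strictly within +/- tol of a,
# with the tolerance measured relative to a.
within_tol <- function(a, b, tol = 0.10) abs(b - a) < tol * a

within_tol(2120, 2189)  # TRUE:  69 < 212      (firm 1 vs firm 5)
within_tol(3178, 2765)  # FALSE: 413 > 317.8   (firm 4 vs firm 8)
```

Because the tolerance is relative to the first argument, within_tol(a, b) and within_tol(b, a) can disagree near the boundary, so it is worth deciding which firm of a pair defines the interval.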

So I want to delete firm 6 because it doesn't match any firm, and keep firms 1 and 5.

   firm year industry dummy dimension
26    6 2000       20     0       781
1     1 2000       20     0      2120
21    5 2000       20     1      2189

Next, I can match firm 4 with firm 7 and delete firm 8.
36    8 2000       20     1      2765
16    4 2000       20     0      3178
31    7 2000       20     1      3245

And so on...

In the end I want to keep only pairs of firms, matched by year, industry and dimension.

If I then split the firms with dummy=1 and the firms with dummy=0 into two separate data frames, I have two matched samples
with the same number of observations. That's what I want.
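
One possible way to get the two data frames directly is a greedy one-to-one match within each year and industry. The sketch below is not a definitive method: match_pairs and tol are hypothetical names, the 10% is assumed to be relative to the dummy=1 firm, and each dummy=0 firm is used at most once. The toy data are the ten year 2000, industry 20 rows listed above.

```r
# Hypothetical helper: greedy one-to-one matching within year x industry.
# For each dummy == 1 firm, take the closest unused dummy == 0 firm and
# accept it if the gap is below tol times the dummy == 1 firm's dimension.
match_pairs <- function(df, tol = 0.10) {
  treated_rows <- integer(0)
  control_rows <- integer(0)
  groups <- split(seq_len(nrow(df)), list(df$year, df$industry), drop = TRUE)
  for (idx in groups) {
    t_idx <- idx[df$dummy[idx] == 1]
    c_idx <- idx[df$dummy[idx] == 0]
    for (i in t_idx) {
      if (length(c_idx) == 0) break
      gap <- abs(df$dimension[c_idx] - df$dimension[i])
      best <- which.min(gap)
      if (gap[best] < tol * df$dimension[i]) {
        treated_rows <- c(treated_rows, i)
        control_rows <- c(control_rows, c_idx[best])
        c_idx <- c_idx[-best]  # each dummy == 0 firm is used at most once
      }
    }
  }
  list(treated = df[treated_rows, ], control = df[control_rows, ])
}

# The year 2000, industry 20 rows from the listing above.
toy <- data.frame(
  firm      = c(6, 1, 5, 8, 4, 7, 3, 2, 9, 10),
  year      = 2000,
  industry  = 20,
  dummy     = c(0, 0, 1, 1, 0, 1, 1, 0, 0, 0),
  dimension = c(781, 2120, 2189, 2765, 3178, 3245, 4532, 4890, 5438, 7690)
)

res <- match_pairs(toy)
res$treated$firm  # 5 7 3
res$control$firm  # 1 4 2
```

Row k of res$treated is paired with row k of res$control, so the two samples always have the same number of observations. A greedy rule depends on the order in which firms are processed and is not guaranteed to find the largest possible set of pairs.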

Thank you,
Cecília Carmo 
Universidade de Aveiro - Portugal