[R] Creating binary variable depending on strings of two dataframes

David Winsemius dwinsemius at comcast.net
Tue May 10 16:05:27 CEST 2011


On May 10, 2011, at 9:49 AM, noxyport at gmail.com wrote:

> On Tue, May 10, 2011 at 3:09 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>
>> On May 10, 2011, at 3:18 AM, noxyport at gmail.com wrote:
>>
>>> On Fri, May 6, 2011 at 7:41 PM, David Winsemius <dwinsemius at comcast.net> wrote:
>>>>
>>>> On May 6, 2011, at 11:35 AM, Pete Pete wrote:
>>>>
>>>>>
>>>>> Gabor Grothendieck wrote:
>>>>>>
>>>>>> On Tue, Dec 7, 2010 at 11:30 AM, Pete Pete <noxyport at gmail.com> wrote:
>>>>>>>
>>>>>>> Hi,
>>>>>>> consider the following two dataframes:
>>>>>>> x1=c("232","3454","3455","342","13")
>>>>>>> x2=c("1","1","1","0","0")
>>>>>>> data1=data.frame(x1,x2)
>>>>>>>
>>>>>>> y1=c("232","232","3454","3454","3455","342","13","13","13","13")
>>>>>>> y2=c("E1","F3","F5","E1","E2","H4","F8","G3","E1","H2")
>>>>>>> data2=data.frame(y1,y2)
>>>>>>>
>>>>>>> I need a new column in dataframe data1 (x3), which is either 0 or 1
>>>>>>> depending on whether the value "E1" occurs in y2 of data2 for a row
>>>>>>> where x1=y1. The result of data1 should look like this:
>>>>>>>     x1 x2 x3
>>>>>>> 1  232  1  1
>>>>>>> 2 3454  1  1
>>>>>>> 3 3455  1  0
>>>>>>> 4  342  0  0
>>>>>>> 5   13  0  1
>>>>>>>
>>>>>>> I think a SQL command could help me but I am too inexperienced with it
>>>>>>> to get there.
>>>>>>>
>>>>>>
>>>>>> Try this:
>>>>>>
>>>>>> > library(sqldf)
>>>>>> > sqldf("select x1, x2, max(y2 = 'E1') x3 from data1 d1 left join data2 d2
>>>>>> +        on (x1 = y1) group by x1, x2 order by d1.rowid")
>>>>>>
>>>>>>     x1 x2 x3
>>>>>> 1  232  1  1
>>>>>> 2 3454  1  1
>>>>>> 3 3455  1  0
>>>>>> 4  342  0  0
>>>>>> 5   13  0  1
>>>>>>
>>>>>>
>>>> snipped Gabor's sig
>>>>>
>>>>> That works pretty well, but I need to automate this a bit more. Consider
>>>>> the following example:
>>>>>
>>>>> list1=c("A01","B04","A64","G84","F19")
>>>>>
>>>>> x1=c("232","3454","3455","342","13")
>>>>> x2=c("1","1","1","0","0")
>>>>> data1=data.frame(x1,x2)
>>>>>
>>>>> y1=c("232","232","3454","3454","3455","342","13","13","13","13")
>>>>> y2=c("E13","B04","F19","A64","E22","H44","F68","G84","F19","A01")
>>>>> data2=data.frame(y1,y2)
>>>>>
>>>>> I now want to create a loop that adds a new binary variable to data1
>>>>> for every value in list1. The result should look like:
>>>>> x1      x2      A01     B04     A64     G84     F19
>>>>> 232     1       0       1       0       0       0
>>>>> 3454    1       0       0       1       0       1
>>>>> 3455    1       0       0       0       0       0
>>>>> 342     0       0       0       0       0       0
>>>>> 13      0       1       0       0       1       1
>>>>
>>>> Loops!?! We don't need no steenking loops!
>>>>
>>>> > xtb <- with(data2, table(y1,y2))
>>>> > cbind(data1, xtb[match(data1$x1, rownames(xtb)), ] )
>>>>
>>>>        x1 x2 A01 A64 B04 E13 E22 F19 F68 G84 H44
>>>> 232   232  1   0   0   1   1   0   0   0   0   0
>>>> 3454 3454  1   0   1   0   0   0   1   0   0   0
>>>> 3455 3455  1   0   0   0   0   1   0   0   0   0
>>>> 342   342  0   0   0   0   0   0   0   0   0   1
>>>> 13     13  0   1   0   0   0   0   1   1   1   0
>>>>
>>>> I am guessing that you were too ... er, busy? ... to complete the table?
>>>>
>>>> --
>>>>
>>>> David Winsemius, MD
>>>> West Hartford, CT
>>>>
>>>>
>>>
>>> Thanks a lot! Pretty simple. I am so much used to SQLDF right now.
>>>
>>> So how would you handle more complicated strings like that:
>>> y1=c("232","232","232","3454","3454","3455","342","13","13","13","13")
>>> y2=c("E13","B04 A01 F19","B04","F19","A64 G84 A05","E22","H44 C35",
>>>      "F68","G84","F19","A01")
>>> data2=data.frame(y1,y2)
>>>
>>> Where you want to extract for instance all "A01" from the strings?
>>
>> I think you need either to explain what you want in more words of the
>> English language or to offer an example of the desired output. I suspect
>> you did not want something as simple as this:
>>
>> > A01.instances <- grep("A01" , data2$y2)
>> > A01.instances
>> [1]  2 11
>> > data2[A01.instances, ]
>>     y1          y2
>> 2  232 B04 A01 F19
>> 11  13         A01
>>
>> Or maybe you did?
>>
>> --
>> David Winsemius, MD
>> West Hartford, CT
>>
>>
>
> No, that was not my intention. Consider the following example:
>
> list1=c("A01","B04","A64","G84","F19") # My "substrings" to screen for in data2
>
> x1=c("232","3454","3455","342","13")
> x2=c("1","1","1","0","0")
> data1=data.frame(x1,x2) # Target dataframe where the 5 new binary variables
>                         # (namely from list1) are added
>
> y1=c("232","232","232","3454","3454","3455","342","13","13","13","13")
> y2=c("E133","B04 A01A F194","B04","F19","A642 G84 A05","E223","H44 C35",
>      "F68","G84","F19","A01")
> data2=data.frame(y1,y2) # Dataframe to be screened by list1
>
> Result should look like this:
>
> x1      x2      A01     B04     A64     G84     F19
> 232     1       1       1       0       0       0
> 3454    1       0       0       1       0       1
> 3455    1       0       0       0       0       0
> 342     0       0       0       0       0       0
> 13      0       1       0       0       1       1

And how were we supposed to figure out that 3454/G84 was not supposed  
to be counted?

OK, let's assume you were just sloppy ... then build a new data.frame:

 > data4 <- data.frame(y1 = rep(data3[,1],
 +                       sapply(strsplit(gsub("\\\n"," ",data3$y2), " "), length) ),
 +                     y2 = unlist(strsplit(gsub("\\\n"," ",data3$y2), " ") ) )
 > data4
     y1  y2
1   232 E13
2   232 B04
3   232 A01
4   232 F19
5   232 B04
6  3454 F19
7  3454 A64
8  3454 G84
9  3454 A05
10 3455 E22
11  342 H44
12  342 C35
13   13 F68
14   13 G84
15   13 F19
16   13 A01
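
From a long table like data4, the indicator columns can be built with the same table()/match() idea used earlier in the thread. What follows is only a sketch under assumptions: data3 is not defined anywhere in the thread, so it is taken here to be the data2 from the earlier message (the one with "B04 A01 F19"); codes are matched as whole tokens, not as substrings; and the names codes, flags, and result are made up for illustration.

```r
# Data as posted earlier in the thread; data3 is ASSUMED to be that data2.
list1 <- c("A01", "B04", "A64", "G84", "F19")
data1 <- data.frame(x1 = c("232", "3454", "3455", "342", "13"),
                    x2 = c("1", "1", "1", "0", "0"),
                    stringsAsFactors = FALSE)
data3 <- data.frame(
  y1 = c("232", "232", "232", "3454", "3454", "3455", "342",
         "13", "13", "13", "13"),
  y2 = c("E13", "B04 A01 F19", "B04", "F19", "A64 G84 A05", "E22",
         "H44 C35", "F68", "G84", "F19", "A01"),
  stringsAsFactors = FALSE)

# One row per (y1, code) pair, splitting on any run of whitespace.
codes <- strsplit(data3$y2, "[[:space:]]+")
data4 <- data.frame(y1 = rep(data3$y1, sapply(codes, length)),
                    y2 = unlist(codes),
                    stringsAsFactors = FALSE)

# Cross-tabulate, restricting the columns to list1 (in that order).
# Codes outside list1 become NA and are not counted.
xtb <- unclass(table(data4$y1, factor(data4$y2, levels = list1)))

# Turn counts into 0/1 flags and put the rows in data1's order.
flags <- (xtb[match(data1$x1, rownames(xtb)), , drop = FALSE] > 0) + 0L
result <- cbind(data1, flags)
rownames(result) <- NULL
result
```

With whole-token matching, 3454 gets a 1 for G84, which is exactly the cell the posted "desired result" left at 0. If substring matches such as "A01A" are meant to count as well, grepl() on the unsplit strings would be the place to start, but then the intended rule needs spelling out first.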



-- 
David Winsemius, MD
West Hartford, CT


