[R] Subsetting data where the condition is that the value of some column contains some substring

Sat Mar 21 02:49:03 CET 2009

If you take Jim's example and use grep() with ordinary and then
negative indexing, you get these results:

 > x[grep("her", x$input),]
                 input output corpusFreq pvolOT pvolRatioOT
2 donate(her,thebook)      P       48.7  68928   0.1899471
6        give(her,it)      P      100.0 157248   0.4333333
8   give(her,thebook)      P        5.7  65232   0.1797619
9      donate(her,it)      P      100.0 152064   0.4190476

 > x[-grep("her", x$input),]
                             input output corpusFreq pvolOT pvolRatioOT
1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
3          give(mysister,thebook)      P       73.4  80136   0.2208333
4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
5               give(mysister,it)      P      100.0 132408   0.3648810
7             donate(mysister,it)      P      100.0 130720   0.3602293
10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
12       donate(mysister,thebook)      P       94.4  82800   0.2281746
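
One caveat with the negative-indexing form, sketched here on a small
made-up data frame (the three rows and the pattern "zzz" are illustrative
assumptions, not the data above): when nothing matches, grep() returns
integer(0), and x[-integer(0), ] selects no rows rather than all of them.
A logical index avoids that special case; grepl(), available from R 2.9.0,
returns one directly.

```r
# Small stand-in data frame; the values are illustrative only.
x <- data.frame(input = c("give(mysister,it)", "give(her,it)", "donate(her,it)"),
                corpusFreq = c(100, 100, 100),
                stringsAsFactors = FALSE)

hits <- grep("her", x$input)   # integer positions of the matching rows
x[hits, ]                      # rows whose input contains "her"
x[-hits, ]                     # rows whose input does not

# Pitfall: with no matches grep() returns integer(0), and negative
# indexing with integer(0) selects zero rows instead of all rows.
none <- grep("zzz", x$input)
nrow(x[-none, ])               # 0, not 3

# A logical index has one entry per row, so it needs no special case.
keep <- grepl("her", x$input)
x[keep, ]                      # rows containing "her"
x[!grepl("zzz", x$input), ]    # all rows, as intended
```

So the negative form is fine as long as at least one row matches; if the
pattern might match nothing, a logical index is the safer habit.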

-- 
David.

On Mar 20, 2009, at 9:39 PM, jim holtman wrote:

> grep and regexpr return different values.  regexpr returns a vector of
> the same length as the input (-1 where there is no match), so it can be
> used to construct a logical subscript.  grep returns a vector of only
> the indices of the matches, which has length zero if there are no
> matches.  That makes it harder to create the subsets: you have to test
> for zero length and then handle that case specially.
>
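
To make the difference in return values concrete, here is a small sketch
on a two-element character vector (the vector s is an illustrative
assumption, not part of the original example). It only shows the shapes
of the two results and how the regexpr() result turns into a logical
subscript.

```r
s <- c("give(mysister,it)", "give(her,it)")

hits <- grep("her", s)            # indices of the matching elements only
pos  <- regexpr("her", s)         # one entry per element; -1 means no match
as.vector(pos)                    # start position of the match, or -1

# Comparing against -1 gives a logical vector of the same length as s.
matched <- as.vector(pos != -1)
s[matched]                        # elements containing "her"
s[!matched]                       # elements not containing "her"
```
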
> On Fri, Mar 20, 2009 at 9:20 PM, Max Bane <max.bane at gmail.com> wrote:
>> Thanks, Jim (and Mark, who replied off-list) -- that does the trick. I
>> had tried using an index expression with grep, but that failed in the
>> same way as the subset method. It is still rather mysterious why this
>> works with regexpr but not with grep :)
>>
>> -Max
>>
>> On Fri, Mar 20, 2009 at 7:57 PM, jim holtman <jholtman at gmail.com> wrote:
>>> Try using regexpr instead:
>>>
>>>> x <- read.table(textConnection("input output corpusFreq pvolOT pvolRatioOT
>>> + give(mysister,theoldbook)      P       47.0  56016   0.1543651
>>> + donate(her,thebook)      P       48.7  68928   0.1899471
>>> + give(mysister,thebook)      P       73.4  80136   0.2208333
>>> + donate(mysister,theoldbook)      P       79.0  57024   0.1571429
>>> + give(mysister,it)      P      100.0 132408   0.3648810
>>> + give(her,it)      P      100.0 157248   0.4333333
>>> + donate(mysister,it)      P      100.0 130720   0.3602293
>>> + give(her,thebook)      P        5.7  65232   0.1797619
>>> + donate(her,it)      P      100.0 152064   0.4190476
>>> + give(mylittlesister,thebook)      P       91.8 112032   0.3087302
>>> + donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
>>> + donate(mysister,thebook)      P       94.4  82800   0.2281746"), header=TRUE)
>>>> # use regexpr
>>>> matched <- regexpr("her", x$input) != -1
>>>> notMatched <- !matched
>>>> x[matched,]
>>>                input output corpusFreq pvolOT pvolRatioOT
>>> 2 donate(her,thebook)      P       48.7  68928   0.1899471
>>> 6        give(her,it)      P      100.0 157248   0.4333333
>>> 8   give(her,thebook)      P        5.7  65232   0.1797619
>>> 9      donate(her,it)      P      100.0 152064   0.4190476
>>>> x[notMatched,]
>>>                             input output corpusFreq pvolOT pvolRatioOT
>>> 1       give(mysister,theoldbook)      P       47.0  56016   0.1543651
>>> 3          give(mysister,thebook)      P       73.4  80136   0.2208333
>>> 4     donate(mysister,theoldbook)      P       79.0  57024   0.1571429
>>> 5               give(mysister,it)      P      100.0 132408   0.3648810
>>> 7             donate(mysister,it)      P      100.0 130720   0.3602293
>>> 10   give(mylittlesister,thebook)      P       91.8 112032   0.3087302
>>> 11 donate(mylittlesister,thebook)      P       98.4 114048   0.3142857
>>> 12       donate(mysister,thebook)      P       94.4  82800   0.2281746
>>>>
>>>>
>>>
>>>
>>> On Fri, Mar 20, 2009 at 8:25 PM, Max Bane <max.bane at gmail.com> wrote:
>>>> I have some data that looks like this:
>>>>
>>>>> dataP
>>>>                                 input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> I would like to extract the subset of this data in which the value of
>>>> the "input" column contains the substring "her". I was thinking I
>>>> could use the grep function to test for the presence of this
>>>> substring. For instance, if a string does not contain it, then grep
>>>> returns a zero length integer vector:
>>>>
>>>>> grep("her", "give(my sister, it)")
>>>> integer(0)
>>>>
>>>> And if the string does contain the substring, grep returns a vector of
>>>> the indices where the substring is located:
>>>>
>>>>> grep("her", "give(her, it)")
>>>> [1] 1
>>>>
>>>> I can thus test for the presence of the substring by converting the
>>>> length of the result of grep into a boolean:
>>>>
>>>>> as.logical(length(grep("her", "give(my sister, it)")))
>>>> [1] FALSE
>>>>> as.logical(length(grep("her", "give(her, it)")))
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(her, it)"))) == TRUE
>>>> [1] TRUE
>>>>> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
>>>> [1] FALSE
>>>>
>>>> I would like to use this test as a criterion for constructing a subset
>>>> of my data. Unfortunately, it does not work:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
>>>>                                 input output corpusFreq pvolOT pvolRatioOT
>>>> 1       give(my sister, the old book)      P       47.0  56016   0.1543651
>>>> 5               donate(her, the book)      P       48.7  68928   0.1899471
>>>> 9           give(my sister, the book)      P       73.4  80136   0.2208333
>>>> 13    donate(my sister, the old book)      P       79.0  57024   0.1571429
>>>> 20                give(my sister, it)      P      100.0 132408   0.3648810
>>>> 21                      give(her, it)      P      100.0 157248   0.4333333
>>>> 24              donate(my sister, it)      P      100.0 130720   0.3602293
>>>> 28                give(her, the book)      P        5.7  65232   0.1797619
>>>> 31                    donate(her, it)      P      100.0 152064   0.4190476
>>>> 35   give(my little sister, the book)      P       91.8 112032   0.3087302
>>>> 39 donate(my little sister, the book)      P       98.4 114048   0.3142857
>>>> 43        donate(my sister, the book)      P       94.4  82800   0.2281746
>>>>
>>>> As you can see, I get back the whole data set, rather than just the
>>>> subset where the input column contains "her". And if I invert the
>>>> test, which I would expect to give the subset *not* containing "her",
>>>> I instead get the empty subset, rather mysteriously:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
>>>> [1] input       output      corpusFreq  pvolOT      pvolRatioOT
>>>> <0 rows> (or 0-length row.names)
>>>>
>>>> The type of the input column is definitely character. To be double sure:
>>>>
>>>>> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)
>>>>
>>>> does the same thing.
>>>>
>>>> Could somebody with more R experience than I have please explain what
>>>> I am doing wrong here? I'll be much obliged.
>>>>
>>>
>>
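
As for why the original subset() call returned everything: inside
subset(), grep() is applied to the whole input column at once, so
length(grep(...)) is a single number (the count of matching rows), and the
condition collapses to a single TRUE or FALSE that is recycled over every
row. The sketch below illustrates this on a made-up three-row stand-in for
dataP (the rows are assumptions for illustration, not the original data).

```r
# Illustrative stand-in for dataP (three made-up rows).
dataP <- data.frame(input = c("give(my sister, it)", "give(her, it)",
                              "donate(her, it)"),
                    output = "P", stringsAsFactors = FALSE)

# The whole column goes into grep(), so the test yields one value only.
cond <- as.logical(length(grep("her", dataP$input)))
cond                                   # a single TRUE, not one value per row

# A single TRUE keeps every row; a single FALSE keeps none.
all_rows <- subset(dataP, as.logical(length(grep("her", input))) == TRUE)
no_rows  <- subset(dataP, as.logical(length(grep("her", input))) == FALSE)

# A per-row logical vector gives the intended subsets.
her_rows   <- subset(dataP, regexpr("her", input) != -1)
other_rows <- subset(dataP, regexpr("her", input) == -1)
```

The same reasoning explains the empty result for the inverted test: the
single value becomes FALSE and is recycled over all rows.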

David Winsemius, MD
Heritage Laboratories
West Hartford, CT