[R] Subsetting data where the condition is that the value of some column contains some substring

Sat Mar 21 01:25:39 CET 2009

I have some data that looks like this:

> dataP
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

I would like to extract the subset of this data in which the value of
the "input" column contains the substring "her". I was thinking I
could use the grep function to test for the presence of this
substring. For instance, if a string does not contain it, then grep
returns a zero length integer vector:

> grep("her", "give(my sister, it)")
integer(0)

And if the string does contain the substring, grep returns the index of
the matching element of its input (here 1, the only element):

> grep("her", "give(her, it)")
[1] 1
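
For comparison, here is a minimal sketch of grep applied to a vector with
more than one element. The vector x is a hypothetical stand-in built from
three of the input values above. It suggests that the numbers grep returns
refer to elements of the vector it is given, not to character positions
inside a string:

```r
# Hypothetical three-element vector, modelled on the "input" column.
x <- c("give(my sister, it)", "give(her, it)", "donate(her, it)")

# grep() reports which ELEMENTS of x contain the pattern.
hits <- grep("her", x)
print(hits)

# The length of that result is a single number: the count of matching elements.
print(length(hits))
```

With a one-element input the result can therefore only be 1 or integer(0),
which is what the two calls above show.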

I can thus test for the presence of the substring by converting the
length of the result of grep into a boolean:

> as.logical(length(grep("her", "give(my sister, it)")))
[1] FALSE
> as.logical(length(grep("her", "give(her, it)")))
[1] TRUE
> as.logical(length(grep("her", "give(her, it)"))) == TRUE
[1] TRUE
> as.logical(length(grep("her", "give(my sister, it)"))) == TRUE
[1] FALSE

I would like to use this test as a criterion for constructing a subset
of my data. Unfortunately, it does not work:

> subset(dataP, as.logical(length(grep("her", input)))==TRUE)
                                input output corpusFreq pvolOT pvolRatioOT
1       give(my sister, the old book)      P       47.0  56016   0.1543651
5               donate(her, the book)      P       48.7  68928   0.1899471
9           give(my sister, the book)      P       73.4  80136   0.2208333
13    donate(my sister, the old book)      P       79.0  57024   0.1571429
20                give(my sister, it)      P      100.0 132408   0.3648810
21                      give(her, it)      P      100.0 157248   0.4333333
24              donate(my sister, it)      P      100.0 130720   0.3602293
28                give(her, the book)      P        5.7  65232   0.1797619
31                    donate(her, it)      P      100.0 152064   0.4190476
35   give(my little sister, the book)      P       91.8 112032   0.3087302
39 donate(my little sister, the book)      P       98.4 114048   0.3142857
43        donate(my sister, the book)      P       94.4  82800   0.2281746

As you can see, I get back the whole data set, rather than just the
subset where the input column contains "her". And if I invert the
test, which I would expect to give the subset *not* containing "her",
I instead get the empty subset, rather mysteriously:

> subset(dataP, as.logical(length(grep("her", input)))==FALSE)
[1] input       output      corpusFreq  pvolOT      pvolRatioOT
<0 rows> (or 0-length row.names)
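
A likely explanation for the all-or-nothing result, sketched on a
three-row stand-in for dataP (the rows are taken from the data above; the
small data frame itself is an assumption for illustration): the condition
evaluates to a single logical value for the whole column, and a length-one
logical row index is recycled across every row.

```r
# Small stand-in for dataP, using three rows from the data shown above.
dataP <- data.frame(
  input      = c("give(my sister, it)", "give(her, it)", "donate(her, it)"),
  corpusFreq = c(100.0, 100.0, 100.0),
  stringsAsFactors = FALSE
)

# grep() sees the whole column at once, so length() yields one number
# and as.logical() yields one TRUE/FALSE, not one value per row.
cond <- as.logical(length(grep("her", dataP$input)))
print(cond)

# A length-one logical index is recycled: TRUE keeps every row,
# FALSE keeps none.
print(nrow(dataP[cond, ]))
print(nrow(dataP[!cond, ]))
```

Under this reading, the condition is TRUE whenever at least one row
contains "her", so the first subset keeps everything and the inverted one
keeps nothing.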

The type of the input column is definitely character. To be doubly sure:

> subset(dataP, as.logical(length(grep("her", as.character(input))))==TRUE)

does the same thing.
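
One way to get a per-row test, offered as a sketch rather than the only
approach: grepl returns one logical value per element, which is the shape
subset expects. The four-row data frame below is a stand-in built from rows
of the data above, and the names withHer, withoutHer and byIndex are
hypothetical. The argument fixed = TRUE is an assumption that a literal
substring match, not a regular expression, is wanted.

```r
# Stand-in for dataP, using four rows from the data shown above.
dataP <- data.frame(
  input      = c("give(my sister, the old book)", "donate(her, the book)",
                 "give(her, it)", "donate(my sister, it)"),
  corpusFreq = c(47.0, 48.7, 100.0, 100.0),
  stringsAsFactors = FALSE
)

# grepl() gives one TRUE/FALSE per row, so subset() can filter row by row.
withHer    <- subset(dataP, grepl("her", input, fixed = TRUE))
withoutHer <- subset(dataP, !grepl("her", input, fixed = TRUE))

# Equivalent selection using the element indices that grep() returns.
byIndex <- dataP[grep("her", dataP$input, fixed = TRUE), ]

print(withHer)
print(withoutHer)
```

Note that a plain substring test would also match words such as "other" or
"there"; none occur in this data, but a stricter pattern may be needed
elsewhere.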

Could somebody with more R experience than I have please explain what
I am doing wrong here? I'll be much obliged.

-- 
Max Bane
PhD Student, Linguistics
University of Chicago
bane at uchicago.edu