[R] how to delete specific rows in a data frame where the first column matches any string from a list

Mon Feb 9 11:17:13 CET 2009

Andrew Choens wrote:
> Interesting. Thanks.
>
> On Sat, 2009-02-07 at 02:36 +0100, Wacek Kusnierczyk wrote:
>   
>> yes, you can do this with sed.  suppose you have two files, one (say,
>> sample.txt) with the data to be filtered, record fields separated by,
>> e.g., a tab character, and another (say, filter.txt) with patterns to be
>> matched.  a row from the first is passed to output only if its second
>> field does not match any of the patterns -- this corresponds to (a
>> simplified version of) the original problem.
>>
>> then, the following should do:
>>
>> sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt >
>> filtered-sample.txt
>>
>> (unless the patterns contain characters that interfere with the shell or
>> sed's syntax, in which case they'd have to be appropriately escaped.)
>>     

note, this does only the part of the original task that requires removing
rows from one file where the second field matches any of the patterns
specified in another file.  the other part of the original task is a
little bit more involved (from sed's perspective):  the task is to
remove all rows whose second field has already appeared as the second
field of some preceding row.  here's an all-sed solution:

sed "$(sed \
's|^[^\t]*\t\([^\t]\+\)\t.*|0,/^[^\\t]*\\t\1\\t/\!{/^[^\\t]*\\t\1\\t/d}|' \
sample.txt)" sample.txt

btw. the filtering command quoted above needs a minor correction, and can
be further simplified:

sed "$(sed 's|.*|/^[^\\t]*\\t&\\t/d|' filter.txt)" sample.txt
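
to see what the inner sed hands to the outer one, it helps to print the
generated script on its own.  a minimal sketch, assuming a hypothetical
pattern list with the two entries foo and bar (gen_filter_script is just
an illustrative name, not part of the original command):

```shell
# gen_filter_script: read patterns on stdin and print one sed delete
# command per pattern (same substitution as the inner sed above)
gen_filter_script() {
    sed 's|.*|/^[^\\t]*\\t&\\t/d|'
}

# hypothetical patterns, for illustration only
printf 'foo\nbar\n' | gen_filter_script
# prints:
#   /^[^\t]*\tfoo\t/d
#   /^[^\t]*\tbar\t/d
```

each generated line deletes the rows whose second tab-separated field is
matched by that pattern; the outer sed then runs all of them against
sample.txt.  the \t escapes in the generated script rely on GNU sed.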

piping these two should solve the whole task:

sed "$(sed \
's|^[^\t]*\t\([^\t]\+\)\t.*|0,/^[^\\t]*\\t\1\\t/\!{/^[^\\t]*\\t\1\\t/d}|' \
sample.txt)" sample.txt | sed "$(sed 's|.*|/^[^\\t]*\\t&\\t/d|' filter.txt)"
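
for comparison, the same two steps can be sketched in awk.  this is not
what the sed commands above do internally, but it should keep the same
rows when the patterns are plain strings.  assumptions: filter.txt holds
literal strings (compared for equality, not as regular expressions),
fields are tab-separated, and filter_rows is a made-up name:

```shell
# filter_rows FILTER SAMPLE
# print the rows of SAMPLE whose second field is not listed in FILTER
# and has not already appeared as the second field of an earlier row
filter_rows() {
    awk -F'\t' '
        # first file: remember every string to be dropped
        FILENAME == ARGV[1] { drop[$0] = 1; next }
        # second file: keep a row only if its second field is neither
        # listed in the filter nor seen before
        !($2 in drop) && !seen[$2]++
    ' "$1" "$2"
}
```

usage would be along the lines of

filter_rows filter.txt sample.txt > filtered-sample.txt

unlike the sed version, this does not require a third field after the
second one, and characters special to sed or the shell need no escaping.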

vQ