[R] how to delete specific rows in a data frame where the first column matches any string from a list

Andrew Choens andy.choens at gmail.com
Mon Feb 9 02:50:35 CET 2009


Interesting. Thanks.

On Sat, 2009-02-07 at 02:36 +0100, Wacek Kusnierczyk wrote:
> Andrew Choens wrote:
> > I regularly deal with a similar pattern at work. People send me these
> > big long .csv files and I have to run them through some pattern analysis
> > to decide which rows I keep and which rows I kill off.
> >
> > As others have mentioned, Perl is a good candidate for this task.
> > Another option would be a quick SQL query. It should be a snap to pull
> > this into something like Access or OOo Base ... or better yet, a
> > real database like Postgres, MySQL, etc.
> >
> > In case you aren't too familiar with SQL, this query could be done by
> > deleting the rows using a self join (syntax varies by product).
> >
> > But, if the pattern is as simple as it sounds and / or this is a
> > one-time job, using SQL is over-kill for the situation.
> >
> > I often use sed in places where Perl is over-kill, but I can't think of
> > any way to match from row to row with sed. If anyone knows how to do
> > this with sed, it would (probably) be easier than trying to learn how to
> > use perl. And, I would like to know how to do this with sed too.
> >
> >   
> 
> (this is actually off-topic, but since it may be interesting for the
> general public, i keep the response cc: to r-help)
> 
> yes, you can do this with sed.  suppose you have two files, one (say,
> sample.txt) with the data to be filtered, record fields separated by,
> e.g., a tab character, and another (say, filter.txt) with patterns to be
> matched.  a row from the first is passed to output only if its second
> field does not match any of the patterns -- this corresponds to (a
> simplified version of) the original problem.
> 
> then, the following should do:
> 
> sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt
> 
> (unless the patterns contain characters that interfere with the shell or
> sed's syntax, in which case they'd have to be appropriately escaped.)
> 
> vQ
> 
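The sed one-liner quoted above can be tried on a small example. What follows is only a sketch: the file contents and the temporary directory are made up for illustration, and it assumes GNU sed, since \t and \+ are GNU extensions. The awk line at the end is an alternative that is not from the original post; it compares the second field to each pattern as a literal string, so nothing needs escaping.

```shell
# Hypothetical scratch directory and sample data (not from the original post).
workdir=$(mktemp -d)
cd "$workdir" || exit 1

# sample.txt: tab-separated records; the second field is the one tested.
printf '1\tapple\tred\n2\tbanana\tyellow\n3\tcherry\tred\n4\tapricot\torange\n' > sample.txt

# filter.txt: one pattern per line.
printf 'banana\ncherry\n' > filter.txt

# The inner sed rewrites each pattern P into the sed command
#   /^[^\t]\+\tP/d
# and the outer sed runs the resulting script against sample.txt.
# Must stay on one line: a newline right after ">" is a shell syntax error.
sed "$(sed 's/^/\/^[^\\t]\\+\\t/; s/$/\/d/' filter.txt)" sample.txt > filtered-sample.txt

# Alternative (assumption, not the poster's method): load filter.txt into an
# awk array, then print only rows whose second field is not a key in it.
awk -F'\t' 'NR == FNR { drop[$0]; next } !($2 in drop)' filter.txt sample.txt > filtered-exact.txt

cat filtered-sample.txt
```

With this data both commands keep the apple and apricot rows. They are not equivalent in general: the generated sed commands are not anchored at the end of the field, so a pattern such as "apri" would also delete the apricot row, while the awk version drops a row only when the whole second field equals a pattern.
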
-- 
This is the price and the promise of citizenship.
        -- Barack Obama, 44th President of the United States
