[R] Selecting columns whose names contain "mutated" except when they also contain "non" or "un"

Fri Apr 27 20:51:27 CEST 2012

Hi Greg,

This is very helpful. Thanks for explaining it. I clearly need to improve my understanding of regular expressions, though at the moment I'm busy trying to figure out Sweave and knitr.

Paul

--- On Thu, 4/26/12, Greg Snow <538280 at gmail.com> wrote:

> From: Greg Snow <538280 at gmail.com>
> Subject: Re: [R] Selecting columns whose names contain "mutated" except when they also contain "non" or "un"
> To: "Paul Miller" <pjmiller_57 at yahoo.com>
> Cc: r-help at r-project.org
> Received: Thursday, April 26, 2012, 1:55 PM
> Sorry I took so long getting back to this, but the paying job needs to
> take priority.
> 
> The regular expression "(?<!un)(?<!non)muta" looks for a string that
> matches "muta", then looks at the characters immediately before it to
> see whether they match either "un" or "non", in which case it rejects
> the match. More specifically, the regular expression engine steps
> through the string and tries the match at each position. At a given
> position it first checks whether "un" comes just before that point; if
> it does, this position can't match and the engine moves on to the next
> position. If it does not, the engine moves to the next negative
> lookbehind and checks whether "non" comes just before the point. If
> neither "un" nor "non" is just before the point, it starts matching
> the characters after the point to see whether they match "muta".
> 
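A minimal sketch of the negative lookbehind described above, applied to the example names from this thread. The data frame `df` is hypothetical and exists only to show how the result could drive column selection.

```r
# Example names from the thread.
tmp <- c("mutation", "nonmutated", "unmutated", "verymutated", "other")

# perl = TRUE is required: lookbehind is PCRE syntax.
keep <- grepl("(?<!un)(?<!non)muta", tmp, perl = TRUE)
tmp[keep]   # "mutation" "verymutated"

# Hypothetical data frame (not from the thread) whose columns carry those names.
df <- as.data.frame(setNames(as.list(1:5), tmp))
selected <- df[, keep, drop = FALSE]
names(selected)   # "mutation" "verymutated"
```

Using grepl() rather than grep() gives a logical vector, which is convenient if the condition later needs to be combined with other tests.
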
> So the next pattern is "(?!muta)non|un". The (?!muta) is a negative
> lookahead, which starts at the current point and checks forward to see
> that the next characters are not "muta" (but does not include them in
> the match). In this case it is a no-op, because you are saying that
> you want to match at a point where the next characters are not "muta"
> but are "non", and since the next set of characters cannot be both,
> this is the same as just matching "non". You also need to be aware of
> operator precedence: in that pattern the (?!muta) part applies only to
> the "non", not to the "un".
> 
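A short sketch of the two points above: the lookahead changes nothing, and the alternation splits the pattern into "(?!muta)non" and a bare "un". The vector `nm` is made up for illustration and is not from the original data.

```r
tmp <- c("mutation", "nonmutated", "unmutated", "verymutated", "other")

# The lookahead is a no-op: both patterns select the same elements.
with_lookahead    <- grep("(?!muta)non|un", tmp, perl = TRUE)
without_lookahead <- grep("non|un", tmp, perl = TRUE)
identical(with_lookahead, without_lookahead)   # TRUE

# Hypothetical names: "country" and "amount" both contain "un".
nm <- c("nonmutated", "unmutated", "country", "amount")
loose <- grep("(?!muta)non|un", nm, perl = TRUE, value = TRUE)
loose   # all four names match, because "un" is a standalone alternative

# Grouping the alternation ties both prefixes to "muta".
strict <- grep("(non|un)muta", nm, value = TRUE)
strict  # "nonmutated" "unmutated"
```

If a data frame had a non-numeric column whose name happened to contain "un", the loose pattern would select it too. That is one possible source of a "'x' must be numeric" error from rowSums(), though nothing in this thread confirms it for the data in question.
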
> To match "nonmuta" or "unmuta", a simple pattern would just be
> "(non|un)muta" or "(no|u)nmuta". You could use a positive lookbehind
> (you would still need an "or"), but it would be overkill for a grep
> command. The positive lookahead/lookbehind matters more when
> replacing, where the lookahead/lookbehind is needed for the match to
> happen but is not captured as part of the match to be replaced.
> 
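To illustrate the point about replacement, here is a small sketch comparing a plain group with a positive lookbehind in sub(). The replacement string "X" is arbitrary and chosen only to make the difference visible.

```r
tmp <- c("mutation", "nonmutated", "unmutated", "verymutated", "other")

# Plain grouping: the prefix is part of the match, so it is replaced too.
grouped <- sub("(non|un)muta", "X", tmp)
grouped     # "mutation" "Xted" "Xted" "verymutated" "other"

# Positive lookbehind: the prefix must be present but is not consumed.
lookbehind <- sub("(?<=non|un)muta", "X", tmp, perl = TRUE)
lookbehind  # "mutation" "nonXted" "unXted" "verymutated" "other"
```

For grep(), where only the presence of a match matters, both patterns select the same elements, which is why the lookbehind is overkill there.
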
> 
> 
> On Tue, Apr 24, 2012 at 7:40 AM, Paul Miller <pjmiller_57 at yahoo.com> wrote:
> > Hi Greg,
> >
> > This is quite helpful. Not so good yet with regular expressions in
> > general or Perl-like regular expressions. Found the help page though,
> > and think I was able to determine how the code works as well as how I
> > would select only instances where "muta" is preceded by either "non"
> > or "un".
> >
> >> (tmp <- c('mutation','nonmutated','unmutated','verymutated','other'))
> > [1] "mutation"    "nonmutated"  "unmutated"   "verymutated" "other"
> >
> >> grep("(?<!un)(?<!non)muta", tmp, perl=TRUE)
> > [1] 1 4
> >
> >> grep("(?!muta)non|un", tmp, perl=TRUE)
> > [1] 2 3
> >
> > Did I get the second grep right?
> >
> > If so, do you have any sense of why it seems to fail when I apply it
> > to my data?
> >
> >> KRASyn$NonMutant_comb <- rowSums(KRASyn[grep("(?!muta)non|un", names(KRASyn), perl=TRUE)])
> >
> > Error in rowSums(KRASyn[grep("(?!muta)non|un", names(KRASyn), perl = TRUE)]) :
> >  'x' must be numeric
> >
> > Thanks,
> >
> > Paul
> >
> 
> 
> 
> -- 
> Gregory (Greg) L. Snow Ph.D.
> 538280 at gmail.com
>