[R] Regular Expressions + Matrices

William Dunlap wdunlap at tibco.com
Fri Aug 10 20:43:30 CEST 2012


If you treat this as a runs problem (finding where consecutive values
repeat or change), you can get a loopless solution that I think is
easier to read (once the requisite functions are defined).

First define a function that reduces each name to its first word
  nickname <- function(x) sub(" .*", "", x)
then define some handy runs functions
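  # TRUE where an element differs from the one before it (first is always TRUE)
  isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
  # shift up by one: TRUE where the next element is TRUE
  isJustBefore <- function(x) c(x[-1], FALSE) # x should be logical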
then use those functions on your dataset
  > nearDup <- !isFirstInRun(nickname(d$NAME)) & isFirstInRun(d$YEAR)
  > d[ nearDup | isJustBefore(nearDup), ]
    ID             NAME YEAR      SOURCE
  1  1    New York Mets 1900        ESPN
  2  2 New York Yankees 1920 Cooperstown
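
To see why those two rows are selected, it can help to print the intermediate vectors. The following is only a sketch: it rebuilds the example data from this thread so that it runs on its own, and the column names in `steps` (sameNickAsPrev, yearChanged, keep) are made up for illustration.

```r
# Rebuild the thread's example data so this sketch is self-contained.
d <- data.frame(
  ID     = 1:5,
  NAME   = c("New York Mets", "New York Yankees", "Boston Redsox",
             "Washington Nationals", "Detroit Tigers"),
  YEAR   = c(1900, 1920, 1918, 2010, 1990),
  SOURCE = c("ESPN", "Cooperstown", "ESPN", "ESPN", "ESPN"),
  stringsAsFactors = FALSE
)

nickname     <- function(x) sub(" .*", "", x)
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isJustBefore <- function(x) c(x[-1], FALSE)

# One column per intermediate step (column names are illustrative only).
steps <- data.frame(
  nick           = nickname(d$NAME),                 # first word of each name
  sameNickAsPrev = !isFirstInRun(nickname(d$NAME)),  # same first word as the row above
  yearChanged    = isFirstInRun(d$YEAR),             # YEAR differs from the row above
  stringsAsFactors = FALSE
)
steps$nearDup <- steps$sameNickAsPrev & steps$yearChanged
# Keep each flagged row and the row just before it.
steps$keep <- steps$nearDup | isJustBefore(steps$nearDup)
print(steps)
```

Only row 2 is flagged as nearDup (it shares "New" with row 1 and its YEAR differs), and isJustBefore() then pulls in row 1 as well.
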
See how it works with triplicates as well
  > dd <- rbind(d, data.frame(ID=6:8,
                          NAME=c("Chicago Blacksox", "Chicago Cubs", "Chicago Whitesox"),
                          YEAR=1701:1703, SOURCE=rep("made up", 3)))
  > nearDup <- !isFirstInRun(nickname(dd$NAME)) & isFirstInRun(dd$YEAR)
  > dd[ nearDup | isJustBefore(nearDup), ]
    ID             NAME YEAR      SOURCE
  1  1    New York Mets 1900        ESPN
  2  2 New York Yankees 1920 Cooperstown
  6  6 Chicago Blacksox 1701     made up
  7  7     Chicago Cubs 1702     made up
  8  8 Chicago Whitesox 1703     made up
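
The original question compared the ID column rather than YEAR and asked for the matching rows to be copied to a new file. A possible way to package the same idea is sketched below. The helper name nearDupRows and its arguments nameCol and keyCol are hypothetical, not something from this thread, and the output goes to a temporary file rather than any particular path.

```r
nickname     <- function(x) sub(" .*", "", x)
isFirstInRun <- function(x) c(TRUE, x[-1] != x[-length(x)])
isJustBefore <- function(x) c(x[-1], FALSE)

# Hypothetical helper: return rows whose first word matches the row above
# while the key column differs, together with the row just before each one.
nearDupRows <- function(x, nameCol = "NAME", keyCol = "ID") {
  if (nrow(x) < 2) return(x[0, , drop = FALSE])  # nothing to compare
  nearDup <- !isFirstInRun(nickname(x[[nameCol]])) & isFirstInRun(x[[keyCol]])
  x[nearDup | isJustBefore(nearDup), , drop = FALSE]
}

# The thread's example data, rebuilt so the sketch runs on its own.
d <- data.frame(
  ID     = 1:5,
  NAME   = c("New York Mets", "New York Yankees", "Boston Redsox",
             "Washington Nationals", "Detroit Tigers"),
  YEAR   = c(1900, 1920, 1918, 2010, 1990),
  SOURCE = c("ESPN", "Cooperstown", "ESPN", "ESPN", "ESPN"),
  stringsAsFactors = FALSE
)

out <- nearDupRows(d)                  # compares ID by default
outFile <- tempfile(fileext = ".csv")  # stand-in for the real output file
write.csv(out, outFile, row.names = FALSE)
```

Passing keyCol = "YEAR" should reproduce the selection shown above. Note that this sketch assumes neither column contains NA; missing values would make the comparisons NA and would need separate handling.
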

Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com


> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf
> Of Rui Barradas
> Sent: Friday, August 10, 2012 11:18 AM
> To: Fred G
> Cc: r-help
> Subject: Re: [R] Regular Expressions + Matrices
> 
> Hello,
> 
> Try the following.
> 
> 
> d <- read.table(textConnection("
> ID NAME                          YEAR     SOURCE
> 1  'New York Mets'               1900      ESPN
> 2  'New York Yankees'          1920     Cooperstown
> 3  'Boston Redsox'               1918      ESPN
> 4  'Washington Nationals'      2010     ESPN
> 5  'Detroit Tigers'                  1990      ESPN
> "), header=TRUE)
> 
> d$NAME <- as.character(d$NAME)
> 
> fun <- function(i, x){
>      if(x[i, "ID"] != x[i + 1, "ID"]){
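>          # compare first words exactly; grepl(s, ...) would match s anywhere in the name
>          s <- unlist(strsplit(x[i, "NAME"], "[[:space:]]"))[1]
>          s2 <- unlist(strsplit(x[i + 1, "NAME"], "[[:space:]]"))[1]
>          if(s == s2) return(TRUE)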
>      }
>      FALSE
> }
> 
> inx <- sapply(seq_len(nrow(d) - 1), fun, d)
> inx <- c(inx, FALSE) | c(FALSE, inx)
> d[inx, ]
> 
> Hope this helps,
> 
> Rui Barradas
> On 10-08-2012 18:41, Fred G wrote:
> > Hi all,
> >
> > My code looks like the following:
> > inname = read.csv("ID_error_checker.csv", as.is=TRUE)
> > outname = read.csv("output.csv", as.is=TRUE)
> >
> > #My algorithm is the following:
> > #for line in inname
> > #if first string up to whitespace in row in inname$name = first string up to whitespace in row + 1 in inname$name
> > #AND ID in inname$ID for the top row NOT EQUAL ID in inname$ID for the row below it
> > #copy these two lines to a new file
> >
> > In other words, if the name (up to the first whitespace) in the first row
> > equals the name in the second row (etc for whole file) and the ID in the
> > first row does not equal the ID in the second row, copy both of these rows
> > in full to a new file.  Only caveat is that I want a regular expression not
> > to take the full names, but just the first string up to the first
> > whitespace in the inname$name column (ie if row1 has a name of: New York
> > Mets and row2 has a name of New York Yankees, I would want both of these
> > rows to be copied in full since "New" is the same in both...)
> >
> > Here is some example data:
> > ID NAME                          YEAR     SOURCE     NOTES
> > 1  New York Mets               1900      ESPN
> > 2  New York Yankees          1920     Cooperstown
> > 3  Boston Redsox               1918      ESPN
> > 4  Washington Nationals      2010     ESPN
> > 5  Detroit Tigers                  1990      ESPN
> >
> > The desired output would be:
> > ID   NAME                    YEAR SOURCE
> > 1    New York Mets        1900   ESPN
> > 2    New York Yankees   1920   Cooperstown
> >
> > Thanks so much!
> >
> >
> > ______________________________________________
> > R-help at r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.
> 


