[R] Grep with wildcards across multiple columns

arun smartpink111 at yahoo.com
Thu Mar 14 23:19:24 CET 2013


Hi,

Not sure whether this helps.
If you take out the grep() call on par.obj, the code runs without any warning:
eval(parse(text=paste(
  "dt2 <- dt[", "grep('", par.fund, "', fund) & ",
  "grep('", par.func, "', func)",
  ", sum(amount), by=c('code', 'year')]" , sep="")))
dt[grep('^1.E$', fund) & grep('^1.....$', func), sum(amount), by=c('code', 'year')]
#   code year     V1
#1: 1001 2011 185482
#2: 1001 2012 189367
#3: 1002 2011 238098
#4: 1002 2012 211499
aggregate(amount~code+year,data=df,sum)
#  code year amount
#1 1001 2011 185482
#2 1002 2011 238098
#3 1001 2012 189367
#4 1002 2012 211499

In the df you provided, there is only one value of obj:
levels(df$obj)
#[1] "100"
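
One caveat about the matching totals above: grep() returns the integer positions of the matching elements, not one TRUE/FALSE per row, so combining two grep() results with & does not test each row against both patterns. The dt[...] totals are identical to the unfiltered aggregate() totals, which suggests no rows were actually dropped. A minimal base-R sketch of the difference, using grepl() instead; the short toy vectors and the name `keep` are made up for illustration:

```r
# Toy vectors mirroring the first four rows of the example data
fund <- c("10E", "10E", "10E", "27E")
func <- c("110000", "122000", "214000", "158000")

pos_fund <- grep("^1.E$", fund)      # positions of matches: 1 2 3
pos_func <- grep("^1.....$", func)   # positions of matches: 1 2 4

# '&' treats every non-zero integer as TRUE, so this is TRUE TRUE TRUE
# regardless of which rows matched both patterns
pos_and <- pos_fund & pos_func

# grepl() returns one logical per element, so '&' works row by row
keep <- grepl("^1.E$", fund) & grepl("^1.....$", func)
keep                                 # TRUE TRUE FALSE FALSE
```

The length mismatch between two position vectors is also what triggers the "longer object length is not a multiple of shorter object length" warning once a third grep() with a different number of matches is added.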
A.K.




----- Original Message -----
From: "Bush,  Daniel P.   DPI" <Daniel.Bush at dpi.wi.gov>
To: "'r-help at r-project.org'" <r-help at r-project.org>
Cc: 
Sent: Thursday, March 14, 2013 5:43 PM
Subject: [R] Grep with wildcards across multiple columns

I have a fairly large data set with six variables set up like the following dummy:

# Create fake data
df <- data.frame(code   = c(rep(1001, 8), rep(1002, 8)),
                 year   = rep(c(rep(2011, 4), rep(2012, 4)), 2),
                 fund   = rep(c("10E", "10E", "10E", "27E"), 4),
                 func   = rep(c("110000", "122000", "214000", "158000"), 4),
                 obj    = rep("100", 16),
                 amount = round(rnorm(16, 50000, 10000)))

What I would like to do is sum the amount variable by code and year, filtering rows using a different wildcard search in each of three columns: "1?E" in fund, "1?????" in func, and "???" in obj. I'm OK turning these into regular expressions:

# Set parameters
par.fund <- "10E"; par.func <- "100000"; par.obj <- "000"
par.fund <- glob2rx(gsub("0", "?", par.fund))
par.func <- glob2rx(gsub("0", "?", par.func))
par.obj <- glob2rx(gsub("0", "?", par.obj))
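
For reference, a small sketch of what these conversions produce. glob2rx() comes from the utils package; the names rx_fund, rx_func and rx_obj are introduced here only for illustration:

```r
# Replace the zero placeholders with '?' and convert each glob to a regex
rx_fund <- glob2rx(gsub("0", "?", "10E"))     # "^1.E$"
rx_func <- glob2rx(gsub("0", "?", "100000"))  # "^1.....$"
rx_obj  <- glob2rx(gsub("0", "?", "000"))     # "^...$"

# Each '.' matches exactly one character, and '^'/'$' anchor the whole string
grepl(rx_fund, c("10E", "27E", "10EX"))       # TRUE FALSE FALSE
```

Note that gsub("0", "?", ...) turns every zero into a wildcard, so a code that contains a literal zero could not be expressed with this convention.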

The problem occurs when I try to combine multiple greps across columns. I'd prefer to use data.table, since it's much faster than plyr and I have 159 different sets of parameters to run through, but I get the same warning either way:

# Doesn't work
library(data.table)
dt <- data.table(df)
eval(parse(text=paste(
  "dt2 <- dt[", "grep('", par.fund, "', fund) & ",
  "grep('", par.func, "', func) & grep('", par.obj, "', obj)",
  ", sum(amount), by=c('code', 'year')]" , sep="")))
# Warning message:
#   In grep("^1.E$", fund) & grep("^1.....$", func) :
#   longer object length is not a multiple of shorter object length

# Also doesn't work
library(plyr)
eval(parse(text=paste(
  "df2 <- ddply(df[grep('", par.fund, "', df$fund) & ",
  "grep('", par.func, "', df$func) & grep('", par.obj, "', df$obj), ]",
  ", .(code, year), summarize, amount = sum(amount))" , sep="")))
# Warning message:
#   In grep("^1.E$", df$fund) & grep("^1.....$", df$func) :
#   longer object length is not a multiple of shorter object length

Clearly, the problem is how I'm trying to combine greps when subsetting rows, but I haven't been able to find a solution that works. Any thoughts? Preferably something that works with data.table.
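
One possible way to set this up with data.table, sketched under assumptions rather than taken from the thread: grepl() is used so that the conditions combine row by row, and a small helper (sum_by_pattern, a hypothetical name) takes the codes as arguments so that eval(parse()) is not needed. The amounts are fixed illustrative values instead of rnorm() draws so the result can be checked by hand.

```r
library(data.table)

# Same layout as the dummy data, with fixed (made-up) amounts
dt <- data.table(code   = c(rep(1001, 8), rep(1002, 8)),
                 year   = rep(c(rep(2011, 4), rep(2012, 4)), 2),
                 fund   = rep(c("10E", "10E", "10E", "27E"), 4),
                 func   = rep(c("110000", "122000", "214000", "158000"), 4),
                 obj    = rep("100", 16),
                 amount = rep(c(1, 10, 100, 1000), 4))

# Hypothetical helper: codes use "0" as the wildcard placeholder,
# following the convention in the original post
sum_by_pattern <- function(dt, fund_code, func_code, obj_code) {
  to_rx <- function(x) glob2rx(gsub("0", "?", x))
  rx_fund <- to_rx(fund_code)
  rx_func <- to_rx(func_code)
  rx_obj  <- to_rx(obj_code)
  # grepl() yields one logical per row, so '&' filters row by row
  dt[grepl(rx_fund, fund) & grepl(rx_func, func) & grepl(rx_obj, obj),
     .(amount = sum(amount)), by = .(code, year)]
}

res <- sum_by_pattern(dt, "10E", "100000", "000")
res  # one row per code/year; only the first two rows of each group are kept
```

With many parameter sets, the helper could be called once per set, for example with Map() over three vectors of codes, and the results combined with rbindlist().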

DB

Daniel Bush
School Finance Consultant
School Financial Services
Wisconsin Department of Public Instruction
PO Box 7841 | Madison, WI 53707-7841
daniel.bush -at- dpi.wi.gov | sfs.dpi.wi.gov
Ph: 608-267-9212 | Fax: 608-266-2840








