[R] split character vector by multiple keywords simultaneously

sunny sunayan at gmail.com
Sun May 8 13:43:26 CEST 2011


Andrew Robinson-6 wrote:
> 
> A hack would be to use gsub() to prepend e.g. XXX to the keywords that
> you want, perform a strsplit() to break the lines into component
> strings, and then substr() to extract the pieces that you want from
> those strings.
> 
> Cheers
> 
> Andrew
> 

Thanks, that got me started. I am sure there are much easier ways of doing
this, but in case someone comes looking, here's my solution:

keywordlist <- c("Company name:", "General manager:", "Manager:")

# Attach "XXX" to the beginning of each keyword
# (temp is the character vector being split, one record per element):
for (i in seq_along(keywordlist)) {
  temp <- gsub(keywordlist[i], paste("XXX", keywordlist[i], sep=""), temp, fixed=TRUE)
}

# Split each row into a list:
temp <- strsplit(temp,"XXX")
# Eliminate empty elements:
temp <- lapply(temp, function(x) x[which(x!='')])
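
To make the tagging and splitting steps concrete, here is a small sketch on made-up input. The two records and the names example_lines and pieces are hypothetical; the original data is not shown in this thread. Note that gsub() is case-sensitive, so "Manager:" does not match inside "General manager:".

```r
# Hypothetical sample records (not from the original data).
example_lines <- c(
  "Company name: Acme Ltd General manager: J. Smith Manager: A. Jones",
  "Company name: Widgets Inc Manager: B. Lee"
)
keywordlist <- c("Company name:", "General manager:", "Manager:")

# Prefix every keyword with the marker "XXX"; fixed = TRUE matches the keyword literally.
for (kw in keywordlist) {
  example_lines <- gsub(kw, paste0("XXX", kw), example_lines, fixed = TRUE)
}

# Split on the marker, then drop the empty string left before the first keyword.
pieces <- strsplit(example_lines, "XXX", fixed = TRUE)
pieces <- lapply(pieces, function(x) x[x != ""])

print(pieces[[2]])
```

The second record has no "General manager:" entry, so it yields two pieces while the first yields three. The marker only works if "XXX" never occurs in the data itself.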

# Since each keyword happens to include a colon at the end, split each list
# element generated above into exactly two parts, pre-colon for the keyword
# and post-colon for the value. Since values may contain colons themselves,
# avoid spurious matches by using n=2 in the str_split_fixed function from the
# stringr package:
library(stringr)
temp <- lapply(temp,function(x) str_split_fixed(x,':',n=2))
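
For readers without stringr, the same "split at the first colon only" idea can be sketched in base R with regexpr() and regmatches(). This swaps in a different technique from the str_split_fixed() call above; the sample string and the names piece, parts, keyword and value are made up for illustration.

```r
# Hypothetical piece whose value itself contains colons.
piece <- "Company name: Acme: Tools: Division"

# regexpr() locates only the first ":"; invert = TRUE returns the text
# before and after that single match.
parts <- regmatches(piece, regexpr(":", piece), invert = TRUE)[[1]]

keyword <- parts[1]
value <- trimws(parts[2])  # drop the space that followed the colon

print(keyword)
print(value)
```

Here keyword is "Company name" and value is "Acme: Tools: Division", so the later colons stay inside the value.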

# Convert each list element into a data frame. The transpose makes sure that
# the first row of each data frame is the set of keywords. Each data frame has
# 2 rows - one with the keywords and the second with the values:
temp <- lapply(temp, function(x) as.data.frame(t(x), stringsAsFactors=FALSE))

# Copy the first row of each data frame to the name of the corresponding
# column:
for (i in seq_along(temp)) {
  names(temp[[i]]) <- as.character(temp[[i]][1,])
}

# Now join all the data frames in the list by column names. This way it
# doesn't matter if some keywords are absent in some cases. rbind.fill comes
# from the plyr package:
library(plyr)
final_data <- do.call(rbind.fill,temp)

# We now have one large data frame with the odd numbered rows containing the
# keywords and the even numbered rows containing the values. Since we already
# have the keywords in the name, we can eliminate the odd numbered rows:
final_data <- final_data[seq(2,dim(final_data)[1],2),]
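
For anyone who wants to try the whole idea end to end without stringr or plyr, the following is a rough base-R sketch under stated assumptions, not the code above. The sample records and the helper parse_record are hypothetical, and missing keywords are filled with NA by indexing each record on the union of all keywords seen.

```r
# Hypothetical sample records (not from the original data).
records <- c(
  "Company name: Acme Ltd General manager: J. Smith Manager: A. Jones",
  "Company name: Widgets Inc Manager: B. Lee"
)
keywordlist <- c("Company name:", "General manager:", "Manager:")

# Hypothetical helper: turn one record into a named character vector.
parse_record <- function(rec) {
  # Tag each keyword with the marker, then split on the marker.
  for (kw in keywordlist) {
    rec <- gsub(kw, paste0("XXX", kw), rec, fixed = TRUE)
  }
  pieces <- strsplit(rec, "XXX", fixed = TRUE)[[1]]
  pieces <- pieces[pieces != ""]
  # Split each piece at its first colon only.
  parts <- regmatches(pieces, regexpr(":", pieces), invert = TRUE)
  values <- trimws(vapply(parts, `[`, character(1), 2))
  names(values) <- vapply(parts, `[`, character(1), 1)
  values
}

parsed <- lapply(records, parse_record)

# Union of all keywords seen; indexing by name gives NA where one is absent.
all_keys <- unique(unlist(lapply(parsed, names)))
rows <- lapply(parsed, function(v) unname(v[all_keys]))
final_data <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)
names(final_data) <- all_keys

print(final_data)
```

This gives one row per record and one column per keyword, with NA under "General manager" for the second record, and there are no keyword rows to strip afterwards.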

-S.

--
View this message in context: http://r.789695.n4.nabble.com/split-character-vector-by-multiple-keywords-simultaneously-tp3497033p3506776.html
Sent from the R help mailing list archive at Nabble.com.


