[R] Separating a Complicated String Vector
John Posner
john.posner at MJBIOSTAT.COM
Sun Jan 4 16:43:06 CET 2015
I'm coming to R from Python, so I coded a Python3 solution:
#####################
data = """alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
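""".splitlines()  # split on newlines so "little rock" stays one entry
state_list = ["alabama", "arkansas", "alaska"] # etc.
return_list = []
for word in data:
    if word in state_list:
        # a state line: remember it for the cities that follow
        current_state = word
    else:
        return_list.append([current_state, word])
print(return_list)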
#####################
... and then translated it to R:
#####################
data = "alabama
bates
tuscaloosa
smith
arkansas
fayette
little rock
alaska
juneau
nome
"
data = strsplit(data, split="\n")[[1]]
states = vector()
cities = vector()
for (word in data) {
  if (word %in% tolower(state.name)) {
    current_state = word
  } else {
    states = c(states, current_state)
    cities = c(cities, word)
  }
}
print(data.frame(V1=states, V2=cities))
#####################
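The running-count idea in David's reply (quoted below) can also be sketched in Python, without a current_state variable. Treat this as a rough sketch only: state_names is a made-up three-state subset for illustration, and itertools.accumulate stands in for R's cumsum.

```python
from itertools import accumulate

lines = ["alabama", "bates", "tuscaloosa", "smith",
         "arkansas", "fayette", "little rock",
         "alaska", "juneau", "nome"]
state_names = {"alabama", "arkansas", "alaska"}  # assumed subset, not all 50

# Flag each line as a state or not.
is_state = [line in state_names for line in lines]

# Running count of state lines seen so far (the analogue of cumsum in R).
group = list(accumulate(int(flag) for flag in is_state))

# The k-th state line is the label for every city whose running count is k.
states_seen = [line for line, flag in zip(lines, is_state) if flag]
pairs = [(states_seen[g - 1], line)
         for line, flag, g in zip(lines, is_state, group)
         if not flag and g > 0]  # g > 0 skips any city listed before the first state

for state, city in pairs:
    print(state, city, sep="\t")
```

Because the input is a list of whole lines, "little rock" stays a single city here.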
-John
> -----Original Message-----
> From: R-help [mailto:r-help-bounces at r-project.org] On Behalf Of David Winsemius
> Sent: Sunday, January 04, 2015 2:48 AM
> To: npretnar
> Cc: R-help at r-project.org
> Subject: Re: [R] Separating a Complicated String Vector
>
>
> On Jan 3, 2015, at 9:20 PM, npretnar wrote:
>
> > Sorry. Bad example on my part. Try this. V1 is ...
> >
> > V1
> > alabama
> > bates
> > tuscaloosa
> > smith
> > arkansas
> > fayette
> > little rock
> > alaska
> > juneau
> > nome
> >
> > And I want:
> >
> > V1 V2
> > alabama bates
> > alabama tuscaloosa
> > alabama smith
> > arkansas fayette
> > arkansas little rock
> > alaska juneau
> > alaska nome
>
>
> dat$is_state <- grepl(tolower(paste(state.name, collapse="|")), dat$V1)
>
> dat$thisstate <- cumsum(rownames(dat) %in% which(dat$is_state) )
> dat2 <- data.frame(V1 = dat$V1[dat$is_state][dat$thisstate[!dat$is_state]],
>                    V2 = dat$V1[!dat$is_state])
>
>
> > dat2
> V1 V2
> 1 alabama bates
> 2 alabama tuscaloosa
> 3 alabama smith
> 4 arkansas fayette
> 5 arkansas little
> 6 arkansas rock
> 7 alaska juneau
> 8 alaska nome
>
> --
> David.
>
> >
> > This is more representative of the problem, extended to all 50 states.
> >
> > - Nick
> >
> >
> > On Jan 3, 2015, at 9:22 PM, Ista Zahn wrote:
> >
> >> I'm not sure what's so complicated about that (am I missing
> >> something?). You can search using grep, and replace using gsub, so
> >>
> >> tmpDF <- read.table(text="V1 V2
> >> A 5
> >> a1 1
> >> a2 1
> >> a3 1
> >> a4 1
> >> a5 1
> >> B 4
> >> b1 1
> >> b2 1
> >> b3 1
> >> b4 1",
> >> header=TRUE)
> >> tmpDF <- tmpDF[grepl("[0-9]", tmpDF$V1), ]
> >> data.frame(tmpDF, V3 = toupper(gsub("[0-9]", "", tmpDF$V1)))
> >>
> >> Seems to do the trick.
> >>
> >> Best,
> >> Ista
> >>
> >> On Sat, Jan 3, 2015 at 9:41 PM, npretnar <npretnar at gmail.com> wrote:
> >>> I have a string variable (V1) in a data frame structured as follows:
> >>>
> >>> V1 V2
> >>> A 5
> >>> a1 1
> >>> a2 1
> >>> a3 1
> >>> a4 1
> >>> a5 1
> >>> B 4
> >>> b1 1
> >>> b2 1
> >>> b3 1
> >>> b4 1
> >>>
> >>> I want the following:
> >>>
> >>> V1 V2 V3
> >>> a1 1 A
> >>> a2 1 A
> >>> a3 1 A
> >>> a4 1 A
> >>> a5 1 A
> >>> b1 1 B
> >>> b2 1 B
> >>> b3 1 B
> >>> b4 1 B
> >>>
> >>> I am not sure how to go about making this transformation besides
> >>> writing a long vector that contains each of the categorical string names
> >>> (these are state names, so it would be a really long vector). Any help
> >>> would be greatly appreciated.
> >>>
> >>> Thanks,
> >>>
> >>> Nicholas Pretnar
> >>> Mizzou Economics Grad Assistant
> >>> npretnar at gmail.com
>
>
> David Winsemius
> Alameda, CA, USA