[R] Split strings based on multiple patterns (plain text)

David Winsemius dwinsemius at comcast.net
Sat Oct 15 08:40:49 CEST 2016


> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
> 
> Hopefully this looks better. I did not realize gmail default was html.
> 
> I have a dataframe with a column that has many fields smashed together.
> I need to split the strings in the column into separate columns based
> on patterns.
> 
> Example of a string that needs to be split:
> 
> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
> clear: Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
> substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
> Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
> amphibians observed")
> ugly
> 
> Far as I can tell, there is not a single pattern that would work for
> splitting. Splitting on ":" is close, but not quite right. Each of the
> below attributes should be in a separate column, and are present in
> the string (above) that needs to be split:
> 
> attributes <- c("Water temp", "Waterbody type", "Water pH",
> "Conductivity", "Water color", "Water turbidity", "Manmade",
> "Permanence", "Max water depth", "Primary substrate", "Evidence of
> cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish
> species")
> 
> Conceptually, I want to use the vector of attributes to split the
> string. However, strsplit only uses the 1st value of the attributes
> object:
> 
> strsplit(ugly, attributes)

I tried this:

strsplit( ugly, split=paste0(attributes, collapse="|")  )

I noticed that some of the attributes were not actually splitting, so I went back and redid the data entry for attributes after making sure that there were no "\n"'s in the middle of attribute names:

dput(attributes)
c("Water temp", "Waterbody type", "Water pH", "Conductivity", 
"Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth", 
"Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)", 
"Fish present", "Fish species")

strsplit( ugly, split=paste0(attributes, collapse="|")  )
[[1]]
 [1] ""                                                                                                        
 [2] ":14: F "                                                                                                 
 [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "                                                                 
 [4] ":Unkwn: "                                                                                                
 [5] ": Clear: "                                                                                               
 [6] ":\nclear: "                                                                                              
 [7] ":no  "                                                                                                   
 [8] ":permanent:  "                                                                                           
 [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
[10] ": yes: Fish species: unkwn: no\namphibians observed"        
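
Looking at the pieces that did not split, two things seem to get in the way: `ugly` itself still has "\n" in the middle of some attribute names (e.g. "Water\npH"), and "Shoreline Emergent Veg(%)" contains parentheses, which a regex treats as a group rather than as literal characters. Here is a minimal sketch of one way around both, assuming every attribute name appears at most once and the string does not end with an attribute name. The object names `clean`, `pattern`, `keys`, `pieces` and `values` are just mine for illustration:

```r
ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
clear: Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
amphibians observed")

attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
  "Water color", "Water turbidity", "Manmade", "Permanence", "Max water depth",
  "Primary substrate", "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
  "Fish present", "Fish species")

# Collapse every run of whitespace (including "\n") to a single space
clean <- gsub("\\s+", " ", ugly)

# Wrap each name in \Q...\E so PCRE matches it literally, "(%)" included
pattern <- paste0("\\Q", attributes, "\\E", collapse = "|")

# The attribute names actually found, in the order they occur in the string
keys <- regmatches(clean, gregexpr(pattern, clean, perl = TRUE))[[1]]

# The text between the names; the first piece is whatever precedes the
# first name (here an empty string), so it is dropped
pieces <- strsplit(clean, pattern, perl = TRUE)[[1]]

# Trim the leading/trailing colons and spaces and label each value
values <- gsub("^[: ]+|[: ]+$", "", pieces[-1])
names(values) <- keys

values
```

Taking the names from `regmatches` rather than from `attributes` directly means the labels should still line up if an attribute is missing from a particular string or appears in a different order.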

> 
> Should I loop through the values of "attributes"?
> Is there an argument in strsplit I'm missing that will do what I want?

I don't think strsplit has such an argument. There may be packages that will support this. Perhaps the gsubfn package?


> Different approach altogether?
> 
> Thanks! Happy Friday.
> Joe
> 

David Winsemius
Alameda, CA, USA


