[R] Split strings based on multiple patterns (plain text)

Sat Oct 15 22:32:18 CEST 2016

Thank you David Wolfskill, David Winsemius, and Gabor! All very
helpful and interesting fixes for the problem (compiled below)! Now I
will see which one works best on the 944 rows that each have a cell of
smooshed attributes...the attribute names should be the same in all
the rows, if there is any mercy :)

Joe Ceradini
University of Wyoming

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 10/14/16, David Wolfskill <david at catwhisker.org> wrote:
> Happy Friday, indeed.
>
> It seems to me that the data need a bit of cleanup before attempting to
> parse -- for example, that "F" looks to be improperly delimited by ':'
> on either side.  I can't tell from a single example if that's typical
> (either for that field, or for random fields throughout the complete
> dataset).  On the off-chance it's the former, here's a bit of exercise
> that may lead you a bit closer to a solution:
>
> First, starting with "ugly":
>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity: clear:
>> Manmade:no  Permanence:permanent:  Max water depth: <3: Primary substrate:
>> Silt/Mud: Evidence of cattle grazing: none: Shoreline Emergent Veg(%):
>> 1-25: Fish present: yes: Fish species: unkwn: no amphibians observed")
>> ugly
> [1] "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent:  Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # First, see what a naive strsplit() does:
>
>> strsplit(ugly, ":")
> [[1]]
>  [1] "Water temp"                  "14"
>  [3] " F Waterbody type"           "Permanent Lake/Pond"
>  [5] " Water pH"                   "Unkwn"
>  [7] " Conductivity"               "Unkwn"
>  [9] " Water color"                " Clear"
> [11] " Water turbidity"            " clear"
> [13] " Manmade"                    "no  Permanence"
> [15] "permanent"                   "  Max water depth"
> [17] " <3"                         " Primary substrate"
> [19] " Silt/Mud"                   " Evidence of cattle grazing"
> [21] " none"                       " Shoreline Emergent Veg(%)"
> [23] " 1-25"                       " Fish present"
> [25] " yes"                        " Fish species"
> [27] " unkwn"                      " no amphibians observed"
>
> # OK; let's fix the "F":
>
>> ugly1 <- sub(": F ", "F: ", ugly)
>> ugly1
> [1] "Water temp:14F: Waterbody type:Permanent Lake/Pond: Water pH:Unkwn:
> Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: Manmade:no
> Permanence:permanent:  Max water depth: <3: Primary substrate: Silt/Mud:
> Evidence of cattle grazing: none: Shoreline Emergent Veg(%): 1-25: Fish
> present: yes: Fish species: unkwn: no amphibians observed"
>
> # Now, that substring "Manmade:no  Permanence:permanent:" is problematic;
> # the "  " in there should apparently be ": " -- but we can't just do that
> # to all "  " substrings, because that would also affect
> # "Permanence:permanent:  Max water depth: <3:" -- the difference, though,
> # is that the one we don't want to change contains ":  ", so let's change
> # those.  I'm assuming(!) that we don't really care about leading or
> # trailing spaces in the fields:
>
>> ugly2 <- gsub(" *: *", ":", ugly1)
>> ugly2
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no  Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Now that "  " shows up like a sore thumb.  Just to make the point even
> # clearer, try the "naive" strsplit on what we have:
>
>> strsplit(ugly2, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no  Permanence"
> [15] "permanent"                  "Max water depth"
> [17] "<3"                         "Primary substrate"
> [19] "Silt/Mud"                   "Evidence of cattle grazing"
> [21] "none"                       "Shoreline Emergent Veg(%)"
> [23] "1-25"                       "Fish present"
> [25] "yes"                        "Fish species"
> [27] "unkwn"                      "no amphibians observed"
>
>>
>
> # Note element [14]:  that's the one we need to fix.  I'll assume(!)
> # that that sort of thing may occur just about anywhere, so let's just
> # whack 'em all:
>
>> ugly3 <- gsub("  ", ":", ugly2)
>> ugly3
> [1] "Water temp:14F:Waterbody type:Permanent Lake/Pond:Water
> pH:Unkwn:Conductivity:Unkwn:Water color:Clear:Water
> turbidity:clear:Manmade:no:Permanence:permanent:Max water depth:<3:Primary
> substrate:Silt/Mud:Evidence of cattle grazing:none:Shoreline Emergent
> Veg(%):1-25:Fish present:yes:Fish species:unkwn:no amphibians observed"
>
> # Again, check a naive strsplit():
>
>> strsplit(ugly3, ":")
> [[1]]
>  [1] "Water temp"                 "14F"
>  [3] "Waterbody type"             "Permanent Lake/Pond"
>  [5] "Water pH"                   "Unkwn"
>  [7] "Conductivity"               "Unkwn"
>  [9] "Water color"                "Clear"
> [11] "Water turbidity"            "clear"
> [13] "Manmade"                    "no"
> [15] "Permanence"                 "permanent"
> [17] "Max water depth"            "<3"
> [19] "Primary substrate"          "Silt/Mud"
> [21] "Evidence of cattle grazing" "none"
> [23] "Shoreline Emergent Veg(%)"  "1-25"
> [25] "Fish present"               "yes"
> [27] "Fish species"               "unkwn"
> [29] "no amphibians observed"
>
>>
>
> # OK; not what we want, but it's a lot closer.  Now, watch this:
>
>> ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
>> strsplit(ugly4, "\001")
> [[1]]
>  [1] "Water temp:14F"                     "Waterbody type:Permanent Lake/Pond"
>  [3] "Water pH:Unkwn"                     "Conductivity:Unkwn"
>  [5] "Water color:Clear"                  "Water turbidity:clear"
>  [7] "Manmade:no"                         "Permanence:permanent"
>  [9] "Max water depth:<3"                 "Primary substrate:Silt/Mud"
> [11] "Evidence of cattle grazing:none"    "Shoreline Emergent Veg(%):1-25"
> [13] "Fish present:yes"                   "Fish species:unkwn"
> [15] "no amphibians observed"
>
>>
>
> # At this point, at least elements [1] - [14] are each of the form
> # "tag:value", and thus, readily parsable.  Element [15] appears to be
> # a somewhat-random comment; I suppose you could check for elements that
> # lack a (single) ':' and treat them "specially"....
>
> I hope that helps.  Good luck!
>
> Peace,
> david
> --
> David H. Wolfskill				david at catwhisker.org
> Those who would murder in the name of God or prophet are blasphemous
> cowards.
>
> See http://www.catwhisker.org/~david/publickey.gpg for my public key.
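
For the record, the "readily parsable" step at the end of David Wolfskill's message could look roughly like the sketch below. It assumes the cell has no embedded newlines (the string is rebuilt here with paste0() for that reason) and that any element without a ':' is a free-text comment. The names `parts`, `fields` and `comment` are only illustrative.

```r
# Same example string as above, built with paste0() so that no stray
# newlines sneak in (an assumption; a real cell may well contain them).
ugly <- paste0(
  "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water pH:Unkwn: ",
  "Conductivity:Unkwn: Water color: Clear: Water turbidity: clear: ",
  "Manmade:no  Permanence:permanent:  Max water depth: <3: ",
  "Primary substrate: Silt/Mud: Evidence of cattle grazing: none: ",
  "Shoreline Emergent Veg(%): 1-25: Fish present: yes: ",
  "Fish species: unkwn: no amphibians observed")

# The four cleanup steps from the message above.
ugly1 <- sub(": F ", "F: ", ugly)
ugly2 <- gsub(" *: *", ":", ugly1)
ugly3 <- gsub("  ", ":", ugly2)
ugly4 <- gsub("([^:]*:[^:]*): *", "\\1\001", ugly3, perl = TRUE)
parts <- strsplit(ugly4, "\001", fixed = TRUE)[[1]]

# Elements containing a ':' are "tag:value"; the rest are treated as comments.
has_tag <- grepl(":", parts, fixed = TRUE)
tags    <- sub(":.*$", "", parts[has_tag])       # text before the first ':'
values  <- sub("^[^:]*:", "", parts[has_tag])    # text after the first ':'
fields  <- setNames(values, tags)                # named character vector
comment <- parts[!has_tag]
```

`fields[["Manmade"]]` then gives the value for a single tag, and `as.data.frame(as.list(fields))` would turn the vector into a one-row data frame if that is the shape needed.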

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 10/15/16, Gabor Grothendieck <ggrothendieck at gmail.com> wrote:
> Replace newlines and colons with a space, since they seem to be junk.
> Then generate a pattern that matches the attribute names, replace each
> match with a comma, and finally read what is left into a data frame,
> using the attributes as column names.
>
> (I have indented each line of code below by 2 spaces so if any line
> starts before that then it's been wrapped around by the email and
> needs to be adjusted.)
>
>   attributes <-
>   c("Water temp", "Waterbody type", "Water pH", "Conductivity",
>   "Water color", "Water turbidity", "Manmade", "Permanence",
>   "Max water depth", "Primary substrate", "Evidence of cattle grazing",
>   "Shoreline Emergent Veg(%)", "Fish present", "Fish species")
>
>   ugly2 <- gsub("[:\n]", " ", ugly)
>
>   pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
>   ugly3 <- gsub(pat, ",", ugly2)
>
>   dd <- read.table(text = ugly3, sep = ",", strip.white = TRUE,
>     col.names = c("", attributes))[-1]
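
Since gsub() is vectorized and read.table(text = ...) reads one row per element, Gabor's recipe should extend to a whole column of cells. The sketch below uses a hypothetical two-row vector `cells` and a shortened attribute list just to keep it small. It assumes every cell contains every attribute, in the same order, and that no value contains a comma.

```r
# Hypothetical two-row example with a shortened attribute list.
attributes <- c("Water temp", "Manmade", "Shoreline Emergent Veg(%)")
cells <- c(
  "Water temp:14: F Manmade:no  Shoreline Emergent Veg(%): 1-25:",
  "Water temp:20: F Manmade:yes  Shoreline Emergent Veg(%):\n26-50:")

# Colons and newlines become spaces.
cells2 <- gsub("[:\n]", " ", cells)

# Punctuation in the attribute names is replaced by "." so that "(" and ")"
# are not read as regex grouping; each attribute name then becomes a comma.
pat <- paste(gsub("([[:punct:]])", ".", attributes), collapse = "|")
cells3 <- gsub(pat, ",", cells2)

# One line per cell; the empty field before the first comma is dropped.
dd <- read.table(text = cells3, sep = ",", strip.white = TRUE,
                 col.names = c("", attributes))[-1]
```

With the default check.names = TRUE the column names come out in syntactic form (for example `Water.temp`). If a row is missing an attribute the field counts no longer line up, so it may be worth checking the number of commas per row before calling read.table().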

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

On 10/15/16, David Winsemius <dwinsemius at comcast.net> wrote:
>
>> On Oct 14, 2016, at 6:53 PM, Joe Ceradini <joeceradini at gmail.com> wrote:
>>
>> Hopefully this looks better. I did not realize gmail default was html.
>>
>> I have a dataframe with a column that has many fields smashed together.
>> I need to split the strings in the column into separate columns based
>> on patterns.
>>
>> Example of a string that needs to be split:
>>
>> ugly <- c("Water temp:14: F Waterbody type:Permanent Lake/Pond: Water
>> pH:Unkwn: Conductivity:Unkwn: Water color: Clear: Water turbidity:
>> clear: Manmade:no  Permanence:permanent:  Max water depth: <3: Primary
>> substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline
>> Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no
>> amphibians observed")
>> ugly
>>
>> Far as I can tell, there is not a single pattern that would work for
>> splitting. Splitting on ":" is close, but not quite right. Each of the
>> below attributes should be in a separate column, and are present in
>> the string (above) that needs to be split:
>>
>> attributes <- c("Water temp", "Waterbody type", "Water pH",
>> "Conductivity", "Water color", "Water turbidity", "Manmade",
>> "Permanence", "Max water depth", "Primary substrate", "Evidence of
>> cattle grazing", "Shoreline Emergent Veg(%)", "Fish present", "Fish
>> species")
>>
>> Conceptually, I want to use the vector of attributes to split the
>> string. However, strsplit only uses the 1st value of the attributes
>> object:
>>
>> strsplit(ugly, attributes).
>
> I tried this:
>
> strsplit( ugly, split=paste0(attributes, collapse="|")  )
>
> And noticed some of the attributes were not actually splitting, so I went
> back and redid the data entry after making sure that there were no "\n"'s
> in the middle of attribute names:
>
> dput(attributes)
> c("Water temp", "Waterbody type", "Water pH", "Conductivity",
> "Water color", "Water turbidity", "Manmade", "Permanence",
> "Max water depth", "Primary substrate", "Evidence of cattle grazing",
> "Shoreline Emergent Veg(%)", "Fish present", "Fish species")
>
> strsplit( ugly, split=paste0(attributes, collapse="|")  )
> [[1]]
>  [1] ""
>  [2] ":14: F "
>  [3] ":Permanent Lake/Pond: Water\npH:Unkwn: "
>  [4] ":Unkwn: "
>  [5] ": Clear: "
>  [6] ":\nclear: "
>  [7] ":no  "
>  [8] ":permanent:  "
>  [9] ": <3: Primary\nsubstrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\nEmergent Veg(%): 1-25: "
> [10] ": yes: Fish species: unkwn: no\namphibians observed"
>
>>
>> Should I loop through the values of "attributes"?
>> Is there an argument in strsplit I'm missing that will do what I want?
>
> I don't think strsplit has such an argument. There may be packages that will
> support this. Perhaps the gsubfn package?
>
>
>> Different approach altogether?
>>
>> Thanks! Happy Friday.
>> Joe
>>
>> ______________________________________________
>> R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius
> Alameda, CA, USA
>
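
In the output above, the attributes that failed to split appear to be the ones broken by a "\n" in the data, plus "Shoreline Emergent Veg(%)", where the parentheses are read as a regex group. A possible fix, as a sketch only: flatten the whitespace in the data first, and quote each attribute with \Q...\E (available with perl = TRUE) so it is matched literally. It assumes every attribute occurs exactly once per cell and in the same order as in `attributes`; the names `flat`, `pieces` and `values` are illustrative.

```r
attributes <- c("Water temp", "Waterbody type", "Water pH", "Conductivity",
                "Water color", "Water turbidity", "Manmade", "Permanence",
                "Max water depth", "Primary substrate",
                "Evidence of cattle grazing", "Shoreline Emergent Veg(%)",
                "Fish present", "Fish species")

# Example cell, including the line breaks that the email wrapping introduced.
ugly <- paste0(
  "Water temp:14: F Waterbody type:Permanent Lake/Pond: Water\npH:Unkwn: ",
  "Conductivity:Unkwn: Water color: Clear: Water turbidity:\nclear: ",
  "Manmade:no  Permanence:permanent:  Max water depth: <3: Primary\n",
  "substrate: Silt/Mud: Evidence of cattle grazing: none: Shoreline\n",
  "Emergent Veg(%): 1-25: Fish present: yes: Fish species: unkwn: no\n",
  "amphibians observed")

# Collapse every run of whitespace (including "\n") to a single space.
flat <- gsub("\\s+", " ", ugly)

# \Q...\E quotes each attribute so "(" and ")" are matched literally.
pat <- paste0("\\Q", attributes, "\\E", collapse = "|")
pieces <- strsplit(flat, pat, perl = TRUE)[[1]]

# pieces[1] is the (empty) text before the first attribute; drop it, then
# strip the leading and trailing colons and spaces from each value.
values <- gsub("^[: ]+|[: ]+$", "", pieces[-1])
names(values) <- attributes
```

The trailing comment ("no amphibians observed") ends up attached to the last value, "Fish species", so it would still need to be peeled off separately.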