[R] Extract from a text file

Jeff Newmiller jdnewmil at dcn.davis.ca.us
Tue May 31 08:12:48 CEST 2016


Please learn to post in plain text (the setting is in your email client... 
somewhere), as HTML is "What We See Is Not What You Saw" on this mailing 
list.  In conjunction with that, try reading some of the fine material 
mentioned in the Posting Guide about making reproducible examples like 
this one:

# You could read in a file
# indta <- readLines( "out.txt" )
# but there is no "current directory" in an email
# so here I have used the dput() function to make source code
# that creates a self-contained R object

indta <- c(
"Mean of weight  group 1, SE of mean  :  72.289037489555276",
" 11.512956539215610",
"Average weight of group 2, SE of Mean :  83.940053900595013",
"  10.198495690144522",
"group 3 mean , SE of Mean     :                78.310441258245469",
" 13.015876679555",
"Mean of weight of group 4, SE of Mean               : 76.967516495101669",
" 12.1254882985", "")

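For anyone unsure how the dput() step mentioned above works, here is a minimal sketch. It assumes a temporary file standing in for out.txt; the names tmp, sample_lines, src and rebuilt are made up for illustration only.

```r
# Hypothetical stand-in for out.txt: write two sample lines to a temp file
tmp <- tempfile( fileext = ".txt" )
writeLines( c( "Mean of weight  group 1, SE of mean  :  72.289037489555276"
             , " 11.512956539215610" ), tmp )
sample_lines <- readLines( tmp )
# dput() prints R source code that recreates the object when pasted
dput( sample_lines )
# Round trip: capture that source text and evaluate it again
src <- capture.output( dput( sample_lines ) )
rebuilt <- eval( parse( text = src ) )
# TRUE when the pasted source reproduces the original object
identical( rebuilt, sample_lines )
unlink( tmp )
```

The text that dput() prints is what you paste into the email, so readers can recreate the object without having your file.
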
# Regular expression patterns are discussed all over the internet
# in many places OTHER than R
# You can start with ?regex, but there are many fine tutorials also

pattern <- "^.*group (\\d+)[^:]*: *([-+0-9.eE]*).*$"
# For this task the regex has to match the whole "first line" of each set
#  ^ =match starting at the beginning of the string
#  .* =any character, zero or more times
#  "group " =match these characters
#  ( =first capture string starts here
#  \\d = any digit (first backslash for R, second backslash for regex)
#  + =one or more of the preceding (any digit)
#  ) =end of first capture string
#  [^:] =any non-colon character
#  * =zero or more of the preceding (non-colon character)
#  : =match a colon exactly
#  " *" =match zero or more spaces
#  ( =second capture string starts here
#  [ =start of a set of equally acceptable characters
#  -+ =either of these characters is acceptable
#  0-9 =any digit would be acceptable
#  . =a period is acceptable (this is inside the [])
#  eE =in case you get exponential notation input
#  ] =end of the set of acceptable characters (number)
#  * =number of acceptable characters can be zero or more
#  ) =second capture string stops here
#  .* =zero or more of any character (just in case)
#  $ =at end of pattern, requires that the match reach the end
#     of the string
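
To see what the two capture groups pick up, the pattern can be tried on a single line first. This is only a sketch using base R's regexec() and regmatches(); the names one_line and pieces are illustrative and not part of the solution.

```r
pattern <- "^.*group (\\d+)[^:]*: *([-+0-9.eE]*).*$"
one_line <- "Average weight of group 2, SE of Mean :  83.940053900595013"
# regexec() records where the whole match and each capture group fall;
# regmatches() then extracts those pieces as character strings
pieces <- regmatches( one_line, regexec( pattern, one_line ) )[[ 1 ]]
pieces[ 2 ]  # first capture: the group number, still as text
pieces[ 3 ]  # second capture: the mean, still as text
# A line holding only a number contains no "group ", so it does not match
grepl( pattern, " 10.198495690144522" )
```

Element 1 of pieces is the whole match, which is why the captures sit at positions 2 and 3.
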

# identify indexes of strings that match the pattern
firstlines <- grep( pattern, indta )
# Replace the matched portion (entire string) with the first capture 
# string
v1 <- as.numeric( sub( pattern, "\\1", indta[ firstlines ] ) )
# Replace the matched portion (entire string) with the second capture 
# string
v2 <- as.numeric( sub( pattern, "\\2", indta[ firstlines ] ) )
# Convert the lines just after the first lines to numeric
v3 <- as.numeric( indta[ firstlines + 1 ] )
# put it all into a data frame
result <- data.frame( Group = v1, Mean = v2, SE = v3 )

Figuring out how to deliver your result (output) is a separate question 
that depends on where you want it to go.
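
As one possibility, here is a sketch that prints a table in the "Gr1 ..." layout from the original question. It assumes you want every digit exactly as it appears in the file, so it keeps the captured pieces as text: as.numeric() stores doubles, which carry only about 15 to 17 significant digits. The name out and the two-space separator are my own choices, and the input is cut down to two records to keep it short.

```r
# Rebuild a small version of the input (first two records only)
indta <- c(
"Mean of weight  group 1, SE of mean  :  72.289037489555276",
" 11.512956539215610",
"Average weight of group 2, SE of Mean :  83.940053900595013",
"  10.198495690144522" )
pattern <- "^.*group (\\d+)[^:]*: *([-+0-9.eE]*).*$"
firstlines <- grep( pattern, indta )
# Keep the captured pieces as character strings so that no digits
# are lost to double-precision rounding or print formatting
out <- data.frame(
  Group = paste0( "Gr", sub( pattern, "\\1", indta[ firstlines ] ) ),
  Mean  = sub( pattern, "\\2", indta[ firstlines ] ),
  SE    = trimws( indta[ firstlines + 1 ] ),
  stringsAsFactors = FALSE )
# Write to the console; replace stdout() with a file name to save it
write.table( out, file = stdout(), quote = FALSE, sep = "  ",
             row.names = FALSE, col.names = FALSE )
```

If you need to do arithmetic on the values, convert them with as.numeric() as in the main example and accept the rounding.
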

On Mon, 30 May 2016, Val wrote:

> Hi all,
>
> I have a messy text file, and I want to extract some information from
> it. Here is the text file (out.txt). One record has two lines. The mean
> comes in the first line and the SE of the mean is on the second line.
> Here is a sample of the data.
>
> Mean of weight  group 1, SE of mean  :  72.289037489555276
> 11.512956539215610
> Average weight of group 2, SE of Mean :  83.940053900595013
>  10.198495690144522
> group 3 mean , SE of Mean     :                78.310441258245469
> 13.015876679555
> Mean of weight of group 4, SE of Mean               : 76.967516495101669
> 12.1254882985
>
> I want to produce the following table. How do I read it first and then
> produce a
>
>
> Gr1  72.289037489555276   11.512956539215610
> Gr2  83.940053900595013   10.198495690144522
> Gr3  78.310441258245469   13.015876679555
> Gr4  76.967516495101669   12.1254882985
>
>
> Thank you in advance
>
> 	[[alternative HTML version deleted]]

---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnewmil at dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                       Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k


