[R] Narrowing values collected from .txt file

Morway, Eric emorway at usgs.gov
Wed Aug 28 20:28:11 CEST 2013


A relatively concise, commented, working solution to the problem that
originally motivated this thread is below.  I suspect the approach has a
major inefficiency: the "scan" call inside the function "g".  As written,
the code re-opens and re-reads the file length(matched) times, skipping
forward from the top on every pass, rather than reading sequentially
through to the next pertinent section of the txt file.  Does anyone have a
more efficient approach in mind so I don't have to wait half an hour for
the results?  (The only adjustment needed to run the code that follows is
to point "txt" to wherever the attached file is placed.)

# The file that the code works on is attached as: MCR_Budgets.zip
# (76MB uncompressed)

# where is the file? (original dat file is ~147MB, only half of this file
# is attached)
txt<-"c:/temp/MCR_Budgets.txt"

# Demarcation header to narrow list of retrieved 'Recharge' values
hdr_str<-"Flow Budget for Zone  2"

# string to identify lines with desired values
srch_str<-"  RECHARGE ="

# retrieves desired values
g<-function(txt_con, hdr_str, srch_str, from, to, ...) {

    L <- readLines(txt_con)

    #matched contains the line #s w/ hdr_str
    matched <- grep(hdr_str, L, value = FALSE, ...)

    #initialize output list
    fetched_list<-numeric()

    #for each instance of hdr_str, loop (seq_along is safe if nothing matched)
    for(i in seq_along(matched)){

      #retrieve a section of text following each hdr_str, suspect this
      #is highly inefficient!!!
      snippet<-scan(txt_con, what=character(), skip=matched[i]-1, n=42,
                    sep='\n')

      #get the two lines containing 'srch_str' within the short
      #section of retrieved text
      fetched <- grep(srch_str, snippet, value=TRUE)

      #append output vector for plotting time series
      fetched_list <- c(fetched_list,
                        as.numeric(substring(fetched, from, to)))

      #monitor
      print(i)
    }

    #return desired values
    as.numeric(fetched_list)
}

# The results of system.time reflect the fact that the function was run on
# the full 147 MB file, only half of which is attached.
system.time(
  rech_z2<-g(txt,hdr_str,srch_str,37,51)
)
#   user  system elapsed
#1740.48   36.08 1825.77
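
One possible way around the repeated scan() calls, offered as a sketch
rather than a tested answer for the full file: readLines() has already
pulled every line into L, so the values can be picked out of L directly.
The hypothetical function g2 below finds every zone header, assigns each
"RECHARGE" line to the nearest header above it with findInterval(), and
keeps only the lines whose header matches hdr_str.  The names g2, any_hdr
and mk are made up for illustration, and the small demo input is rebuilt
from the sample in the quoted message on the assumption that the "=" sits
in column 36 of the real file.

```r
# Sketch only: one pass over the file, no scan() inside a loop.
# any_hdr is an assumed pattern that matches the header of every zone.
g2 <- function(txt_con, hdr_str, srch_str, from, to,
               any_hdr = "Flow Budget for Zone") {
  L <- readLines(txt_con)

  # line numbers of all zone headers, and which of them are for the wanted zone
  all_hdr <- grep(any_hdr, L, fixed = TRUE)
  want    <- grepl(hdr_str, L[all_hdr], fixed = TRUE)

  # line numbers of every line holding the search string
  hits <- grep(srch_str, L, fixed = TRUE)

  # for each hit, index of the nearest header at or above it (0 if none)
  owner <- findInterval(hits, all_hdr)
  keep  <- owner > 0
  keep[keep] <- want[owner[keep]]

  as.numeric(substring(L[hits[keep]], from, to))
}

# Demo input: a condensed excerpt of the posted sample (assumed column layout)
mk <- function(term, value) sprintf("%34s =%s", term, value)
demo <- c(
  "     Flow Budget for Zone  1 at Time Step   1 of Stress Period   2",
  mk("RECHARGE", "   0.12898E+06"),
  mk("UZF RECHARGE", "    0.0000"),
  mk("RECHARGE", "    0.0000"),
  "     Flow Budget for Zone  2 at Time Step   1 of Stress Period   2",
  mk("RECHARGE", "   0.274161E+06"),
  mk("UZF RECHARGE", "    0.0000"),
  mk("RECHARGE", "    0.0000"),
  "     Flow Budget for Zone  1 at Time Step   1 of Stress Period   3",
  mk("RECHARGE", "   0.12898E+06"),
  mk("RECHARGE", "    0.0000"),
  "     Flow Budget for Zone  2 at Time Step   1 of Stress Period   3",
  mk("RECHARGE", "   0.274165E+06"),
  mk("RECHARGE", "    0.0000")
)

con <- textConnection(demo)
rech_demo <- g2(con, "Flow Budget for Zone  2", "  RECHARGE =", 37, 51)
close(con)
print(rech_demo)
```

Because the work is done with vectorized grep() and findInterval() calls on
lines already in memory, the cost should be roughly that of the single
readLines() call.  On the real file the call would be the same as before,
e.g. g2(txt, hdr_str, srch_str, 37, 51).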




On Wed, Aug 21, 2013 at 6:50 AM, Morway, Eric wrote:

>
> The output generated from a groundwater model post-processor contains
> millions of lines of text.  Using the custom R function shown below, I
> can quickly gather values from this file.
>
> As you can see in the textConnection provided below (which is only a
> small snippet from the file), the output is repetitive but does have some
> header lines I hope to make use of to narrow the collected output.  The
> header lines I'm speaking of are:
> 1) "     Flow Budget for Zone  1 at Time Step   1 of Stress Period   2"
> 2) "     Flow Budget for Zone  2 at Time Step   1 of Stress Period   2"
> 3) "     Flow Budget for Zone  3 at Time Step   1 of Stress Period   2"
> 4) "     Flow Budget for Zone  1 at Time Step   1 of Stress Period   3"
> ...
>
> and so on for 111 different "zones" as well as 575 distinct "stress
> periods".  In the custom function that follows, currently named "g", I can
> collect all values of "Recharge".  If instead I want to restrict the
> collected "Recharge" values to "Zone 2" for all 575 stress periods, is
> there a way to first look for the header "Flow Budget for Zone  2", collect
> only the next two values of Recharge, and then skip down to the next
> header containing "Zone 2", collect 2 more values of "Recharge", and on
> like this to the end?  'Peeling' out targeted flow budget terms will
> facilitate generation of budget-specific plots through time.
>
> The "edm" variable at the end of the R code that follows currently looks
> like this:
> edm
> # [1] 1.28980e+05 0.00000e+00 *2.74161e-01* 0.00000e+00 8.10840e+04 0.00000e+00
> # [7] 1.28980e+05 0.00000e+00 *2.74165e-01* 0.00000e+00 8.10840e+04 0.00000e+00
>
> but with the proposed revision, which only collects Recharge values from
> Zone 2, it would look like:
> edm
> # [1] *2.74161e-01* 0.00000e+00 *2.74165e-01* 0.00000e+00
>
>
> txt_con<-textConnection(" mark_zone
>
>
>      Flow Budget for Zone  1 at Time Step   1 of Stress Period   2
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =   0.37855E-02
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =   0.12898E+06
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  16 to   1 =    0.0000
>                    Zone  31 to   1 =    0.0000
>                    Zone  40 to   1 =    0.0000
>                    Zone  91 to   1 =    0.0000
>
>                           Total IN =   0.12898E+06
>
>              OUT:
>              ----
>                            STORAGE =   0.58275E-04
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   1 to  16 =    399.88
>                    Zone   1 to  31 =    85204.
>                    Zone   1 to  40 =    12404.
>                    Zone   1 to  91 =    30968.
>
>                          Total OUT =   0.12898E+06
>
>                           IN - OUT =   0.14138E-03
>
>                Percent Discrepancy =                0.00
> 1
>  mark_zone
>
>
>      Flow Budget for Zone  2 at Time Step   1 of Stress Period   2
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =   0.18833E-05
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =   0.274161E+06
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  15 to   2 =    0.0000
>                    Zone  31 to   2 =    0.0000
>                    Zone  91 to   2 =    13134.
>
>                           Total IN =   0.28729E+06
>
>              OUT:
>              ----
>                            STORAGE =   0.10823E-04
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   2 to  15 =    6812.7
>                    Zone   2 to  31 =   0.20820E+06
>                    Zone   2 to  91 =    72274.
>
>                          Total OUT =   0.28729E+06
>
>                           IN - OUT =   0.58504E-02
>
>                Percent Discrepancy =                0.00
> 1
>  mark_zone
>
>
>      Flow Budget for Zone  3 at Time Step   1 of Stress Period   2
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =   0.84894E-04
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    81084.
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  31 to   3 =    0.0000
>                    Zone  91 to   3 =    1234.9
>
>                           Total IN =    82319.
>
>              OUT:
>              ----
>                            STORAGE =    0.0000
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   3 to  31 =    53937.
>                    Zone   3 to  91 =    28382.
>
>                          Total OUT =    82319.
>
>                           IN - OUT =   0.81732E-03
>
>                Percent Discrepancy =                0.00
> 1
>  mark_zone
>
>
>      Flow Budget for Zone  1 at Time Step   1 of Stress Period   3
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =   0.15770E-04
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =   0.12898E+06
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  16 to   1 =    0.0000
>                    Zone  31 to   1 =    0.0000
>                    Zone  40 to   1 =    0.0000
>                    Zone  91 to   1 =    0.0000
>
>                           Total IN =   0.12898E+06
>
>              OUT:
>              ----
>                            STORAGE =   0.38262E-02
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   1 to  16 =    399.88
>                    Zone   1 to  31 =    85214.
>                    Zone   1 to  40 =    12405.
>                    Zone   1 to  91 =    30958.
>
>                          Total OUT =   0.12898E+06
>
>                           IN - OUT =   0.88928E-03
>
>                Percent Discrepancy =                0.00
> 1
>  mark_zone
>
>
>      Flow Budget for Zone  2 at Time Step   1 of Stress Period   3
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =    0.0000
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =   0.274165E+06
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  15 to   2 =    0.0000
>                    Zone  31 to   2 =    0.0000
>                    Zone  91 to   2 =    13215.
>
>                           Total IN =   0.28737E+06
>
>              OUT:
>              ----
>                            STORAGE =   0.27267E-02
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   2 to  15 =    6813.6
>                    Zone   2 to  31 =   0.20827E+06
>                    Zone   2 to  91 =    72291.
>
>                          Total OUT =   0.28737E+06
>
>                           IN - OUT =   0.69125E-03
>
>                Percent Discrepancy =                0.00
> 1
>  mark_zone
>
>
>      Flow Budget for Zone  3 at Time Step   1 of Stress Period   3
>      -------------------------------------------------------------
>
>                        Budget Term     Flow (L**3/T)
>                        -----------------------------
>
>              IN:
>              ---
>                            STORAGE =    0.0000
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    81084.
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone  31 to   3 =    0.0000
>                    Zone  91 to   3 =    1262.7
>
>                           Total IN =    82346.
>
>              OUT:
>              ----
>                            STORAGE =   0.18113E-03
>                      CONSTANT HEAD =    0.0000
>                           RECHARGE =    0.0000
>                     STREAM LEAKAGE =    0.0000
>                      LAKE  SEEPAGE =    0.0000
>                             UZF ET =    0.0000
>                              GW ET =    0.0000
>                       UZF INFILTR. =    0.0000
>                   SFR-DIV. INFLTR. =    0.0000
>                       UZF RECHARGE =    0.0000
>                    SURFACE LEAKAGE =    0.0000
>                    Zone   3 to  31 =    53843.
>                    Zone   3 to  91 =    28503.
>
>                          Total OUT =    82346.
>
>                           IN - OUT =  -0.14018E-02
>
>                Percent Discrepancy =                0.00
> ")
>
>
> g<-function(txt_con, string, from, to, ...) {
>         L <- readLines(txt_con)
>         matched <- grep(string, L, value = TRUE, ...)
>         as.numeric(substring(matched, from, to))
> }
>
> #Now, strip out values
> edm<-g(txt_con, "  RECHARGE =", 37, 50)
>
>
>

