[R] help reading a variably formatted text file

Michael Na Li lina at u.washington.edu
Tue Nov 19 21:41:34 CET 2002


On Tue, 19 Nov 2002, Corey Moffet stated:

>  Dear R-Help,
>  
>  I have a generated file that looks like the following:
>  ....
>  Is this a reasonable thing to do in R?  Are there some functions that will
>  make this task less difficult?  Is there a function that allows you to read
>  a small amount of information, parse it, test it, and then begin reading
>  again where it left off?
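
On the last question: yes.  You can open the file as a connection and read
it piece by piece; each call to readLines () continues from where the
previous one stopped.  A minimal sketch (the file name is just a
placeholder):

con <- file ("hillslope.dat", open = "r")
repeat {
    line <- readLines (con, n = 1)   ## read one more line
    if (length (line) == 0) break    ## end of file reached
    ## parse and test 'line' here, then loop to keep reading
}
close (con)

For your file, though, it is simple enough to read everything at once
and index into it.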

This function seems to work, on your sample file at least:

read.hill <- function (file)
{
    lines <- scan (file, what = "", sep = "\n", quiet = TRUE)
    ## Get the line starting with ' char'
    chars <- grep ("^ char", lines)
    ## Get the number of columns
    ncols <- unlist (get.numbers (lines[chars]))  ## get.numbers () returns a list
    ## Get the column labels (the 'ncols' lines below each header)
    labels <- lines[rep (chars, ncols) +
                    unlist (lapply (ncols, seq, from = 1))]
    ##
    days.col <- grep ("Days", labels)
    runoff.col <- grep ("Runoff", labels)
    ## Get the numbers 
    toSkip <- grep ("Daily values", lines) + 1
    toRead  <- grep ("Minimum/Maximum", lines) - 2 - toSkip
    temp <- unlist (strsplit (lines[(toSkip+1):(toSkip+toRead)],
                              split = " +"))
    ## There are some "" at the first column
    temp <- matrix (temp, ncol = length (labels) + 1, byrow = TRUE)
    data.frame (days = as.numeric (temp[, days.col + 1]),
                runoff = as.numeric (temp[, runoff.col + 1]))
}
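
To see what the label indexing does, suppose the ' char' header sits on
line 4 and announces 3 columns (numbers invented for illustration): the
labels then occupy the 3 lines directly below it.

chars <- 4
ncols <- 3
rep (chars, ncols) + unlist (lapply (ncols, seq, from = 1))
## [1] 5 6 7  -- the lines holding the three column labels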

get.numbers () is a function I wrote to extract the numbers from each element
of a character vector, optionally keeping only the elements that match a given
pattern.

get.numbers <- function (ss, pattern, ignore.case = FALSE) {
    if (!missing (pattern)) {
        ss <- grep (pattern, x = ss, ignore.case = ignore.case,
                    value = TRUE)
    }
    if (length (ss) == 0) {
        return (NULL)
    }
    ## split at non numeric, non-dot characters and two or more dots
    ## FIXME: this is not the optimal split
    token <- strsplit (ss, split = "([^-+.0-9]|--+|\\+\\++|\\.\\.+| \t)")
    ## remove any trailing '.'
    token <- lapply (token, function (x) sub ("\\.$", "", x))
    ## remove empty strings and convert to numeric
    token <- lapply (token, function (x) as.numeric (x[x != ""]))
    if (is.null (names (ss))) {
        names (token) <- ss
    } else {
        names (token) <- names (ss)
    }
    token
}
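
For example, on a made-up string (not from your file):

> get.numbers ("plane segments: 5, simulation years: 10")
$`plane segments: 5, simulation years: 10`
[1]  5 10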

As a test:

> read.hill ("hillslope.dat")
   days runoff
1     1      0
2     2      0
3     3      0
4     4      0
5     5      0
6     6      0
7     7      0
8     8      0
9     9      0
10   10      0
11   11      0
12   12      0
13   13      0
14   14      0
15   15      0
16   16      0
17   17      0
18   18      0
19   19      0
20   20      0
21   21      0
22   22      0
23   23      0
24   24      0
25   25      0
26   26      0
27   27      0
28   28      0
29   29      0
30   30      0
31   31      0

As Jason pointed out, Perl might be better suited to this job.  However, I do
like using R to parse all sorts of oddly formatted files: I find R scripts much
easier to maintain than Perl scripts, and it is often more convenient to read a
file directly into R.

It would be nice to have more powerful regular expression support in R, such
as a way to return the substrings matched by "()" capture groups.
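
In the meantime, a common workaround is sub () with backreferences, which
rewrites the whole string down to the captured part.  A small sketch (the
input is made up to resemble the ' char' header line):

s <- " char 27"
as.numeric (sub ("^ char +([0-9]+).*$", "\\1", s))   ## yields 27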

Michael



-- 
----------------------------------------------------------------------------
Michael Na Li                               
Email: lina at u.washington.edu
Department of Biostatistics, Box 357232
University of Washington, Seattle, WA 98195  
---------------------------------------------------------------------------
