[R] reading web log file into R

Jay Emerson jayemerson at gmail.com
Wed Sep 23 14:32:11 CEST 2009


Sebastian,

There is rarely a completely free lunch, but fortunately for us R has some
wonderful tools to make this possible.  R supports regular expressions with
commands like grep(), gsub(), strsplit(), and others documented on the help
pages.  It's just a matter of constructing an algorithm that does the job.  In
your case, for example (though please note there are probably many different,
completely reasonable approaches in R):

x <- scan("logfilename", what="", sep="\n")

should give you a vector of character strings, one line per element.  Now, lines
containing "GET" seem to identify interesting lines, so

x <- x[grep("GET", x)]

should trim it to only the interesting lines.  If you want information from
other lines, you'll have to treat them separately.  Next, you might try

y <- strsplit(x, "[[:space:]]+")

which splits each line on runs of whitespace (strsplit() has no default
separator, so the split pattern must be given), returning a list (one
component per line) of vectors based on the split.  Try it.  If it looks good,
you might check

lapply(y, length)

to see if all lines contain the same number of fields.  If so, you can then
get quickly into a matrix,

z <- matrix(unlist(strsplit(x, "[[:space:]]+")), ncol=K, byrow=TRUE)

where K is the common length you just observed.  If you think this is cool,
great!  If not, well... hire a programmer, or if you're lucky Microsoft or
Apache have tools to help you with this.  There might be something in the
Perl/Python world.  Or maybe there's a package in R designed just for this,
but I encourage students to develop the raw skills...
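
To see the steps above working end to end, here is a minimal sketch.  The
three log lines are made up for illustration and only loosely follow the
Apache common log layout; the names log_lines, hits, fields and requests, and
the column positions used at the end, are assumptions tied to this sample, not
something taken from your file.

```r
# Hypothetical sample lines (invented for illustration); a real log would be
# read from disk, e.g. with scan("logfilename", what="", sep="\n").
log_lines <- c(
  '10.0.0.1 - - [23/Sep/2009:14:00:01 +0200] "GET /index.html HTTP/1.1" 200 1043',
  '10.0.0.2 - - [23/Sep/2009:14:00:05 +0200] "POST /form HTTP/1.1" 302 0',
  '10.0.0.1 - - [23/Sep/2009:14:00:09 +0200] "GET /img/logo.png HTTP/1.1" 404 512'
)

# Keep only the lines that mention GET; fixed = TRUE treats "GET" literally.
hits <- log_lines[grep("GET", log_lines, fixed = TRUE)]

# Split each line on runs of whitespace, giving a list of character vectors.
fields <- strsplit(hits, "[[:space:]]+")

# Count the fields per line and stop unless every line has the same count.
n_fields <- sapply(fields, length)
stopifnot(length(unique(n_fields)) == 1)

# One row per log line, one column per field.
K <- n_fields[1]
z <- matrix(unlist(fields), ncol = K, byrow = TRUE)

# Column positions below are assumptions that hold for the sample lines only.
requests <- data.frame(
  host   = z[, 1],
  method = gsub('"', "", z[, 6], fixed = TRUE),  # drop the leading quote
  path   = z[, 7],
  status = as.integer(z[, 9]),
  stringsAsFactors = FALSE
)
print(requests)
```

On a real log the field counts may well differ from line to line (quoted
user-agent strings, for instance, contain spaces of their own), which is
exactly what the length check is meant to catch before you build the matrix.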

Jay



-- 
John W. Emerson (Jay)
Associate Professor of Statistics
Department of Statistics
Yale University
http://www.stat.yale.edu/~jay



