[R] Regex query (Apache logs)

Allan Engelhardt allane at cybaea.com
Thu Mar 17 08:12:32 CET 2011


Some comments:

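1. Without perl=TRUE, \s is not treated as whitespace inside a bracket 
expression, so [^\s] matches everything up to a literal 's'. Use 
perl=TRUE, or write [^[:space:]] instead.
2. The (.*) is greedy, so you'll need (.*?)"\s"(.*?)"\s"(.*?)"$ or 
similar at the end of the expression.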

With those changes (and removing a space inserted by the newsgroup 
posting) the expression works for me.
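
A minimal sketch of points 1 and 2, using short made-up strings (x and y are illustrative, not taken from the log file) rather than the full pattern. On the default engine the first result is expected to stop at the 's' in "addons", as described in point 1.

```r
# Point 1: \s inside a bracket expression.
x <- "addons.mozilla.org -"

# Default engine: \s inside [] is not the whitespace shorthand.
m_default <- regmatches(x, regexpr("^[^\\s]*", x))

# PCRE (perl = TRUE) understands \s inside a bracket expression.
m_pcre <- regmatches(x, regexpr("^[^\\s]*", x, perl = TRUE))

# The POSIX class [:space:] works with either engine.
m_posix <- regmatches(x, regexpr("^[^[:space:]]*", x))

# Point 2: greedy versus lazy matching of a quoted field.
y <- "\"-\" \"Mozilla/5.0\" \"cookie\""

# Greedy: runs from the first quote to the last quote in the string.
greedy <- regmatches(y, regexpr("\"(.*)\"", y, perl = TRUE))

# Lazy: stops at the first closing quote.
lazy <- regmatches(y, regexpr("\"(.*?)\"", y, perl = TRUE))

print(list(default = m_default, pcre = m_pcre, posix = m_posix,
           greedy = greedy, lazy = lazy))
```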

 > (pat <- readLines("/tmp/b.txt")[1])
[1] "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)\\s\\[([^\\]]+)\\]\\s\"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)\"\\s([^\\s]+)\\s(\\d+)\\s\"(.*?)\"\\s\"(.*?)\"\\s\"(.*?)\"$"
 > regexpr(pat, test, perl=TRUE)
[1] 1
attr(,"match.length")
[1] 436
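
regexpr() only reports where the match starts and how long it is. To get the individual fields, one option is regexec() with regmatches(). The sketch below assumes a reasonably recent R (regexec() needs to accept perl=TRUE) and uses a shortened, made-up log line (the name line and its contents are illustrative only) with the corrected pattern.

```r
# The corrected pattern, written as an R string literal.
pat <- "^(\\d{1,3}\\.\\d{1,3}\\.\\d{1,3}\\.\\d{1,3})\\s([^\\s]*)\\s([^\\s]*)\\s\\[([^\\]]+)\\]\\s\"([A-Z]*)\\s([^\\s]*)\\s([^\\s]*)\"\\s([^\\s]+)\\s(\\d+)\\s\"(.*?)\"\\s\"(.*?)\"\\s\"(.*?)\"$"

# Hypothetical, shortened log line in the same layout as the sample.
line <- "127.0.0.1 example.org - [10/Jan/2001:01:55:07 -0800] \"GET /index.html HTTP/1.1\" 200 3243 \"-\" \"Mozilla/5.0\" \"k=v\""

# regexec() returns the positions of the full match and of each group;
# regmatches() turns those positions into the matched substrings.
m <- regmatches(line, regexec(pat, line, perl = TRUE))[[1]]

# Drop the first element (the full match) to keep the 12 groups.
fields <- m[-1]
print(fields)
```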

3. Consider a different approach, e.g. scan(textConnection(test), 
what=character(0)), which splits the line on whitespace while keeping 
quoted fields together.
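
A rough sketch of the scan() approach, again on a shortened, made-up log line (line is illustrative only). The quote argument is restricted to the double quote here as a precaution against apostrophes in user-agent strings; that choice is an assumption, not part of the suggestion above. Note that the bracketed timestamp contains a space, so it comes back as two fields.

```r
# Hypothetical, shortened log line in the same layout as the sample.
line <- "127.0.0.1 example.org - [10/Jan/2001:01:55:07 -0800] \"GET /index.html HTTP/1.1\" 200 3243 \"-\" \"Mozilla/5.0\" \"k=v\""

# Split on whitespace; double-quoted strings are kept as single fields.
con <- textConnection(line)
fields <- scan(con, what = character(), quote = "\"", quiet = TRUE)
close(con)

# The request is one quoted field; split it further if needed.
request <- strsplit(fields[6], " ", fixed = TRUE)[[1]]

print(fields)
print(request)
```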

Hope this helps

Allan


On 16/03/11 22:18, Saptarshi Guha wrote:
> Hello R users,
>
> I have this regex (see [1]) for Apache log lines. I tried using R to parse
> some data (only because I wanted to stay in R).
> A sample line is in [2].
>
> (a) I saved the line in [1] into "~/tmp/a.txt" and [2] into "/tmp/a.txt"
>
> pat<- readLines("~/tmp/a.txt")
> test<- readLines("/tmp/a.txt")
> test
> grep(pat,test)
>
> returns integer(0)
>
> The same query works in Python via re.match(....) (i.e. it does return groups).
>
> Using readLines, the regex is escaped for me. Do Python and R use
> different regex styles?
>
> Cheers
> Saptarshi
>
> [1]
>   ^(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3})\s([^\s]*)\s([^\s]*)\s\[([^\]]+)\]\s"([A-Z]*)\s([^\s]*)\s([^\s]*)"\s([^\s]+)\s(\d+)\s"(.*)"\s"(.*)"\s"(.*)"$
>
> [2]
> 220.213.119.925 addons.mozilla.org - [10/Jan/2001:01:55:07 -0800] "GET
> /blocklist/3/%8ce33983c0-fd0e-11dc-12aa-0800200c9a66%7D/4.0b5/Fennec/20110217140304/Android_arm-eabi-gcc3/chrome:%2F%2Fglobal%2Flocale%2Fintl.properties/beta/Linux%
> 202.6.32.9/default/default/6/6/1/ HTTP/1.1" 200 3243 "-" "Mozilla/5.0
> (Android; Linux armv7l; rv:2.0b12pre) Gecko/20110217 Firefox/4.0b12pre
> Fennec/4.0b5" "BLOCKLIST_v3=110.163.217.169.1299218425.9706"
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.


