[R] Search within a file

Tuszynski, Jaroslaw W. JAROSLAW.W.TUSZYNSKI at saic.com
Fri Nov 4 16:53:40 CET 2005


Thanks for the great suggestion. I guess the code you suggested would look
something like this:

fregexpr = function(pattern, filename) 
{ # same as gregexpr but operating on files not strings
  # Only single string 'pattern's allowed 
  # A match that is split across two buffers is not found
	buf.size=1024
	n  = file.info(filename)$size
	pos = NULL
	fp = file(filename, "rb")
	for (d in seq(1,n,by=buf.size)) {
		m = min(buf.size, n-d+1) # bytes left from offset d, including the last one
		p = gregexpr(pattern, readChar(fp, m))[[1]]
		if(p[1]>0) pos=c(pos, p+d-1)
	}
	close(fp)
	if (is.null(pos)) pos=-1
	return (pos)
}


> fname = file.path(R.home(),"COPYING")
> fregexpr("right", fname)
 [1]    73  1347  1422  1460  1727  1879  1908  1939  3106  3350  4240  5530
[13]  6637  6661  6740  9460  9534 10503 11756 12528 12566 13805 15907 16056
[25] 17053 17681 17813
> gregexpr("right", readChar(fname,file.info(fname)$size))[[1]]
 [1]    73  1347  1422  1460  1727  1879  1908  1939  3106  3350  4240  5530
[13]  6637  6661  6740  9460  9534 10503 11756 12528 12566 13805 15907 16056
[25] 17053 17681 17813
attr(,"match.length")
 [1] 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5 5


The function above does what I need. If someone needs a function that
fully parallels gregexpr but operates on files rather than strings, then most
of the work would be in modifying the line "if(p[1]>0) pos=c(pos, p+d-1)" to do
concatenation and addition on lists.
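
For illustration, here is a minimal sketch of what such an extension might
look like. It is not from the thread: the name fregexpr2, the buf.size
argument and the overlap handling are all assumptions. It treats 'pattern' as
a literal string (fixed=TRUE), assumes single-byte text, carries the last
nchar(pattern)-1 bytes of each buffer into the next one so that a match split
across two buffers is still seen, and returns a "match.length" attribute the
way gregexpr does.

```r
# Hypothetical helper, not part of base R. Assumes a literal 'pattern'
# and single-byte (e.g. ASCII) text without embedded nul bytes.
fregexpr2 = function(pattern, filename, buf.size = 1024)
{
	overlap = nchar(pattern, type = "bytes") - 1  # longest partial match a buffer can end with
	n    = file.info(filename)$size
	pos  = integer(0)
	len  = integer(0)
	tail = ""                        # last 'overlap' bytes of the text seen so far
	done = 0                         # bytes read from the file so far
	fp   = file(filename, "rb")
	on.exit(close(fp))
	while (done < n) {
		m     = min(buf.size, n - done)
		chunk = paste0(tail, readChar(fp, m, useBytes = TRUE))
		shift = done - nchar(tail, type = "bytes")  # file offset just before chunk's first byte
		p     = gregexpr(pattern, chunk, fixed = TRUE, useBytes = TRUE)[[1]]
		if (p[1] > 0) {
			pos = c(pos, as.integer(p) + shift)
			len = c(len, attr(p, "match.length"))
		}
		done = done + m
		k    = nchar(chunk, type = "bytes")
		tail = substr(chunk, k - overlap + 1, k)    # "" when overlap is 0
	}
	if (length(pos) == 0) { pos = -1L; len = -1L }
	attr(pos, "match.length") = len
	pos
}
```

Because the carried tail is one byte shorter than the pattern, a complete
match can never sit entirely inside it, so nothing is reported twice. One
caveat: a pattern that can overlap itself (such as "aa") may produce extra
hits at buffer boundaries compared with gregexpr on the whole string. A
general regular expression would also need a different choice of overlap,
since its longest possible match is not known in advance.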

Thanks 

Jarek Tuszynski


-----Original Message-----
From: r-help-bounces at stat.math.ethz.ch
[mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of Seth Falcon
Sent: Friday, November 04, 2005 1:44 AM
To: r-help at stat.math.ethz.ch
Subject: Re: [R] Search within a file


On  3 Nov 2005, JAROSLAW.W.TUSZYNSKI at saic.com wrote:
> I am looking for a way to search a file for position of some 
> expression, from within R. My current code:
>
> sha1Pos = gregexpr("<sha1>", readChar(filename, 
> file.info(filename)$size))[[1]]
>
> Works fine for small files, but text files I will be working with 
> might get up to Gb range, so I was trying to accomplish the same 
> without loading the whole file into R.

I would think you could use readLines to read in a batch of lines, run
(g)regexpr, and keep track of matches and position.

Create a connection to the file using file() first, and then subsequent
calls to readLines will start where you left off.

But you will need to adjust the position indices returned by gregexpr by how
far into the file you are.  Seems very doable.

+ seth
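
A minimal sketch of the readLines approach described above, offered as an
illustration rather than as code from the thread. The name lregexpr and the
arguments batch and eol.size are assumptions. It assumes single-byte text, a
line terminator of eol.size bytes (1 for "\n", 2 for "\r\n"), and a pattern
that never spans a line break.

```r
# Hypothetical helper, not part of base R: batches of lines are read from
# an open connection, and each match position is shifted by the number of
# bytes that precede its line in the file.
lregexpr = function(pattern, filename, batch = 1000, eol.size = 1)
{
	pos    = NULL
	offset = 0                       # bytes consumed before the current batch
	fp     = file(filename, "r")
	on.exit(close(fp))
	repeat {
		lines = readLines(fp, n = batch, warn = FALSE)
		if (length(lines) == 0) break
		width = nchar(lines, type = "bytes") + eol.size
		# number of bytes in the file before each line of this batch
		start = offset + cumsum(c(0, width[-length(width)]))
		m     = gregexpr(pattern, lines, useBytes = TRUE)
		for (i in seq_along(m))
			if (m[[i]][1] > 0) pos = c(pos, as.integer(m[[i]]) + start[i])
		offset = offset + sum(width)
	}
	if (is.null(pos)) pos = -1
	pos
}
```

Since readLines strips the line terminators, the reported positions are only
as good as the eol.size assumption. Reading fixed-size blocks with readChar
avoids that guess, at the cost of having to handle matches that fall across
block boundaries.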

______________________________________________
R-help at stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide!
http://www.R-project.org/posting-guide.html



