[R] R tools for large files

Wed Aug 27 13:21:14 CEST 2003

On Wed, 27 Aug 2003 13:03:39 +1200 (NZST), you wrote:

>For real efficiency here, what's wanted is a variant of readLines
>where n is an index vector (a vector of non-negative integers,
>a vector of non-positive integers, or a vector of logicals) saying
>which lines should be kept.

I think that's too esoteric to be worth doing.  Most often in cases
where you aren't reading every line, you don't know which lines to
read until you've read earlier ones.
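
For the cases where the indices are known in advance, something close
can be done today in plain R. A rough sketch, assuming positive 1-based
line numbers; readLinesAt() and its chunk argument are hypothetical
names, not part of R:

```r
# Hypothetical helper (not part of R): keep only the lines whose
# 1-based positions appear in `keep`, reading `chunk` lines at a time.
# Assumes `keep` contains positive line numbers only.
readLinesAt <- function(con, keep, chunk = 1000L) {
  keep <- sort(unique(as.integer(keep)))
  out <- character(0)
  offset <- 0L                        # lines consumed so far
  while (length(keep) > 0L) {
    block <- readLines(con, n = chunk)
    if (length(block) == 0L) break    # end of input
    last <- offset + length(block)
    hit <- keep[keep <= last]         # wanted lines inside this block
    out <- c(out, block[hit - offset])
    keep <- keep[keep > last]
    offset <- last
  }
  out
}

con <- textConnection(sprintf("line %d", 1:10))
picked <- readLinesAt(con, c(2, 5, 9), chunk = 4L)
close(con)
```

This still reads every line up to the last one requested, so it saves
memory rather than I/O.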

>There are two fairly clear sources of overhead in the R code:
>(1) the overhead of reading characters one at a time through Rconn_fgetc()
>    instead of a block or line at a time.  mawk doesn't use fgets() for
>    reading, and _does_ have the overhead of repeatedly checking a
>    regular expression to determine where the end of the line is,
>    which it is sensible enough to fast-path.

One complication with reading a block at a time is what to do when you
read too far.  Not all connections can use seek() to back up to where
the line ended, so for those you'd still need to read one character at
a time (or attach a buffer somehow, but then what about rw
connections?).
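
At the R level, pushBack() already gives a line-oriented version of
"attach a buffer" for readable text-mode connections. A small sketch of
reading too far and handing the excess back; the data and the "END"
marker are made up for illustration, and this says nothing about the
C-level Rconn_fgetc() path:

```r
con <- textConnection(c("header", "a 1", "b 2", "END", "next section"))

block <- readLines(con, n = 5)       # deliberately reads too far
stop_at <- match("END", block)       # position of the terminator line
wanted <- block[seq_len(stop_at)]    # everything up to and including END
extra  <- block[-seq_len(stop_at)]   # lines read past the terminator

pushBack(extra, con)                 # return the excess to the connection
rest <- readLines(con)               # the next reader sees it again
close(con)
```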

>The simplest thing that could possibly work would be to add a function
>skipLines(con, n) which simply read and discarded n lines.
>
>	 result <- scan(textConnection(lines), list( .... ))

That's probably worth doing.
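
In the meantime, an R-level stand-in is easy to write. A minimal
sketch, assuming an already-open connection; this skipLines() and its
chunk argument are only an illustration of the idea, not an existing
function:

```r
# Hypothetical skipLines(): read and discard n lines from an open
# connection, in chunks so that the discarded text stays small.
skipLines <- function(con, n, chunk = 1000L) {
  skipped <- 0L
  while (skipped < n) {
    got <- length(readLines(con, n = min(chunk, n - skipped)))
    if (got == 0L) break             # ran out of input early
    skipped <- skipped + got
  }
  invisible(skipped)                 # number of lines actually skipped
}

con <- textConnection(c("# comment", "# comment", "1 a", "2 b"))
skipLines(con, 2)
result <- scan(con, list(id = 0, tag = ""), quiet = TRUE)
close(con)
```

It still pays the cost of building each line as an R string before
throwing it away, which a version in C would avoid.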

Duncan Murdoch