[R] R tools for large files

Richard A. O'Keefe ok at cs.otago.ac.nz
Thu Aug 28 02:35:33 CEST 2003


Duncan Murdoch <dmurdoch at pair.com> wrote:
	One complication with reading a block at a time is what to do when you
	read too far.

It's called "buffering".

	Not all connections can use seek() to reposition to the
	beginning, so you'd need to read them one character at a time, (or
	attach a buffer somehow, but then what about rw connections?)
	
You don't need seek() to do buffered block-at-a-time reading.
For example, you can't lseek() on a UNIX terminal, but UNIX C stdio
*does* read a block at a time from a terminal.
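
In R terms, that style of reading is easy to sketch: pull a block of
raw bytes at a time with readBin(), split off the complete lines, and
carry the leftover bytes into the next block.  Nothing here ever calls
seek().  (The function name and block size below are my own
illustration, not anything in R's API.)

    read.lines.buffered <- function(con, block = 65536L) {
        buf <- raw(0)
        lines <- character(0)
        repeat {
            chunk <- readBin(con, what = "raw", n = block)
            if (length(chunk) == 0L) break          # end of input
            buf <- c(buf, chunk)
            nl <- which(buf == as.raw(10L))         # newline positions
            if (length(nl) > 0L) {
                last <- nl[length(nl)]
                done <- rawToChar(buf[seq_len(last)])
                lines <- c(lines, strsplit(done, "\n", fixed = TRUE)[[1]])
                buf <- buf[-seq_len(last)]          # keep the partial line
            }
        }
        if (length(buf) > 0L) lines <- c(lines, rawToChar(buf))
        lines
    }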

I don't see what the problem with read-write connections is supposed
to be.  When you want to read from such a connection, you first force
out any buffered output, and then you read a buffer's worth of input
(if any is available).  Of course the read buffer and the write buffer
are separate.  (C stdio has traditionally got this wrong, with the
perverse consequence that you have to fseek() when switching from
reading to writing or vice versa, but that doesn't mean it can't be
got right.)
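
R's own connections let you follow that discipline by hand: flush()
the write side before you touch the read side.  A small sketch on a
read-write file connection (the file name is made up for
illustration):

    con <- file("scratch.bin", open = "w+b")  # read-write binary connection
    writeBin(as.double(1:10), con)            # output sits in the buffer
    flush(con)                                # force it out before reading
    seek(con, 0, rw = "read")                 # put the read side at the start
    readBin(con, what = "double", n = 10)     # read back what was written
    close(con)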


To put all this in context, though, remember that S was designed in a
UNIX environment to work in a UNIX environment, and it was always
intended to exploit UNIX tools.  Even on a Windows box, if you get R,
you get a bunch of the usual UNIX tools with it.  Amongst other
things, Perl is freely available for Windows; a Perl program to read a
couple of hundred thousand records and spit them out in platform
binary would be only a few lines long, and R _is_ pretty good at
reading binary data.  It really is important that R users be allowed
to use the language the way it was designed to be used.
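
To make that concrete: suppose the (hypothetical) Perl step has
already dumped 200,000 records of 5 fields each as native doubles into
a file called big.bin; the name and dimensions are illustrative only.
The R side is then a single readBin() call:

    con <- file("big.bin", open = "rb")
    x <- readBin(con, what = "double", n = 200000 * 5, size = 8)
    close(con)
    m <- matrix(x, ncol = 5, byrow = TRUE)    # one row per record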



