[R] R tools for large files
maj at stats.waikato.ac.nz
Tue Aug 26 10:45:28 CEST 2003
I don't know much about the concept of "connection" but I had supposed
it to at least include the concept of "file" and perhaps also "input
device" and "output device". I guess the important point that you are
making is that it is sequential in the sense that you describe. I
suppose at the time that I wrote my emails I didn't *know* that this was
the case but rather assumed that this must be so, since it would be
tedious in the extreme to have to work with the access functions if they
kept going back to the beginning of the connection.
It may help to explain the application. The large files that I am
working with are themselves statistical summaries of internet traffic
flows (you will appreciate why they can be almost arbitrarily large!). I
am interested in clustering these flows into different classes of
traffic. I am using a model-based approach, so that the end-point will
be statistical models for each cluster. Once these have been estimated
they may be used in the classification of future traffic [including a
residual class of traffic that does not fit any cluster well].
Based on experience with my clustering software (Multimix) I believe
that it should work well on data sets of, say, 3000 observations. I plan
to select a small number of random subsets of this size. The replication
of these subsets should help me with model selection questions (How many
clusters? How complex should each cluster model be?)
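As a rough illustration of the subsetting plan, here is a minimal sketch in R. It assumes one observation per line and no header line; the helper names count_lines and draw_subsets, the chunk size, and the toy file are all hypothetical and only there for illustration. Each subset is drawn without replacement, but different subsets are drawn independently and so may overlap.

```r
# Hypothetical helper: count the lines of a file by streaming through an
# open connection in chunks, so the whole file is never held in memory.
count_lines <- function(path, chunk = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    got <- length(readLines(con, n = chunk))
    if (got == 0L) break        # nothing left to read
    n <- n + got
  }
  n
}

# Hypothetical helper: draw `n_subsets` independent random subsets of
# `size` line numbers each (sorted, without replacement within a subset).
draw_subsets <- function(n_lines, size = 3000L, n_subsets = 5L) {
  stopifnot(size <= n_lines)
  replicate(n_subsets, sort(sample.int(n_lines, size)), simplify = FALSE)
}

# Toy usage on a small temporary file (assumed data, 50 lines).
tmp <- tempfile()
writeLines(sprintf("flow %d", 1:50), tmp)
n <- count_lines(tmp, chunk = 8L)
subsets <- draw_subsets(n, size = 10L, n_subsets = 3L)
unlink(tmp)
```

Sorting each subset is a deliberate choice: sorted line numbers can then be picked up in a single forward pass over the file.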
Tom Mulholland makes a good point when he notes that many R users (and
other users) have very little control over their computing environment
owing to somewhat arbitrary IT management decisions. For this reason it
will be advantageous to have several solutions to large file problems.
I'm pleased that you think that efficient R functions for manipulating
numbered lines from files may be written. I'm going to have a go at it
just as soon as I finish a big item of paperwork!
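For concreteness, a function of this kind might look roughly like the sketch below. It is only one possible approach, not a tested or definitive implementation: the name read_lines_at, the chunk argument, and the toy file are assumptions made for illustration. It relies on the fact that readLines() on an already opened connection continues from where the previous call stopped.

```r
# Hypothetical helper: return the lines of `path` at positions `line_nos`,
# reading the file once, sequentially, in chunks of `chunk` lines.
# The result is ordered by line number and named by line number.
read_lines_at <- function(path, line_nos, chunk = 10000L) {
  wanted <- sort(unique(as.integer(line_nos)))
  stopifnot(length(wanted) > 0L, wanted[1L] >= 1L)
  con <- file(path, open = "r")
  on.exit(close(con))
  found <- character(length(wanted))
  got <- logical(length(wanted))
  offset <- 0L                         # lines consumed so far
  last <- wanted[length(wanted)]       # no need to read past this line
  while (offset < last) {
    block <- readLines(con, n = min(chunk, last - offset))
    if (length(block) == 0L) break     # reached end of file early
    idx <- which(wanted > offset & wanted <= offset + length(block))
    found[idx] <- block[wanted[idx] - offset]
    got[idx] <- TRUE
    offset <- offset + length(block)
  }
  if (!all(got)) warning("some requested lines lie beyond the end of the file")
  names(found) <- wanted
  found[got]
}

# Toy usage on a small temporary file (assumed data, 100 lines).
tmp <- tempfile()
writeLines(sprintf("line %d", 1:100), tmp)
res <- read_lines_at(tmp, c(42, 7, 100), chunk = 10L)
unlink(tmp)
```

Only one chunk is held in memory at a time, and reading stops at the largest requested line number, so the cost depends on how far into the file the last wanted line sits rather than on the total file size.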
BTW, I will be out of town and with much reduced email access over the
next week or so, so if I don't reply to the list or individuals this
should not be put down to laziness or rudeness!
PS Give my regards to Chris Hennig.
Martin Maechler wrote:
> Hi Murray,
> from reading your summarizing reply, I wonder if you missed the
> most important point about "connection"s (connection := generalization of file):
> Once you open() one, you can read it **sequentially**, e.g., in
> bunches of a "few" lines, i.e., you don't re-start from the
> beginning each time.
> I think this will allow to devise a pretty efficient R function
> for reading (and returning as a vector of strings) line numbers
> (n1, n2,..., nm).
> Did you know this? If not, maybe you forward this answer (and
> your reaction to it) to R-help as well.
> Martin Maechler <maechler at stat.math.ethz.ch> http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum LEO C16 Leonhardstr. 27
> ETH (Federal Inst. Technology) 8092 Zurich SWITZERLAND
> phone: x-41-1-632-3408 fax: ...-1228 <><
Dr Murray Jorgensen http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz Fax 7 838 4155
Phone +64 7 838 4773 wk +64 7 849 6486 home Mobile 021 1395 862