[R] R tools for large files

Tue Aug 26 10:45:28 CEST 2003

Hi Martin,

I don't know much about the concept of "connection" but I had supposed 
it to at least include the concept of "file" and perhaps also "input 
device" and "output device". I guess the important point that you are 
making is that it is sequential in the sense that you describe. I 
suppose at the time that I wrote my emails I didn't *know* that this was 
the case but rather assumed that this must be so, since it would be 
tedious in the extreme to have to work with the access functions if they 
kept going back to the beginning of the connection.
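
To make the sequential behaviour concrete, here is a minimal sketch. It assumes a small temporary file created only for the illustration (the name tmp and its contents are not from my data), and it relies on file() and readLines() from base R. Once the connection has been opened explicitly, the second call to readLines() carries on from where the first one stopped.

```r
# Write a small throwaway file so the example is self-contained.
tmp <- tempfile()
writeLines(sprintf("line %d", 1:10), tmp)

con <- file(tmp, open = "r")      # open the connection once
first  <- readLines(con, n = 3)   # reads lines 1 to 3
second <- readLines(con, n = 2)   # reads lines 4 to 5, not 1 to 2
close(con)
unlink(tmp)

print(first)
print(second)
```

If the file name were passed to readLines() directly instead of an opened connection, each call would open the file afresh and start at line 1 again.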

It may help to explain the application. The large files that I am 
working with are themselves statistical summaries of internet traffic 
flows (you will appreciate why they can be almost arbitrarily large!). I 
am interested in clustering these flows into different classes of 
traffic. I am using a model-based approach, so that the end-point will 
be statistical models for each cluster. Once these have been estimated 
they may be used in the classification of future traffic [including a 
residual class of traffic that does not fit any cluster well].

Based on experience with my clustering software (Multimix) I believe 
that it should work well on data sets of, say, 3000 observations. I plan 
to select a small number of random subsets of this size. The replication 
of these subsets should help me with model selection questions (How many 
clusters? How complex should each cluster model be?).
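
As a rough sketch of the subset selection, assuming the total number of records in the file is already known: the values of n_total and n_subsets and the seed below are made up for illustration, and only the subset size of 3000 comes from the plan above. Each subset is a sorted vector of line numbers, so that the lines can later be picked up in a single forward pass through the file.

```r
set.seed(2003)           # arbitrary seed, only for reproducibility
n_total     <- 100000L   # hypothetical number of records in the file
subset_size <- 3000L     # subset size mentioned above
n_subsets   <- 5L        # hypothetical number of replicate subsets

# Each element is a sorted vector of distinct line numbers in 1..n_total.
subsets <- replicate(n_subsets,
                     sort(sample.int(n_total, subset_size)),
                     simplify = FALSE)

str(subsets[[1]])
```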

Tom Mulholland makes a good point when he notes that many R users (and 
other users) have very little control over their computing environment 
owing to somewhat arbitrary IT management decisions. For this reason it 
will be advantageous to have several solutions to large file problems.

I'm pleased that you think that efficient R functions for manipulating 
numbered lines from files may be written. I'm going to have a go at it 
just as soon as I finish a big item of paperwork!
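
A first attempt might look something like the sketch below. This is only my guess at the idea, not a tested or tuned implementation: the function name read_lines_at and the chunk argument are invented for the illustration. It opens the connection once, reads forward in blocks, and keeps only the requested lines, so no part of the file is read twice.

```r
# Hypothetical helper: return the lines of `path` at the given line numbers.
# Lines come back in increasing line-number order; duplicates are dropped.
read_lines_at <- function(path, line_nums, chunk = 1000L) {
  wanted <- sort(unique(as.integer(line_nums)))
  out    <- character(length(wanted))
  found  <- 0L   # how many requested lines have been collected
  offset <- 0L   # how many lines of the file have been consumed
  con <- file(path, open = "r")
  on.exit(close(con))
  while (found < length(wanted)) {
    block <- readLines(con, n = chunk)   # continues from the current position
    if (length(block) == 0L) break       # end of file reached
    # Requested line numbers that fall inside the current block.
    idx <- wanted[wanted > offset & wanted <= offset + length(block)]
    if (length(idx) > 0L) {
      out[found + seq_along(idx)] <- block[idx - offset]
      found <- found + length(idx)
    }
    offset <- offset + length(block)
  }
  out[seq_len(found)]   # requests beyond the end of the file are omitted
}

# Small self-contained check on a throwaway file.
tmp <- tempfile()
writeLines(sprintf("record %d", 1:5000), tmp)
picked <- read_lines_at(tmp, c(4200, 7, 1001))
unlink(tmp)
print(picked)
```

Reading stops as soon as the last requested line has been found, so the cost depends on the largest line number asked for rather than on the size of the file.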

BTW, I will be out of town and with much reduced email access over the 
next week or so, so if I don't reply to the list or individuals this 
should not be put down to laziness or rudeness!

Cheers,

Murray Jorgensen

PS  Give my regards to Chris Hennig.

Martin Maechler wrote:
> Hi Murray,
> 
> from reading your summarizing reply, I wonder if you missed the
> most important point about "connection"s (connection := generalization of file):
> 
> Once you open() one, you can read it **sequentially**, e.g., in
> bunches of a "few" lines, i.e., you don't re-start from the
> beginning each time.
> I think this will allow to devise a pretty efficient R function
> for reading (and returning as a vector of strings) line numbers
> (n1, n2,..., nm).   
> 
> Did you know this?  If not, maybe you forward this answer (and
> your reaction to it) to R-help as well.
> 
> Regards,
> Martin Maechler <maechler at stat.math.ethz.ch>	http://stat.ethz.ch/~maechler/
> Seminar fuer Statistik, ETH-Zentrum  LEO C16	Leonhardstr. 27
> ETH (Federal Inst. Technology)	8092 Zurich	SWITZERLAND
> phone: x-41-1-632-3408		fax: ...-1228			<><
> 
> 

-- 
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862