[R] the large dataset problem

(Ted Harding) ted.harding at nessie.mcc.ac.uk
Mon Jul 30 19:24:45 CEST 2007


On 30-Jul-07 11:40:47, Eric Doviak wrote:
> [...]

Sympathies for the constraints you are operating in!

> The "Introduction to R" manual suggests modifying input files with
> Perl. Any tips on how to get started? Would Perl Data Language (PDL) be
> a good choice?  http://pdl.perl.org/index_en.html

I've not used SIPP files, but it seems that they are available in
"delimited" format, including CSV.

For extracting a subset of fields (especially when large datasets may
stretch RAM resources) I would use awk rather than perl, since it
is a much lighter program, transparent to code for, efficient, and
it will do that job.

On a Linux/Unix system (see below), say I wanted to extract fields
1, 1000, 1275, .... , 5678 from a CSV file. Then the 'awk' line
that would do it would look like

awk '
 BEGIN{FS=","} {print $(1) "," $(1000) "," $(1275) "," ... "," $(5678)}
' < sippfile.csv > newdata.csv

Awk reads one line at a time, and does with it what you tell it to do.
It will not be overcome by a file with an enormous number of lines.
Perl would be similar. So long as one line fits comfortably into RAM,
you would not be limited by file size (unless you're running out
of disk space), and operation will be quick, even for very long
lines. (As an experiment, I just set up a file with 10,000 fields
and 35 lines; awk output 6 selected fields from all 35 lines in
about 1 second, on the 366MHz 128MB RAM machine I'm on at the
moment. After transferring it to a 733MHz 512MB RAM machine, it was
too quick to estimate; so I duplicated the lines to get a 363-line
file, and got those same fields out in a bit less than 1 second.
That's over 300 lines a second, i.e. around 20,000 lines a minute,
so a million lines in under an hour; and all on rather puny hardware.)

In practice, you might want to write a separate script which would
automatically create the necessary awk script (say, if you supply
the field names, having already coded the field positions corresponding
to the field names). You could exploit R's system() command to run the
scripts from within R, and then load in the filtered data.
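
For example, here is a minimal sketch of the kind of thing I have in
mind (untested; the field positions and the file names 'sippfile.csv'
and 'newdata.csv' are made up for illustration):

  ## Positions of the wanted fields in the CSV (made-up values)
  fields <- c(1, 1000, 1275, 5678)

  ## Build the awk program: comma-separated input and output,
  ## printing only the selected fields
  prog <- paste("BEGIN{FS=\",\"; OFS=\",\"} {print ",
                paste("$", fields, sep = "", collapse = ", "),
                "}", sep = "")

  ## Run awk over the big file, writing the slimmed-down CSV
  system(paste("awk '", prog,
               "' < sippfile.csv > newdata.csv", sep = ""))

  ## Load the (much smaller) filtered file into R;
  ## use header = TRUE if the first line holds field names
  newdata <- read.csv("newdata.csv", header = FALSE)

If you also keep a small lookup table of field names against field
positions, the same few lines would let you select fields by name
rather than by number.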

> I wrote a script which loads large datasets a few lines at a time,
> writes the dozen or so variables of interest to a CSV file, removes
> the loaded data and then (via a "for" loop) loads the next few lines
> .... I managed to get it to work with one of the SIPP core files,
> but it's SLOOOOW.

See above ...

> Worse, if I discover later that I omitted a relevant variable,
> then I'll have to run the whole script all over again.

If the script worked quickly (as with awk), presumably you
wouldn't mind so much?
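
For what it's worth, if you do stay within R for this, keeping a
file connection open across the loop means each pass continues where
the previous one stopped, rather than re-reading the file from the
start (as happens if each pass re-opens the file with skip=). A
rough, untested sketch, with a made-up file name and column numbers:

  con  <- file("sippfile.csv", open = "r")
  keep <- c(1, 1000, 1275, 5678)       # columns of interest (made up)
  out  <- list()
  repeat {
    ## read the next block of lines; NULL signals end of file
    chunk <- tryCatch(read.csv(con, header = FALSE, nrows = 10000),
                      error = function(e) NULL)
    if (is.null(chunk)) break
    out[[length(out) + 1]] <- chunk[, keep]
    if (nrow(chunk) < 10000) break     # last, partial block
  }
  close(con)
  newdata <- do.call(rbind, out)

(You may need colClasses= in read.csv so that the chunks rbind
cleanly.) But the awk route above should still be a good deal
simpler and faster.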

Regarding Linux/Unix versus Windows: it is general experience
that Linux/Unix works faster, more cleanly and efficiently, and
often more reliably, for similar tasks; and can do so on low-grade
hardware. Also, these systems come with dozens of file-processing
utilities (including perl and awk, and many others), each of which
has been written to be efficient at precisely the repertoire of
tasks it was designed for. A lot of Windows software carries a huge
overhead of either cosmetic dross, or a pantechnicon of functionality
of which you are only going to need 0.01% at any one time.

The Unix utilities have been ported to Windows, long since, but
I have no experience of using them in that environment. Others,
who have, can advise! But I'd seriously suggest getting hold of them.

Hoping this helps,
Ted.

--------------------------------------------------------------------
E-Mail: (Ted Harding) <ted.harding at nessie.mcc.ac.uk>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Jul-07                                       Time: 18:24:41
------------------------------ XFMail ------------------------------


