[R] R tools for large files

Tue Aug 26 05:55:48 CEST 2003

Some of the conversation has treated the 30 second mark as an arbitrary
benchmark. I would add that there is also an assumption that any non-R
issues that affect how usefully R can be used should be ignored. In the
real world we can't always control everything about our environment. So
if there are improvements that help mitigate that reality, I would
welcome them.

As a little test I broke the rules of my organisation and actually put a
dataset on my C: drive. Not unexpectedly, the performance vastly
improved. What would in a normal (at home) setting be a 10 second load
becomes a 40 second load in a corporate environment. I have found the
conversation helpful and it would appear that there are opportunities
for improvement that I would find helpful in my production environment.
The other aside is that I have no UNIX-like tools, not because they
don't exist, but because the environment I work in does not allow me to
use them. This is not sufficient reason for me to bleat about it. It
just is. By and large, I just get on with it. My point is that while I
accept that these issues are peripheral to R, they do impact upon the
usability of R.

I'm sure that there are people working with large databases in R. (The
SPSS datasets that I regularly interact with vary between 97MB and
200MB.) It could be finger trouble on my part, but I find I have to
subset them before I can read them into R. If I thought I could usefully
convert these datasets into something that R could pick and choose from
without reaching the out of memory problem, I would be very happy. In
the meantime my lack of expertise has left me with a workable albeit
clumsy process.
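
One way to let R "pick and choose" is to skip unwanted columns at read
time. The following is only a minimal sketch, and it assumes the data
have first been exported from SPSS to a delimited text file with a
header row (that export step, the temporary file and the column names
id, region and cost are all hypothetical stand-ins). In current versions
of R, a colClasses entry of "NULL" makes read.csv skip that column, so
it never occupies memory in the resulting data frame.

```r
# Hypothetical stand-in for a delimited export of a larger dataset.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:5,
                     region = letters[1:5],
                     cost = c(1.5, 2, 3.25, 4, 5)),
          tmp, row.names = FALSE)

# Read a single row just to learn the column names.
header <- names(read.csv(tmp, nrows = 1))

# "NULL" skips a column; NA leaves the type to be guessed by R.
keep <- c("id", "cost")
classes <- rep("NULL", length(header))
classes[header %in% keep] <- NA

small <- read.csv(tmp, colClasses = classes)
names(small)
```

This only reduces the number of columns held in memory; selecting a
subset of rows needs a different approach.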

I will continue to champion R in my organisation, but the present score
is SPSS-50, SAS-149, R-1. But all the really creative charts only come
from one engine in this place.

> system.time(load("P:/.../0203Mapdata.rdata"))
[1]  9.79  0.97 37.45    NA    NA
> system.time(load("C:/TEMP/0203Mapdata.rdata"))
[1] 10.07  0.18 10.49    NA    NA
> version
         _              
platform i386-pc-mingw32
arch     i386           
os       mingw32        
system   i386, mingw32  
status                  
major    1              
minor    7.1            
year     2003           
month    06             
day      16             
language R     

_________________________________________________

Tom Mulholland
Senior Policy Officer
WA Country Health Service
Tel: (08) 9222 4062

-----Original Message-----
From: Murray Jorgensen [mailto:maj at stats.waikato.ac.nz] 
Sent: Monday, 25 August 2003 5:16 PM
To: Prof Brian Ripley
Cc: R-help
Subject: Re: [R] R tools for large files

At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>I think that is only a medium-sized file.

"Large" for my purposes means "more than I really want to read into
memory" which in turn means "takes more than 30s". I'm at home now and
the file isn't so I'm not sure if the file is large or not.

More responses interspersed below. BTW, I forgot to mention that I'm
using Windows and so do not have nice unix tools readily available.

>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>
>> I'm wondering if anyone has written some functions or code for
>> handling very large files in R. I am working with a data file that is
>> 41 variables times who knows how many observations making up 27MB
>> altogether.
>> 
>> The sort of thing that I am thinking of having R do is
>> 
>> - count the number of lines in a file
>
>You can do that without reading the file into memory: use 
>system(paste("wc -l", filename))

Don't think that I can do that in Windows XP.

>or read in blocks of lines via a connection

But that does sound promising!
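
A minimal sketch of the connection idea, which needs no Unix tools: open
the file as a connection and call readLines repeatedly, so that only one
block of lines is in memory at a time. The function name count_lines and
the default block size are illustrative assumptions, not something from
the thread.

```r
# Count lines by reading fixed-size blocks through a connection.
count_lines <- function(path, block_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0L
  repeat {
    # Each call continues from where the previous one stopped.
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break   # end of file reached
    n <- n + length(block)
  }
  n
}

# Usage on a small temporary file (a stand-in for the real data file).
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("row %d", 1:25), tmp)
count_lines(tmp, block_size = 10L)
```

A larger block size means fewer calls to readLines at the cost of more
memory per block.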

>
>> - form a data frame by selecting all cases whose line numbers are in
>> a supplied vector (which could be used to extract random subfiles of
>> particular sizes)
>
>R should handle that easily in today's memory sizes.  Buy some more RAM
>if you don't already have 1/2Gb.  As others have said, for a real large
>file, use a RDBMS to do the selection for you.

It's just that R is so good in reading in initial segments of a file
that I can't believe that it can't be effective in reading more general
(pre-specified) subsets.
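
Reading a pre-specified set of lines can be sketched with the same
block-wise use of a connection: read a block, keep only the wanted
lines, discard the rest, and parse what was kept at the end. This is an
illustration under assumptions rather than a tested tool: the function
name read_selected_lines is hypothetical, the file is assumed to have no
header line, and at least one requested line number is assumed to exist
in the file.

```r
# Build a data frame from the lines whose numbers are in 'wanted'.
# Extra arguments (...) are passed on to read.table.
read_selected_lines <- function(path, wanted, block_size = 10000L, ...) {
  wanted <- sort(unique(as.integer(wanted)))
  stopifnot(length(wanted) > 0L)
  con <- file(path, open = "r")
  on.exit(close(con))
  kept <- character(0)
  offset <- 0L                       # lines consumed so far
  repeat {
    block <- readLines(con, n = block_size, warn = FALSE)
    if (length(block) == 0L) break   # end of file reached
    in_block <- wanted > offset & wanted <= offset + length(block)
    kept <- c(kept, block[wanted[in_block] - offset])
    offset <- offset + length(block)
    if (offset >= max(wanted)) break # nothing further is needed
  }
  tc <- textConnection(kept)
  on.exit(close(tc), add = TRUE)
  read.table(tc, ...)
}

# Usage: draw a random subfile of 5 cases from a 100-line file.
tmp <- tempfile(fileext = ".txt")
write.table(data.frame(id = 1:100, x = (1:100) * 2), tmp,
            row.names = FALSE, col.names = FALSE)
set.seed(1)
rows <- sample(100, 5)
sub <- read_selected_lines(tmp, rows, block_size = 30L,
                           col.names = c("id", "x"))
```

The rows come back in file order, not in the order of the supplied
vector, and reading stops as soon as the last wanted line has been seen.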

Murray

>
>-- 
>Brian D. Ripley,                  ripley at stats.ox.ac.uk
>Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
>University of Oxford,             Tel:  +44 1865 272861 (self)
>1 South Parks Road,                     +44 1865 272866 (PA)
>Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862
