[R] R tools for large files

Murray Jorgensen maj at stats.waikato.ac.nz
Tue Aug 26 01:45:28 CEST 2003


I would like to thank those who have responded and especially Brian 
Ripley for making his unix tools for Windows available. A colleague has 
also mentioned to me the set of unix tools called Cygwin.

Two things that can be done with R alone are to read the first n lines 
of a file into n strings with readLines() and to scan in a block of the 
file after skipping a number of lines.
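
A minimal sketch of those two calls, run against a small stand-in file rather than the real data. The temporary file and its two-column layout are assumptions made only so the example is self-contained.

```r
# Build a small stand-in file (20 lines, 2 numeric fields per line).
path <- tempfile(fileext = ".txt")
writeLines(paste(1:20, (1:20)^2), path)

# The first n lines of the file as n character strings.
first_lines <- readLines(path, n = 5)

# A block of the file: skip 10 lines, then scan the next 4 lines.
# scan() returns a flat numeric vector, so reshape it by row.
block <- scan(path, skip = 10, nlines = 4, quiet = TRUE)
block <- matrix(block, ncol = 2, byrow = TRUE)

print(first_lines)
print(block)
```

Note that scan() with skip= starts again from the top of the file on every call, so repeated block reads deep into a file are likely to be cheaper through an open connection.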

I will probably use Fortran to extract subsets of the file as I need to 
use it for other things that I am planning to do with the file.

I'll maybe also play a bit with readLines() and writeLines() inside 
loops to see if I can build up my random subsets of files this way.
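
One way such a loop might look, as a sketch only: the helper name extract_lines(), its chunk argument, and the 500-row stand-in file are all hypothetical. It reads the input through an open connection a block at a time and writes out only the lines whose numbers appear in a supplied vector.

```r
# Hypothetical helper: copy the lines whose numbers are in `keep` from
# `infile` to `outfile`, holding at most `chunk` lines in memory at once.
extract_lines <- function(infile, outfile, keep, chunk = 1000) {
  keep <- sort(unique(keep))
  con_in <- file(infile, open = "r")
  on.exit(close(con_in), add = TRUE)
  con_out <- file(outfile, open = "w")
  on.exit(close(con_out), add = TRUE)
  offset <- 0  # number of lines read so far
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0) break
    # This chunk holds line numbers offset + 1 ... offset + length(lines).
    wanted <- keep[keep > offset & keep <= offset + length(lines)] - offset
    if (length(wanted) > 0) writeLines(lines[wanted], con_out)
    offset <- offset + length(lines)
  }
  invisible(offset)  # total number of lines seen
}

# Stand-in data: 41 columns, 500 rows; column 1 equals the line number.
infile <- tempfile()
write.table(matrix(seq_len(500 * 41), nrow = 500), infile,
            row.names = FALSE, col.names = FALSE)

set.seed(1)
keep <- sample(500, 50)          # a random subset of 50 line numbers
outfile <- tempfile()
total <- extract_lines(infile, outfile, keep, chunk = 64)
subset_df <- read.table(outfile) # the selected cases as a data frame
```

The selected lines come out in file order, not in the order of the supplied vector, because the file is read in a single pass.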

BTW, I now estimate the file at about 100,000 lines so indeed, it is not 
all that large!
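
An estimate like this can be checked from within R without holding the whole file in memory. The sketch below assumes a hypothetical helper count_lines() and uses a 2,500-line stand-in file; it reads blocks of lines through a connection and keeps only the running count.

```r
# Hypothetical helper: count the lines in a file by reading it in blocks
# through an open connection, keeping only the running total.
count_lines <- function(path, chunk = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  n <- 0
  repeat {
    got <- length(readLines(con, n = chunk))
    if (got == 0) break
    n <- n + got
  }
  n
}

# Stand-in file with a known number of lines.
path <- tempfile()
writeLines(as.character(1:2500), path)
n_lines <- count_lines(path, chunk = 1000)
print(n_lines)
```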

Murray Jorgensen

Prof Brian Ripley wrote:

> On Mon, 25 Aug 2003, Murray Jorgensen wrote:
> 
> 
>>At 08:12 25/08/2003 +0100, Prof Brian Ripley wrote:
>>
>>>I think that is only a medium-sized file.
>>
>>"Large" for my purposes means "more than I really want to read into memory"
>>which in turn means "takes more than 30s". I'm at home now and the file
>>isn't so I'm not sure if the file is large or not.
>>
>>More responses interspersed below. BTW, I forgot to mention that I'm using
>>Windows and so do not have nice unix tools readily available.
> 
> 
> But you do, thanks to me, as you need them to install R packages.
> 
> 
>>>On Mon, 25 Aug 2003, Murray Jorgensen wrote:
>>>
>>>
>>>>I'm wondering if anyone has written some functions or code for handling 
>>>>very large files in R. I am working with a data file that is 41 
>>>>variables times who knows how many observations making up 27MB altogether.
>>>>
>>>>The sort of thing that I am thinking of having R do is
>>>>
>>>>- count the number of lines in a file
>>>
>>>You can do that without reading the file into memory: use
>>>system(paste("wc -l", filename)) 
>>
>>Don't think that I can do that in Windows XL.
> 
> 
> I presume you mean Windows XP?  Of course you can, and wc.exe is in 
> Rtools.zip!
> 
> 
>>>or read in blocks of lines via a connection
>>
>>But that does sound promising!
>>
>>
>>>>- form a data frame by selecting all cases whose line numbers are in a 
>>>>supplied vector (which could be used to extract random subfiles of 
>>>>particular sizes)
>>>
>>>R should handle that easily in today's memory sizes.  Buy some more RAM if 
>>>you don't already have 1/2Gb.  As others have said, for a really large file,
>>>use an RDBMS to do the selection for you.
>>
>>It's just that R is so good in reading in initial segments of a file that I
>>can't believe that it can't be effective in reading more general
>>(pre-specified) subsets.
> 
> 

-- 
Dr Murray Jorgensen      http://www.stats.waikato.ac.nz/Staff/maj.html
Department of Statistics, University of Waikato, Hamilton, New Zealand
Email: maj at waikato.ac.nz                                Fax 7 838 4155
Phone  +64 7 838 4773 wk    +64 7 849 6486 home    Mobile 021 1395 862



