[Rd] Importing csv files

Thu Dec 23 18:31:37 CET 2004

On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:

> Prof Brian Ripley wrote:
>> I think we need to know what you mean by `large' and why read.table is not 
>> fast enough (and hence if some of the planned improvements might be all 
>> that is needed).
>
> I was referring to the e-mail exchanges on r-help about read.table a few 
> weeks ago, then there was a new discussion the other day concerning RAM usage 
> and read.table not knowing the number of rows up front.  I believe that the 
> posters provided some timings and examples.

I have yet to see any that used read.table competently and were still slow 
(although the RAM usage could be higher than some people expected). 
Unless people have followed _all_ the hints in the Data manual, I don't 
think there is anything to discuss.
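
For illustration only, here is a minimal sketch of the kind of call those hints point towards, assuming the usual advice of declaring column types and an upper bound on the row count up front. The file, column names and types are made up for the example, and the Data manual remains the authority on what is actually recommended.

```r
# Hypothetical data: write a small CSV so the sketch is self-contained.
csv_path <- tempfile(fileext = ".csv")
n <- 1000
example <- data.frame(id = seq_len(n),
                      value = seq_len(n) / 10,
                      group = rep(c("a", "b"), length.out = n),
                      stringsAsFactors = FALSE)
write.csv(example, csv_path, row.names = FALSE)

# Tell read.table as much as possible in advance:
#   colClasses   - avoids guessing each column's type from the text
#   nrows        - an upper bound on the number of rows to read
#   comment.char - "" switches off scanning for comment characters
dat <- read.table(csv_path,
                  header = TRUE,
                  sep = ",",
                  quote = "\"",
                  colClasses = c("integer", "numeric", "character"),
                  nrows = n,
                  comment.char = "")

str(dat)
```

Whether this makes a difference worth discussing depends on the file; timing the call with system.time() on the real data, with and without these arguments, is the way to find out.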

There is an issue with reading factors with just a few unique values, but 
that is one of the things being worked on.

>> Could you make some examples available for profiling?

Anyone who actually has a problem, then?

>> It seems to me that there are some delicate licensing issues in 
>> distributing a product that writes .rda format except under GPL. See, for 
>> example, the GPL FAQ.
>
> My understanding is that David is not distributing dataload any more, though 
> I would not like to discourage commercial vendors (such as providers of 
> Stat/Transfer and DBMSCOPY) from providing .rda output as an option.  I 
> assume that new code written under GPL would not be a problem.  -Frank

I said `except under GPL'.  I am not trying to discourage anyone, just 
pointing out that the GPL has far-ranging implications that are often 
overlooked.

>> On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
>> 
>>> There is a recurring need for importing large csv files quickly.  David 
>>> Baird's dataload is a standalone program that will directly create .rda 
>>> files from .csv (it also handles many other conversions).  Unfortunately 
>>> dataload is no longer publicly available because of some kind of 
>>> relationship with Stat/Transfer.  The idea is a good one, though.  I 
>>> wonder if anyone would volunteer to replicate the csv->rda standalone 
>>> functionality or to provide some Perl or Python tools for making creation 
>>> of .rda files somewhat easy outside of R.
>>> 
>>> As an aside, I routinely see 30-fold reductions in file sizes for .rda 
>>> files (made with save(..., compress=TRUE)) compared with the size of SAS 
>>> binary datasets.  And load( ) times are fast.
>>> 
>>> It's been a great year for R.  Let me take this opportunity to thank the R 
>>> leaders for a fantastic job that gives immeasurable benefits to the 
>>> community.

It's certainly been a great year for people to complain about R, R-help 
....  We say

 	R is a collaborative project with many contributors.

but it seems to me much less than it used to be.

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595