[Rd] Importing csv files

Frank E Harrell Jr f.harrell at vanderbilt.edu
Thu Dec 23 17:43:07 CET 2004


Prof Brian Ripley wrote:
> I think we need to know what you mean by `large' and why read.table is 
> not fast enough (and hence if some of the planned improvements might be 
> all that is needed).

I was referring to the e-mail exchanges on r-help about read.table a few 
weeks ago.  There was also a new discussion the other day concerning RAM 
usage and read.table not knowing the number of rows up front.  I believe 
that the posters provided some timings and examples.
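
To make the row-count issue concrete: the read.table help page notes that supplying nrows (even as a mild over-estimate) and colClasses reduces memory use. The following is only a minimal sketch under that assumption. The file, its three columns, and the line-counting step are made up for illustration and are not taken from the r-help threads.

```r
# Build a small illustrative CSV in a temporary file (hypothetical data).
csv_path <- tempfile(fileext = ".csv")
example <- data.frame(id = 1:1000,
                      x  = seq(0, 1, length.out = 1000),
                      g  = rep(c("a", "b"), 500))
write.csv(example, csv_path, row.names = FALSE)

# Count the data rows up front (total lines minus the header line).
# readLines() pulls the file in as text, so this is simple rather than cheap.
n_rows <- length(readLines(csv_path)) - 1L

# Tell read.csv the number of rows and the column types so that it
# does not have to guess either one while reading.
dat <- read.csv(csv_path,
                nrows = n_rows,
                colClasses = c("integer", "numeric", "character"))

str(dat)
```

Whether this is fast enough for a given large file would still need to be timed on that file, for example with system.time().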

> 
> Could you make some examples available for profiling?
> 
> It seems to me that there are some delicate licensing issues in 
> distributing a product that writes .rda format except under GPL. See, 
> for example, the GPL FAQ.

My understanding is that David is not distributing dataload any more, 
though I would not like to discourage commercial vendors (such as 
providers of Stat/Transfer and DBMSCOPY) from providing .rda output as 
an option.  I assume that new code written under GPL would not be a 
problem.  -Frank

> 
> On Thu, 23 Dec 2004, Frank E Harrell Jr wrote:
> 
>> There is a recurring need for importing large csv files quickly.  
>> David Baird's dataload is a standalone program that will directly 
>> create .rda files from .csv (it also handles many other conversions).  
>> Unfortunately dataload is no longer publicly available because of some 
>> kind of relationship with Stat/Transfer.  The idea is a good one, 
>> though.  I wonder if anyone would volunteer to replicate the csv->rda 
>> standalone functionality or to provide some Perl or Python tools for 
>> making creation of .rda files somewhat easy outside of R.
>>
>> As an aside, I routinely see 30-fold reductions in file sizes for .rda 
>> files (made with save(..., compress=TRUE)) compared with the size of 
>> SAS binary datasets.  And load() times are fast.
>>
>> It's been a great year for R.  Let me take this opportunity to thank 
>> the R leaders for a fantastic job that gives immeasurable benefits to 
>> the community.
> 
> 
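
For readers unfamiliar with the csv-to-.rda round trip described in the quoted message, it can be sketched from within R itself as below. This is an illustration only, not a replacement for a standalone converter such as dataload. The file names and the data are hypothetical, and no particular size ratio is implied.

```r
# Hypothetical input: a small CSV written to a temporary file.
csv_path <- tempfile(fileext = ".csv")
rda_path <- tempfile(fileext = ".rda")
write.csv(data.frame(id = 1:5000, group = rep(c("a", "b"), 2500)),
          csv_path, row.names = FALSE)

# Read the CSV once, then store the data frame in compressed .rda form.
dat <- read.csv(csv_path)
save(dat, file = rda_path, compress = TRUE)

# Compare the on-disk sizes (in bytes) of the two files.
file.info(c(csv_path, rda_path))$size

# Later sessions can skip the CSV parse and restore the object directly.
rm(dat)
restored <- load(rda_path)   # load() returns the names of the restored objects
```

After load(), the data frame is available again under its original name, dat.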


-- 
Frank E Harrell Jr   Professor and Chair           School of Medicine
                      Department of Biostatistics   Vanderbilt University


