[R] naive question

rivin at euclid.math.temple.edu
Wed Jun 30 04:46:11 CEST 2004


> Igor Rivin wrote:
>
>> I was not particularly annoyed, just disappointed, since R seems like
>> a much better thing than SAS in general, and doing everything with a
>> combination of hand-rolled tools is too much work. However, I do need
>> to work with very large data sets, and if it takes 20 minutes to read
>> them in, I have to explore other options (one of which might be
>> S-PLUS, which claims scalability as a major, er, PLUS over R).
>
>
> If you are routinely working with very large data sets it would be
> worthwhile learning to use a relational database (PostgreSQL, MySQL,
> even Access) to store the data and then access it from R with RODBC or
> one of the specialized database packages.
>
I was thinking about that, but I had assumed it would help mainly with
reading small pieces of the data (since the subsetting would happen on the
database side), and not so much with reading big chunks. It's certainly
worth a try, though....

> R is slow reading ASCII files because it is assembling the meta-data on
> the fly and it is continually checking the types of the variables being
> read.  If you know all this information and build it into your table
> definitions, reading the data will be much faster.

What do you mean by meta-data? Anyway, I agree that this would slow things
down, but I suspect there is still a fair bit of room for improvement:
five minutes for 12 million tokens comes out to about 40,000 tokens/second,
which is really pretty bad on a 2-3 GHz machine...
>
> A disadvantage of this approach is the need to learn yet another
> language and system.  I was going to do an example but found I could not
> because I left all my SQL books at home (I'm travelling at the moment)
> and I couldn't remember the particular commands for loading a table from
> an ASCII file.

Well, I will look into it (among other possibilities).
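
(For the loading step itself, one option seems to be to read the file once
in R and push it into the database with RODBC's sqlSave -- again untested,
with the DSN and table names invented:)

    library(RODBC)
    ch  <- odbcConnect("mydb")                   # hypothetical DSN
    dat <- read.table("big.dat", header = TRUE)  # one slow read of the ASCII file,
    sqlSave(ch, dat, tablename = "bigtable",     # then store it so later subsets
            rownames = FALSE)                    # can be done on the database side
    odbcClose(ch)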



