[R] Reading large files

Sat Feb 6 23:18:51 CET 2010

Gabor:
Here is the update. As you can see, I got the same error as below in 1.

1. Error
 test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
Error in readRegistry(key, maxdepth = 3) : 
  Registry key 'SOFTWARE\R-core' not found 

2. But the loading of the bigger file was successful as you can see below. 857 MB, 333,250 rows, 227 columns. This is good.

I will have to just do an inline edit in Perl and change the file to csv from within R and then call the read.csv.sql. 

If you have any suggestions to fix 1, I would like to try them.

 system.time(test_df <- read.csv.sql(file="out.txt"))
   user  system elapsed 
 192.53   15.50  213.68 
Warning message:
closing unused connection 3 (out.txt) 

Thanks again.

Satish

-----Original Message-----
From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com] 
Sent: Saturday, February 06, 2010 3:02 PM
To: Vadlamani, Satish {FLNA}
Cc: r-help at r-project.org
Subject: Re: [R] Reading large files

Note that you can shorten #1 to read.csv.sql("out.txt") since your
other arguments are the default values.

For the second one, use read.csv.sql, eliminate the arguments that are
defaults anyways (should not cause a problem but its error prone) and
add an explicit eol= argument since SQLite can have problems with end
of line in some cases.  Also test out your perl script separately from
R first to ensure that it works:

test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl
parse_3wkout.pl", eol = "\n")

SQLite has some known problems with end of line so try it with and
without the eol= argument just in case.  When I just made up the
following gawk example I noticed that I did need to specify the eol=
argument.

Also I have added a complete example using gawk as Example 13c on the
home page just now:
http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql

On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
> Gabor:
>
> I had success with the following.
> 1. I created a csv file with a perl script called "out.txt". Then ran the following successfully
> library("sqldf")
> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>
> 2. I did not have success with the following. Could you tell me what I may be doing wrong? I could paste the perl script if necessary. From the perl script, I am reading the file, creating the csv record and printing each record one by one and then exiting.
>
> Thanks.
>
> Not had success with below..
> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
> test_df
>
> Error message below:
> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
> Error in readRegistry(key, maxdepth = 3) :
>  Registry key 'SOFTWARE\R-core' not found
> In addition: Warning messages:
> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
> Error in readRegistry(key, maxdepth = 3) :
>  Registry key 'SOFTWARE\R-core' not found
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> Sent: Saturday, February 06, 2010 12:14 PM
> To: Vadlamani, Satish {FLNA}
> Cc: r-help at r-project.org
> Subject: Re: [R] Reading large files
>
> No.
>
> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA}
> <SATISH.VADLAMANI at fritolay.com> wrote:
>> Gabor:
>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>> Satish
>>
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>> Sent: Saturday, February 06, 2010 9:41 AM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Reading large files
>>
>> Its just any Windows batch command string that filters stdin to
>> stdout.  What the command consists of should not be important.   An
>> invocation of perl that runs a perl script that filters stdin to
>> stdout might look like this:
>>  read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>
>> For an actual example see the source of read.csv2.sql which defaults
>> to using a Windows vbscript program as a filter.
>>
>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA}
>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>> Jim, Gabor:
>>> Thanks so much for the suggestions where I can use read.csv.sql and embed Perl (or gawk). I just want to mention that I am running on Windows. I am going to read the documentation the filter argument and see if it can take a decent sized Perl script and then use its output as input.
>>>
>>> Suppose that I write a Perl script that parses this fwf file and creates a CSV file. Can I embed this within the read.csv.sql call? Or, can it only be a statement or something? If you know the answer, please let me know. Otherwise, I will try a few things and report back the results.
>>>
>>> Thanks again.
>>> Saitsh
>>>
>>>
>>> -----Original Message-----
>>> From: jim holtman [mailto:jholtman at gmail.com]
>>> Sent: Saturday, February 06, 2010 6:16 AM
>>> To: Gabor Grothendieck
>>> Cc: Vadlamani, Satish {FLNA}; r-help at r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> In perl the 'unpack' command makes it very easy to parse fixed fielded data.
>>>
>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck
>>> <ggrothendieck at gmail.com> wrote:
>>>> Note that the filter= argument on read.csv.sql can be used to pass the
>>>> input through a filter written in perl, [g]awk or other language.
>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>
>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed
>>>> width fields, e.g.
>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html
>>>> making this very easy but perl or whatever you are most used to would
>>>> be fine too.
>>>>
>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA}
>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>> Hi Gabor:
>>>>> Thanks. My files are all in fixed width format. They are a lot of them. It would take me some effort to convert them to CSV. I guess this cannot be avoided? I can write some Perl scripts to convert fixed width format to CSV format and then start with your suggestion. Could you let me know your thoughts on the approach?
>>>>> Satish
>>>>>
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>> To: Vadlamani, Satish {FLNA}
>>>>> Cc: r-help at r-project.org
>>>>> Subject: Re: [R] Reading large files
>>>>>
>>>>> If your problem is just how long it takes to load the file into R try
>>>>> read.csv.sql in the sqldf package.  A single read.csv.sql call can
>>>>> create an SQLite database and table layout for you, read the file into
>>>>> the database (without going through R so R can't slow this down),
>>>>> extract all or a portion into R based on the sql argument you give it
>>>>> and then remove the database.  See the examples on the home page:
>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>>>
>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani
>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>
>>>>>> Matthew:
>>>>>> If it is going to help, here is the explanation. I have an end state in
>>>>>> mind. It is given below under "End State" header. In order to get there, I
>>>>>> need to start somewhere right? I started with a 850 MB file and could not
>>>>>> load in what I think is reasonable time (I waited for an hour).
>>>>>>
>>>>>> There are references to 64 bit. How will that help? It is a 4GB RAM machine
>>>>>> and there is no paging activity when loading the 850 MB file.
>>>>>>
>>>>>> I have seen other threads on the same types of questions. I did not see any
>>>>>> clear cut answers or errors that I could have been making in the process. If
>>>>>> I am missing something, please let me know. Thanks.
>>>>>> Satish
>>>>>>
>>>>>>
>>>>>> End State
>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>
>>>>>>
>>>>>> -----
>>>>>> Satish Vadlamani
>>>>>> --
>>>>>> View this message in context: http://n4.nabble.com/Reading-large-files-tp1469691p1470667.html
>>>>>> Sent from the R help mailing list archive at Nabble.com.
>>>>>>
>>>>>> ______________________________________________
>>>>>> R-help at r-project.org mailing list
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>>>
>>>>>
>>>>
>>>> ______________________________________________
>>>> R-help at r-project.org mailing list
>>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>>>> and provide commented, minimal, self-contained, reproducible code.
>>>>
>>>
>>>
>>>
>>> --
>>> Jim Holtman
>>> Cincinnati, OH
>>> +1 513 646 9390
>>>
>>> What is the problem that you are trying to solve?
>>>
>>
>