[R] Reading large files

Gabor Grothendieck ggrothendieck at gmail.com
Sat Feb 6 23:28:04 CET 2010


The software attempts to read the registry and temporarily augment the
path in case you have Rtools installed so that the filter can access
all the tools that Rtools provides.  I am not sure why it's failing on
your system, but there are evidently some differences between systems
here, so I have added some code to trap and bypass that portion in
case it fails.  I have added the new version to the svn repository, so
try this:

library(sqldf)
# overwrite with development version
source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
# your code to call read.csv.sql
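
For example, the failing call from your message, minus the arguments
that are just defaults anyway, would then look something like this:

test_df <- read.csv.sql("out_small.txt", filter = "perl parse_3wkout.pl",
  eol = "\n")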


On Sat, Feb 6, 2010 at 5:18 PM, Vadlamani, Satish {FLNA}
<SATISH.VADLAMANI at fritolay.com> wrote:
>
> Gabor:
> Here is the update. As you can see, I got the same error as before in item 1 below.
>
> 1. Error
>  test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
> Error in readRegistry(key, maxdepth = 3) :
>  Registry key 'SOFTWARE\R-core' not found
>
> 2. But the loading of the bigger file was successful, as you can see below: 857 MB, 333,250 rows, 227 columns. This is good.
>
> For now, I will just convert the file to csv with the Perl script from within R and then call read.csv.sql on the result.
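>
> Something along these lines is what I have in mind (untested sketch; it
> assumes parse_3wkout.pl filters stdin to stdout, as the filter= argument
> expects, and "out_small.csv" is just a placeholder name for the converted
> file):
>
> # run the Perl converter from R, writing a csv copy of the data
> shell("perl parse_3wkout.pl < 3wkoutstatfcst_small.dat > out_small.csv")
> # then load the csv with read.csv.sql as usual
> test_df <- read.csv.sql(file = "out_small.csv")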
>
> If you have any suggestions to fix 1, I would like to try them.
>
>  system.time(test_df <- read.csv.sql(file="out.txt"))
>   user  system elapsed
>  192.53   15.50  213.68
> Warning message:
> closing unused connection 3 (out.txt)
>
> Thanks again.
>
> Satish
>
> -----Original Message-----
> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
> Sent: Saturday, February 06, 2010 3:02 PM
> To: Vadlamani, Satish {FLNA}
> Cc: r-help at r-project.org
> Subject: Re: [R] Reading large files
>
> Note that you can shorten #1 to read.csv.sql("out.txt") since your
> other arguments are the default values.
>
> For the second one, use read.csv.sql, eliminate the arguments that are
> defaults anyway (they should not cause a problem, but they are error prone) and
> add an explicit eol= argument since SQLite can have problems with end
> of line in some cases.  Also test out your perl script separately from
> R first to ensure that it works:
>
> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl
> parse_3wkout.pl", eol = "\n")
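>
> A quick way to check the filter on its own from within R is something
> like this (a sketch; the same command also works directly at the Windows
> command prompt):
>
> # pipe the data file through the perl filter and just display the output
> shell("perl parse_3wkout.pl < 3wkoutstatfcst_small.dat | more")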
>
> SQLite has some known problems with end of line, so try it with and
> without the eol= argument just in case.  When I made up the gawk
> example mentioned below, I noticed that I did need to specify the eol=
> argument.
>
> Also I have added a complete example using gawk as Example 13c on the
> home page just now:
> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
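>
> In outline, that kind of call looks like this (a sketch only; the file
> and script names here are placeholders, not the ones used on that page):
>
> # gawk reformats the fixed width input to csv on stdout, and
> # read.csv.sql loads the result into SQLite and from there into R
> test_df <- read.csv.sql("fixed.dat", filter = "gawk -f fixed2csv.awk",
>   eol = "\n")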
>
>
> On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA}
> <SATISH.VADLAMANI at fritolay.com> wrote:
>> Gabor:
>>
>> I had success with the following.
>> 1. I created a csv file called "out.txt" with a perl script. Then I ran the following successfully:
>> library("sqldf")
>> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>>
>> 2. I did not have success with the following. Could you tell me what I may be doing wrong? I could paste the perl script if necessary. In the perl script, I read the file, create the csv records, print them one by one, and then exit.
>>
>> Thanks.
>>
>> This is the call that did not work:
>> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>> test_df
>>
>> Error message below:
>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>> Error in readRegistry(key, maxdepth = 3) :
>>  Registry key 'SOFTWARE\R-core' not found
>> In addition: Warning messages:
>> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
>> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
>> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
>> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
>> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>> Error in readRegistry(key, maxdepth = 3) :
>>  Registry key 'SOFTWARE\R-core' not found
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>> Sent: Saturday, February 06, 2010 12:14 PM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Reading large files
>>
>> No.
>>
>> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA}
>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>> Gabor:
>>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>>> Satish
>>>
>>>
>>> -----Original Message-----
>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>> Sent: Saturday, February 06, 2010 9:41 AM
>>> To: Vadlamani, Satish {FLNA}
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> It's just any Windows batch command string that filters stdin to
>>> stdout.  What the command consists of should not be important.  An
>>> invocation of perl that runs such a script might look like this:
>>>  read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>>
>>> For an actual example see the source of read.csv2.sql which defaults
>>> to using a Windows vbscript program as a filter.
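>>>
>>> To see that source, just print the function at the R prompt:
>>>
>>> library(sqldf)
>>> read.csv2.sql   # typing the name without () shows the function's source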
>>>
>>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA}
>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>> Jim, Gabor:
>>>> Thanks so much for the suggestions on using read.csv.sql with an embedded Perl (or gawk) filter. I just want to mention that I am running on Windows. I am going to read the documentation for the filter argument and see whether it can take a decent-sized Perl script and then use its output as input.
>>>>
>>>> Suppose that I write a Perl script that parses this fixed-width file and creates a CSV file. Can I embed this within the read.csv.sql call, or can the filter only be a single statement? If you know the answer, please let me know. Otherwise, I will try a few things and report back the results.
>>>>
>>>> Thanks again.
>>>> Satish
>>>>
>>>>
>>>> -----Original Message-----
>>>> From: jim holtman [mailto:jholtman at gmail.com]
>>>> Sent: Saturday, February 06, 2010 6:16 AM
>>>> To: Gabor Grothendieck
>>>> Cc: Vadlamani, Satish {FLNA}; r-help at r-project.org
>>>> Subject: Re: [R] Reading large files
>>>>
>>>> In perl the 'unpack' function makes it very easy to parse fixed-width data.
>>>>
>>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck
>>>> <ggrothendieck at gmail.com> wrote:
>>>>> Note that the filter= argument on read.csv.sql can be used to pass the
>>>>> input through a filter written in perl, [g]awk or other language.
>>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>>
>>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed
>>>>> width fields, e.g.
>>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html
>>>>> making this very easy but perl or whatever you are most used to would
>>>>> be fine too.
>>>>>
>>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA}
>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>> Hi Gabor:
>>>>>> Thanks. My files are all in fixed width format, and there are a lot of them. It would take me some effort to convert them to CSV. I guess this cannot be avoided? I can write some Perl scripts to convert the fixed width format to CSV and then start with your suggestion. Could you let me know your thoughts on this approach?
>>>>>> Satish
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>> Cc: r-help at r-project.org
>>>>>> Subject: Re: [R] Reading large files
>>>>>>
>>>>>> If your problem is just how long it takes to load the file into R try
>>>>>> read.csv.sql in the sqldf package.  A single read.csv.sql call can
>>>>>> create an SQLite database and table layout for you, read the file into
>>>>>> the database (without going through R so R can't slow this down),
>>>>>> extract all or a portion into R based on the sql argument you give it
>>>>>> and then remove the database.  See the examples on the home page:
>>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
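>>>>>>
>>>>>> As a sketch of the "portion" case (the file and column names here are
>>>>>> made up; the input is referred to as the table "file" in the sql
>>>>>> statement, as in the examples above):
>>>>>>
>>>>>> library(sqldf)
>>>>>> # only the rows matching the sql condition are brought into R
>>>>>> DF <- read.csv.sql("myfile.csv",
>>>>>>   sql = "select * from file where region = 'east'")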
>>>>>>
>>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani
>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>>
>>>>>>> Matthew:
>>>>>>> If it is going to help, here is the explanation. I have an end state in
>>>>>>> mind; it is given below under the "End State" header. In order to get there, I
>>>>>>> need to start somewhere, right? I started with an 850 MB file and could not
>>>>>>> load it in what I think is a reasonable time (I waited for an hour).
>>>>>>>
>>>>>>> There are references to 64 bit. How will that help? This is a machine with 4 GB of RAM,
>>>>>>> and there is no paging activity when loading the 850 MB file.
>>>>>>>
>>>>>>> I have seen other threads on the same types of questions. I did not see any
>>>>>>> clear-cut answers, or any mistakes that I might be making in the process. If
>>>>>>> I am missing something, please let me know. Thanks.
>>>>>>> Satish
>>>>>>>
>>>>>>>
>>>>>>> End State
>>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>>
>>>>>>>
>>>>>>> -----
>>>>>>> Satish Vadlamani
>>>>>>> --
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Jim Holtman
>>>> Cincinnati, OH
>>>> +1 513 646 9390
>>>>
>>>> What is the problem that you are trying to solve?
>>>>
>>>
>>
>
