[R] Reading large files

Gabor Grothendieck ggrothendieck at gmail.com
Sun Feb 7 02:58:04 CET 2010


By the way, if you use the H2 database with sqldf then there is a
second way to read files into R using sqldf.

# 1. run your perl program outside of R to create myfile.csv, say.

# 2. install java from http://java.sun.com
# and then install the RH2 package from CRAN
install.packages("RH2")

# 3. load sqldf and RH2
# sqldf automatically uses the H2 database if RH2 is loaded
library(RH2)
library(sqldf)

# 4. read the file using sqldf, making use of the CSVREAD function in H2
DF <- sqldf("select * from CSVREAD('myfile.csv')")
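
# 5. since the sql statement is ordinary SQL you can also restrict
# what gets pulled into R at read time; col1 and col2 here are
# hypothetical column names taken from the header of myfile.csv
DF2 <- sqldf("select col1, col2 from CSVREAD('myfile.csv') where col1 > 0")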


On Sat, Feb 6, 2010 at 8:37 PM, Gabor Grothendieck
<ggrothendieck at gmail.com> wrote:
> file= is the input data file. filter= is just a command string that
> specifies a program to run (not a data file).
>
> 1. If Filename.tmp is the name of a temporary file (that it creates)
> it runs a batch command similar to this:
>      paste("cmd /c", filter, "<", file, ">", Filename.tmp)
>
> 2. Then it reads Filename.tmp into the database (which it also
> creates for you), without involving R, and
>
> 3. finally it reads the resulting table in the database into R as a
> data frame and destroys the database.
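>
> In R terms, the effect is roughly this (a simplified sketch, not the
> actual sqldf source):
>
> Filename.tmp <- tempfile()
> system(paste("cmd /c", filter, "<", file, ">", Filename.tmp))
> # the database engine then imports Filename.tmp directly, bypassing R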
>
>
> On Sat, Feb 6, 2010 at 7:53 PM, Vadlamani, Satish {FLNA}
> <SATISH.VADLAMANI at fritolay.com> wrote:
>> Gabor:
>> It did suppress the message now, and I was able to load the data. A question:
>>
>> 1. test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl parse_3wkout.pl")
>>
>> In the statement above, should the file name in file= be the same as the file name that the perl script reads via the filter= command? I would think not. I would say that if filter= is passed to the statement, then the file name should be ignored. Is this how it works?
>>
>> Thanks.
>> Satish
>>
>>
>> -----Original Message-----
>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>> Sent: Saturday, February 06, 2010 4:58 PM
>> To: Vadlamani, Satish {FLNA}
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Reading large files
>>
>> I have uploaded another version which suppresses display of the error
>> message but otherwise works the same.  Omitting the redundant
>> arguments we have:
>>
>> library(sqldf)
>> # next line is only needed once per session to read in devel version
>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>>
>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl
>> parse_3wkout.pl")
>>
>>
>> On Sat, Feb 6, 2010 at 5:48 PM, Vadlamani, Satish {FLNA}
>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>> Gabor:
>>> Please see the results below. Sourcing your new R script worked (although the same error message still appears). If I include the eol="\n" option, it adds a "\r" to the last column, so I took out the eol option below. This is just some more feedback for you.
>>>
>>> I am thinking that I will just do an inline edit in Perl (that is, create the csv file through Perl by overwriting the current file) and then use read.csv.sql without the filter= option, along the lines of the sketch below. This seems more tried and tested. If you have any suggestions, please let me know. Thanks.
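>>>
>>> Roughly (untested; the csv file name is illustrative):
>>>
>>> system("perl parse_3wkout.pl")  # perl script itself writes the csv
>>> test_df <- read.csv.sql("3wkoutstatfcst_small.csv")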
>>> Satish
>>>
>>>
>>> BEFORE SOURCING YOUR NEW R SCRIPT
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>>> Error in readRegistry(key, maxdepth = 3) :
>>>  Registry key 'SOFTWARE\R-core' not found
>>>> test_df
>>> Error: object 'test_df' not found
>>>
>>> AFTER SOURCING YOUR NEW R SCRIPT
>>>> source("f:/dp_modeling_team/downloads/R/sqldf.R")
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl")
>>> Error in readRegistry(key, maxdepth = 3) :
>>>  Registry key 'SOFTWARE\R-core' not found
>>> In addition: Warning messages:
>>> 1: closing unused connection 5 (3wkoutstatfcst_small.dat)
>>> 2: closing unused connection 4 (3wkoutstatfcst_small.dat)
>>> 3: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>> test_df
>>>   allgeo area1 zone dist ccust1 whse bindc ccust2 account area2 ccust3
>>> 1       A     4    1   37     99 4925  4925     99      99     4     99
>>> 2       A     4    1   37     99 4925  4925     99      99     4     99
>>>
>>> -----Original Message-----
>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>> Sent: Saturday, February 06, 2010 4:28 PM
>>> To: Vadlamani, Satish {FLNA}
>>> Cc: r-help at r-project.org
>>> Subject: Re: [R] Reading large files
>>>
>>> The software attempts to read the registry and temporarily augment the
>>> path in case you have Rtools installed, so that the filter can access
>>> all the tools that Rtools provides.  I am not sure why it's failing on
>>> your system, but there are evidently some differences between systems
>>> here, so I have added some code to trap and bypass that portion in
>>> case it fails.  I have added the new version to the svn repository, so
>>> try this:
>>>
>>> library(sqldf)
>>> # overwrite with development version
>>> source("http://sqldf.googlecode.com/svn/trunk/R/sqldf.R")
>>> # your code to call read.csv.sql
>>>
>>>
>>> On Sat, Feb 6, 2010 at 5:18 PM, Vadlamani, Satish {FLNA}
>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>
>>>> Gabor:
>>>> Here is the update. As you can see in 1 below, I got the same error.
>>>>
>>>> 1. Error
>>>>  test_df <- read.csv.sql(file="out_small.txt", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", eol="\n")
>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>  Registry key 'SOFTWARE\R-core' not found
>>>>
>>>> 2. But the loading of the bigger file was successful, as you can see below: 857 MB, 333,250 rows, 227 columns. This is good.
>>>>
>>>> I will have to just do an inline edit in Perl to change the file to csv (invoked from within R) and then call read.csv.sql.
>>>>
>>>> If you have any suggestions to fix 1, I would like to try them.
>>>>
>>>>  system.time(test_df <- read.csv.sql(file="out.txt"))
>>>>   user  system elapsed
>>>>  192.53   15.50  213.68
>>>> Warning message:
>>>> closing unused connection 3 (out.txt)
>>>>
>>>> Thanks again.
>>>>
>>>> Satish
>>>>
>>>> -----Original Message-----
>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>> Sent: Saturday, February 06, 2010 3:02 PM
>>>> To: Vadlamani, Satish {FLNA}
>>>> Cc: r-help at r-project.org
>>>> Subject: Re: [R] Reading large files
>>>>
>>>> Note that you can shorten #1 to read.csv.sql("out.txt") since your
>>>> other arguments are the default values.
>>>>
>>>> For the second one, use read.csv.sql, eliminate the arguments that are
>>>> defaults anyway (they should not cause a problem, but specifying them
>>>> is error prone) and add an explicit eol= argument, since SQLite can
>>>> have problems with end of line in some cases.  Also test out your perl
>>>> script separately from R first to ensure that it works:
>>>>
>>>> test_df <- read.csv.sql(file="3wkoutstatfcst_small.dat", filter="perl
>>>> parse_3wkout.pl", eol = "\n")
>>>>
>>>> SQLite has some known problems with end of line so try it with and
>>>> without the eol= argument just in case.  When I just made up the
>>>> following gawk example I noticed that I did need to specify the eol=
>>>> argument.
>>>>
>>>> Also I have added a complete example using gawk as Example 13c on the
>>>> home page just now:
>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
>>>>
>>>>
>>>> On Sat, Feb 6, 2010 at 3:52 PM, Vadlamani, Satish {FLNA}
>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>> Gabor:
>>>>>
>>>>> I had success with the following.
>>>>> 1. I created a csv file, "out.txt", with a perl script, and then ran the following successfully:
>>>>> library("sqldf")
>>>>> test_df <- read.csv.sql(file="out.txt", sql = "select * from file", header = TRUE, sep = ",", dbname = tempfile())
>>>>>
>>>>> 2. I did not have success with the following. Could you tell me what I may be doing wrong? I can paste the perl script if necessary. In the perl script, I read the file, create each csv record, print the records one by one, and then exit.
>>>>>
>>>>> Thanks.
>>>>>
>>>>> I did not have success with the call below:
>>>>> #test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> test_df
>>>>>
>>>>> Error message below:
>>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>>  Registry key 'SOFTWARE\R-core' not found
>>>>> In addition: Warning messages:
>>>>> 1: closing unused connection 14 (3wkoutstatfcst_small.dat)
>>>>> 2: closing unused connection 13 (3wkoutstatfcst_small.dat)
>>>>> 3: closing unused connection 11 (3wkoutstatfcst_small.dat)
>>>>> 4: closing unused connection 9 (3wkoutstatfcst_small.dat)
>>>>> 5: closing unused connection 3 (3wkoutstatfcst_small.dat)
>>>>>> test_df <- read.csv2.sql(file="3wkoutstatfcst_small.dat", sql = "select * from file", header = TRUE, sep = ",", filter="perl parse_3wkout.pl", dbname = tempfile())
>>>>> Error in readRegistry(key, maxdepth = 3) :
>>>>>  Registry key 'SOFTWARE\R-core' not found
>>>>>
>>>>> -----Original Message-----
>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>> Sent: Saturday, February 06, 2010 12:14 PM
>>>>> To: Vadlamani, Satish {FLNA}
>>>>> Cc: r-help at r-project.org
>>>>> Subject: Re: [R] Reading large files
>>>>>
>>>>> No.
>>>>>
>>>>> On Sat, Feb 6, 2010 at 1:01 PM, Vadlamani, Satish {FLNA}
>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>> Gabor:
>>>>>> Can I pass colClasses as a vector to read.csv.sql? Thanks.
>>>>>> Satish
>>>>>>
>>>>>>
>>>>>> -----Original Message-----
>>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>>> Sent: Saturday, February 06, 2010 9:41 AM
>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>> Cc: r-help at r-project.org
>>>>>> Subject: Re: [R] Reading large files
>>>>>>
>>>>>> It's just any Windows batch command string that filters stdin to
>>>>>> stdout.  What the command consists of should not be important.  An
>>>>>> invocation of perl that runs a perl script that filters stdin to
>>>>>> stdout might look like this:
>>>>>>  read.csv.sql("myfile.dat", filter = "perl myprog.pl")
>>>>>>
>>>>>> For an actual example see the source of read.csv2.sql which defaults
>>>>>> to using a Windows vbscript program as a filter.
>>>>>>
>>>>>> On Sat, Feb 6, 2010 at 10:16 AM, Vadlamani, Satish {FLNA}
>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>> Jim, Gabor:
>>>>>>> Thanks so much for the suggestions about using read.csv.sql with an embedded Perl (or gawk) filter. I just want to mention that I am running on Windows. I am going to read the documentation for the filter argument and see if it can take a decent sized Perl script and then use its output as input.
>>>>>>>
>>>>>>> Suppose that I write a Perl script that parses this fwf file and creates a CSV file. Can I embed this within the read.csv.sql call, or can it only be a single statement? If you know the answer, please let me know. Otherwise, I will try a few things and report back the results.
>>>>>>>
>>>>>>> Thanks again.
>>>>>>> Satish
>>>>>>>
>>>>>>>
>>>>>>> -----Original Message-----
>>>>>>> From: jim holtman [mailto:jholtman at gmail.com]
>>>>>>> Sent: Saturday, February 06, 2010 6:16 AM
>>>>>>> To: Gabor Grothendieck
>>>>>>> Cc: Vadlamani, Satish {FLNA}; r-help at r-project.org
>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>
>>>>>>> In perl, the 'unpack' function makes it very easy to parse fixed-width data.
>>>>>>>
>>>>>>> On Fri, Feb 5, 2010 at 9:09 PM, Gabor Grothendieck
>>>>>>> <ggrothendieck at gmail.com> wrote:
>>>>>>>> Note that the filter= argument on read.csv.sql can be used to pass the
>>>>>>>> input through a filter written in perl, [g]awk or other language.
>>>>>>>> For example: read.csv.sql(..., filter = "gawk -f myfilter.awk")
>>>>>>>>
>>>>>>>> gawk has the FIELDWIDTHS variable for automatically parsing fixed
>>>>>>>> width fields (see, e.g.,
>>>>>>>> http://www.delorie.com/gnu/docs/gawk/gawk_44.html),
>>>>>>>> which makes this very easy, but perl or whatever you are most used
>>>>>>>> to would be fine too.
>>>>>>>>
>>>>>>>> On Fri, Feb 5, 2010 at 8:50 PM, Vadlamani, Satish {FLNA}
>>>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>>>> Hi Gabor:
>>>>>>>>> Thanks. My files are all in fixed width format, and there are a lot of them. It would take me some effort to convert them to CSV. I guess this cannot be avoided? I can write some Perl scripts to convert the fixed width format to CSV and then start with your suggestion. Could you let me know your thoughts on this approach?
>>>>>>>>> Satish
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> -----Original Message-----
>>>>>>>>> From: Gabor Grothendieck [mailto:ggrothendieck at gmail.com]
>>>>>>>>> Sent: Friday, February 05, 2010 5:16 PM
>>>>>>>>> To: Vadlamani, Satish {FLNA}
>>>>>>>>> Cc: r-help at r-project.org
>>>>>>>>> Subject: Re: [R] Reading large files
>>>>>>>>>
>>>>>>>>> If your problem is just how long it takes to load the file into R try
>>>>>>>>> read.csv.sql in the sqldf package.  A single read.csv.sql call can
>>>>>>>>> create an SQLite database and table layout for you, read the file into
>>>>>>>>> the database (without going through R so R can't slow this down),
>>>>>>>>> extract all or a portion into R based on the sql argument you give it
>>>>>>>>> and then remove the database.  See the examples on the home page:
>>>>>>>>> http://code.google.com/p/sqldf/#Example_13._read.csv.sql_and_read.csv2.sql
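>>>>>>>>>
>>>>>>>>> A minimal sketch (the file name and the where condition are made up):
>>>>>>>>>
>>>>>>>>> library(sqldf)
>>>>>>>>> DF <- read.csv.sql("myfile.csv", sql = "select * from file where V1 > 0")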
>>>>>>>>>
>>>>>>>>> On Fri, Feb 5, 2010 at 2:11 PM, Satish Vadlamani
>>>>>>>>> <SATISH.VADLAMANI at fritolay.com> wrote:
>>>>>>>>>>
>>>>>>>>>> Matthew:
>>>>>>>>>> If it is going to help, here is the explanation. I have an end state in
>>>>>>>>>> mind; it is given below under the "End State" header. In order to get there, I
>>>>>>>>>> need to start somewhere, right? I started with an 850 MB file and could not
>>>>>>>>>> load it in what I think is a reasonable time (I waited for an hour).
>>>>>>>>>>
>>>>>>>>>> There are references to 64-bit R. How will that help? It is a machine with
>>>>>>>>>> 4 GB of RAM, and there is no paging activity when loading the 850 MB file.
>>>>>>>>>>
>>>>>>>>>> I have seen other threads on the same types of questions. I did not see any
>>>>>>>>>> clear-cut answers, or mistakes that I could have been making in the process. If
>>>>>>>>>> I am missing something, please let me know. Thanks.
>>>>>>>>>> Satish
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> End State
>>>>>>>>>>> Satish wrote: "at one time I will need to load say 15GB into R"
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> -----
>>>>>>>>>> Satish Vadlamani
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> Jim Holtman
>>>>>>> Cincinnati, OH
>>>>>>> +1 513 646 9390
>>>>>>>
>>>>>>> What is the problem that you are trying to solve?
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>


