[R] big data?

Spencer Graves spencer.graves at structuremonitoring.com
Thu Aug 7 19:49:00 CEST 2014


Correcting a typo (400 MB, not GB; thanks to David Winsemius for 
reporting it).  Spencer


###############


       Thanks to all who replied.  For the record, I will summarize here 
what I tried and what I learned:


       Mike Harwood suggested the ff package.  David Winsemius suggested 
data.table and colbycol.  Peter Langfelder suggested sqldf.


       sqldf::read.csv.sql allowed me to write an SQL command to read a 
column or a subset of the rows of a 400 MB tab-delimited file in roughly 
a minute on a 2.3 GHz dual-core machine running Windows 7 with 8 GB RAM. 
It also read a column of a 1.3 GB file in 4 minutes.  The documentation 
was sufficient for me to get what I wanted with minimal effort.
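       In case it helps someone searching the archives, the call was 
roughly of the following form.  This is a sketch only:  the file name, 
column names, and filter value are placeholders, not my actual data.

library(sqldf)

## Sketch:  'big_file.txt', 'field1', and 'field3' are placeholder names.
## read.csv.sql loads the file into a temporary SQLite database (on disk
## by default) and runs the query there; in the SQL the data are referred
## to as 'file', and only the query result comes back into R.
subsetDat <- read.csv.sql("big_file.txt",
    sql = "select field3 from file where field1 = 'A'",
    sep = "\t")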


       If I needed to work with these data regularly, I might experiment 
with colbycol and ff:  the documentation suggests that these packages 
could give quicker answers to routine tasks after some preprocessing, 
roughly as sketched below.  Of course, I could also do the preprocessing 
manually with sqldf.
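       A minimal sketch of the ff route, under the same caveat that the 
file and column names are placeholders and I have not timed this myself:

library(ff)

## Sketch:  read.table.ffdf parses the file in chunks and stores each
## column in a disk-backed ff object, so later queries need not re-read
## the whole file.
bigDat <- read.table.ffdf(file = "big_file.txt", header = TRUE, sep = "\t")

## Pull a single column into RAM only when it is actually needed:
field3 <- bigDat$field3[]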


       Thanks, again.
       Spencer


On 8/6/2014 9:39 AM, Mike Harwood wrote:
> The read.table.ffdf function in the ff package can read in delimited files
> and store them to disk as individual columns.  The ffbase package provides
> additional data management and analytic functionality.  I have used these
> packages on 15 Gb files of 18 million rows and 250 columns.
>
>
> On Tuesday, August 5, 2014 1:39:03 PM UTC-5, David Winsemius wrote:
>>
>> On Aug 5, 2014, at 10:20 AM, Spencer Graves wrote:
>>
>>>       What tools do you like for working with tab delimited text files up
>> to 1.5 GB (under Windows 7 with 8 GB RAM)?
>>
>> ?data.table::fread
>>
>>>       Standard tools for smaller data sometimes grab all the available
>> RAM, after which CPU usage drops to 3% ;-)
>>>
>>>       The "bigmemory" project won the 2010 John Chambers Award but "is
>> not available (for R version 3.1.0)".
>>>
>>>       findFn("big data", 999) downloaded 961 links in 437 packages. That
>> contains tools for data in PostgreSQL and other formats, but I couldn't find
>> anything for large tab delimited text files.
>>>
>>>       Absent a better idea, I plan to write a function getField to
>> extract a specific field from the data, then use that to split the data
>> into 4 smaller files, which I think should be small enough that I can do
>> what I want.
>>
>> There is the colbycol package with which I have no experience, but I
>> understand it is designed to partition data into column sized objects.
>> #--- from its help file-----
>> cbc.get.col {colbycol}        R Documentation
>> Reads a single column from the original file into memory
>>
>> Description
>>
>> Function cbc.read.table reads a file, stores it column by column in disk
>> file and creates a colbycol object. Function cbc.get.col queries this object
>> and returns a single column.
>>
>>>       Thanks,
>>>       Spencer
>>>
>>> ______________________________________________
>>> R-h... at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>> David Winsemius
>> Alameda, CA, USA
>>
>> ______________________________________________
>> R-h... at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>


