[R] filehash for big data

jim holtman jholtman at gmail.com
Sun Jan 2 23:08:00 CET 2011


Exactly how do you want to work with this data?  How do you want it
organized?  What is the structure of the file that you want to read
in?  What types of analysis are you going to do?  Does all the data
have to be in memory at once, or can you structure your analysis to
work in pieces and then aggregate the summary data?  Some information
is missing, and it is needed before anyone can propose a solution.
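
To illustrate what "do it in pieces and then aggregate" could look
like, here is a minimal sketch using base R only.  The file, its two
columns (id, x) and the chunk size are made up for the example; it
also assumes no quoted field contains an embedded newline.  Only the
running totals stay in memory, never the whole file.

```r
# Build a small stand-in file; the real one would have ~5M rows, 20 columns.
set.seed(1)
infile <- tempfile(fileext = ".csv")          # hypothetical input file
demo <- data.frame(id = sprintf("row%04d", 1:1000), x = rnorm(1000))
write.csv(demo, infile, row.names = FALSE)

chunk_rows <- 250                             # rows per piece; tune to memory
con <- file(infile, open = "r")
# Read the header line once; scan() strips the quotes around the names.
header <- scan(text = readLines(con, n = 1), what = "", sep = ",",
               quiet = TRUE)

n <- 0; sum_x <- 0; sum_x2 <- 0               # running totals across chunks
repeat {
  lines <- readLines(con, n = chunk_rows)     # next piece of the file
  if (length(lines) == 0) break               # nothing left to read
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    colClasses = c("character", "numeric"))
  n      <- n + nrow(chunk)
  sum_x  <- sum_x + sum(chunk$x)
  sum_x2 <- sum_x2 + sum(chunk$x^2)
}
close(con)

# Combine the per-chunk totals into overall summary statistics.
mean_x <- sum_x / n
var_x  <- (sum_x2 - n * mean_x^2) / (n - 1)
```

The sum-of-squares formula for the variance is the simplest thing to
accumulate, but it can lose precision when the mean is large relative
to the spread; a more careful running update may be preferable there.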

For example, do you need all the data in memory at one time?  (If it
were all doubles, a single copy would need 800MB: 5M rows x 20 columns
x 8 bytes.)  Are you running a 64-bit version of the operating system?
If so, I would suggest at least 4GB of real memory for R, so that there
is room for the multiple copies that some of the processing will
probably create.
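
The 800MB figure is just rows x columns x 8 bytes per double.  A quick
sketch of the arithmetic, with a sanity check on a small matrix (the
variable names are only for illustration):

```r
rows <- 5e6
cols <- 20
bytes_per_double <- 8

# Size of one all-double copy, in decimal megabytes.
total_mb <- rows * cols * bytes_per_double / 1e6
total_mb                                      # 800

# Sanity check on a small numeric matrix: 1000 x 20 doubles is 160000
# bytes of data, and object.size() reports that plus a small header.
small <- matrix(0, nrow = 1000, ncol = 20)
small_bytes <- as.numeric(object.size(small))
```

String columns do not follow this rule; their footprint depends on how
many distinct values there are and how long they are, so the estimate
is only a rough guide for mixed data.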

Why are you considering filehash rather than a relational database to
store and extract the data?  You can always read in a portion of the
data at a time and then transfer it to the appropriate storage type.
There is no reason for R to "choke" reading in the data if you have
structured the input/output files appropriately.
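
As one possible way to read a portion at a time and hand it to a
database, here is a sketch that appends chunks to an SQLite table.  It
assumes the DBI and RSQLite packages are installed; the file, the
table name "bigdata" and the columns (id, grp, x) are invented for the
example, and SQLite is just one choice of backend.

```r
library(DBI)                                  # assumes DBI + RSQLite installed

# Small stand-in for the real input file.
set.seed(1)
infile <- tempfile(fileext = ".csv")          # hypothetical input file
demo <- data.frame(id  = sprintf("row%04d", 1:1000),
                   grp = rep(c("a", "b"), 500),
                   x   = rnorm(1000))
write.csv(demo, infile, row.names = FALSE)

dbfile <- tempfile(fileext = ".sqlite")       # on-disk database file
db <- dbConnect(RSQLite::SQLite(), dbfile)

con <- file(infile, open = "r")
header <- scan(text = readLines(con, n = 1), what = "", sep = ",",
               quiet = TRUE)
repeat {
  lines <- readLines(con, n = 250)            # one portion of the file
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header,
                    colClasses = c("character", "character", "numeric"))
  # append = TRUE creates the table on the first call, then adds rows.
  dbWriteTable(db, "bigdata", chunk, append = TRUE)
}
close(con)

# Let the database do the aggregation; only the summary comes back to R.
res <- dbGetQuery(db, "SELECT grp, COUNT(*) AS n, AVG(x) AS mean_x
                       FROM bigdata GROUP BY grp ORDER BY grp")
dbDisconnect(db)
```

Once the data is in the table, later analyses can pull just the rows
and columns they need with a query instead of rereading the raw file.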

On Sun, Jan 2, 2011 at 2:14 PM, michael curran <michcurran at yahoo.com> wrote:
> Hi all,
>
> I am trying to use the filehash library to analyze a 5M by 20 matrix with both
> double and string data types.
>
>
> After consulting a few tutorials online, it seems as though one needs to first
> read the data into R; then create an R object; and then assign that object a
> location in my computer via filehash. It seems like the benefit of this is
> minimizing memory allocation when running subsequent analysis (e.g., descriptive
> statistics, regressions, etc.).
>
>
> My question is: what happens if R chokes when trying to read in the data (i.e.,
> step 1)? Is there another library I can use to get the data read in or,
> alternatively, am I misunderstanding the complete functionality of the filehash
> library and what it can do?
>
>
> Apologies if this is a basic question--usually I work with considerably smaller
> data frames and don't have much experience with memory issues and R.
>
>
> Thanks in advance for any advice/pointers.
>
> Best, Mike
>



-- 
Jim Holtman
Data Munger Guru

What is the problem that you are trying to solve?


