[R] Manage huge database

jim holtman jholtman at gmail.com
Mon Sep 22 12:50:06 CEST 2008


What are you going to do with the data once you have read it in?  Are
all the data items numeric?  If they are numeric, you would need at
least 8GB to hold one copy and probably a machine with 32GB if you
wanted to do any manipulation on the data.
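
The 8GB figure can be checked with a little arithmetic. This is only a
rough sketch: it assumes the 2000 x 500000 grid mentioned in the quoted
message below and that every value is stored as an 8-byte double (the
variable names are made up for illustration).

```r
# Dimensions taken from the quoted message: 2000 rows by 500000 columns.
n_rows <- 2000
n_cols <- 500000
bytes_per_double <- 8            # R stores numeric values as 8-byte doubles

total_bytes <- n_rows * n_cols * bytes_per_double
total_gb    <- total_bytes / 1e9   # decimal gigabytes
total_gib   <- total_bytes / 2^30  # binary gibibytes, slightly smaller number

cat("One copy needs about", total_gb, "GB\n")
```

Any operation that makes temporary copies of the object multiplies that
figure, which is why much more RAM than one copy is needed in practice.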

You can use a 'connection' and 'scan' to read the data in chunks and
then store it in a more accessible format.  A lot would depend on your
answer to my first question.
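
Something along these lines might work for the chunked read. It is a
minimal sketch, not tested on a file of your size: it assumes the values
are numeric and comma-separated, and it builds a tiny stand-in file so it
runs on its own. The names (infile, chunk_rows, chunk_files) and the choice
of saveRDS for the "more accessible format" are just for illustration.

```r
# Build a small stand-in file so the sketch runs on its own
# (hypothetical data: 10 rows of 5 comma-separated numbers).
n_cols <- 5
demo <- matrix(seq_len(50), nrow = 10, ncol = n_cols, byrow = TRUE)
infile <- tempfile(fileext = ".csv")
write.table(demo, infile, sep = ",", row.names = FALSE, col.names = FALSE)

chunk_rows <- 4                      # rows to read per pass (tune to memory)
con <- file(infile, open = "r")      # an open connection keeps its position
chunk_files <- character(0)

repeat {
  # Read the next chunk_rows lines from where the last call stopped.
  vals <- scan(con, what = numeric(), sep = ",",
               nlines = chunk_rows, quiet = TRUE)
  if (length(vals) == 0) break       # nothing left to read
  chunk <- matrix(vals, ncol = n_cols, byrow = TRUE)
  out <- tempfile(fileext = ".rds")  # one binary file per chunk
  saveRDS(chunk, out)
  chunk_files <- c(chunk_files, out)
}
close(con)
```

Each saved chunk can later be loaded on its own with readRDS(), so only
the piece being worked on has to be in memory.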

On Mon, Sep 22, 2008 at 6:26 AM, José E. Lozano <lozalojo at jcyl.es> wrote:
>
> > Maybe you've not lurked on R-help for long enough :) Apologies!
>
> Probably.
>
> > So, how much "design" is in this data? If none, and what you've
> > basically got is a 2000x500000 grid of numbers, then maybe a more raw
>
> Exactly, raw data, but a little more complex since all the 500000 variables
> are in text format, so the width is around 2,500,000.
>
> > http://cran.r-project.org/web/packages/RNetCDF/index.html
> > http://cran.r-project.org/web/packages/hdf5/index.html
>
> Thanks, I will check. Right now I am reading the file line by line. It's
> time-consuming, but since I will do it only once, just to rearrange the data
> into smaller tables to query, it's OK.
>
> > Thinking back to your 4GB file with 1,000,000,000 entries, that's
> > only 3 bytes per entry (+1 for the comma). What is this data? There
> > may be more efficient ways to handle it.
>
> It is genetic DNA data (genotyped individuals), hence the large number of
> columns to analyze.
>
> Best Regards,
> Jose Lozano
> ------------------------------------------
> Jose E. Lozano Alonso
> Observatorio de Salud Pública.
> Direccion General de Salud Pública e I+D+I.
> Junta de Castilla y León.
> Direccion: Paseo de Zorrilla, nº1. Despacho 3103. CP 47071. Valladolid.



--
Jim Holtman
Cincinnati, OH
+1 513 646 9390

What is the problem that you are trying to solve?


