[R] Manage huge database

Barry Rowlingson b.rowlingson at lancaster.ac.uk
Mon Sep 22 12:53:27 CEST 2008


2008/9/22 José E. Lozano <lozalojo at jcyl.es>:

> Exactly, raw data, but a little more complex since all the 500,000 variables
> are in text format, so the width is around 2,500,000.

> Thanks, I will check. Right now I am reading the file line by line. It's
> time-consuming, but since I will do it only once, just to rearrange the data
> into smaller tables to query, it's OK.

  A language like Python, Perl, or even awk might be able to help you
slice your data up.
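
 For instance, a rough R sketch of that slice-and-dump step (the file
name, chunk size, and output naming here are my own inventions, not
anything from your setup):

con <- file("bigfile.txt", open = "r")
chunk <- 1
repeat {
  block <- readLines(con, n = 1000)   # read 1000 rows at a time
  if (length(block) == 0) break       # stop at end of file
  writeLines(block, sprintf("chunk%03d.txt", chunk))
  chunk <- chunk + 1
}
close(con)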

> It is genetic DNA data (individuals genotyped), hence the large number of
> columns to analyze.

 So is each line just ACCGTATAT, etc.?

 If you have fixed-width fields in a file, so that every line is the
same length, then you can use random-access methods to get to a
particular value: the byte offset is the record length times (row
number - 1), plus the (zero-based) offset of the column within the
line. In R you can do this with seek() on a connection. This should be
fast because it jumps straight to a byte position instead of having to
scan through all the comma-separated values. The only problem comes
when your data doesn't quite conform, and you can end up reading junk.
So before relying on this, it's a good idea to check your dataset to
make sure the lines and fields really are where you expect.
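
 For instance, a quick check like this (the file name is just a
placeholder) confirms that the first hundred lines all have the same
width:

widths <- nchar(readLines("yourdata.txt", n = 100))  # placeholder name
stopifnot(length(unique(widths)) == 1)  # every sampled line same width?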

Example with dummy.dna:

aaaccctttgggaaa
gattacagattacaa
aaaaaaacccccggg
gggggtgggggtggg
aaaaaaaaaaccccc

 Each line has 15 bases, and on my OS there's one additional invisible
character (a newline) to mark the line end. Windows uses two, but your
data might not be in Windows format... So anyway, my record length is
16 bytes. Hence, to get a four-character slice starting at byte offset
7 within each line (i.e. from the eighth column) for some rows:

> dna <- file("dummy.dna")
> open(dna, open = "rb")
> for (r in 2:4) { seek(dna, 7 + (r - 1) * 16); print(readChar(dna, 4)) }
[1] "gatt"
[1] "cccc"
[1] "gggg"
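
 (Remember to close(dna) when you're done with the connection.) If
you're not sure how many bytes the line ending takes, one way to check
(my own sketch, not from the original post) is to find the first
newline in the raw bytes; its position is the record length:

first_bytes <- readBin("dummy.dna", what = "raw", n = 100)
rec_len <- which(first_bytes == charToRaw("\n"))[1]
rec_len   # 16 for this file; a CRLF (Windows) file would give 17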

 The speed of this should be independent of the size of your data file.
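
 To make this reusable, here's a small wrapper (my own sketch; the
function name and arguments are my inventions, not base R):

read_field <- function(path, rows, offset, width, rec_len) {
  # Pull one fixed-width field from each requested (1-based) row
  con <- file(path, open = "rb")
  on.exit(close(con))
  vapply(rows, function(r) {
    seek(con, (r - 1) * rec_len + offset)  # zero-based byte offset
    readChar(con, width)
  }, character(1))
}

read_field("dummy.dna", 2:4, offset = 7, width = 4, rec_len = 16)
# [1] "gatt" "cccc" "gggg"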

Barry


