[R] Another big data size problem

Henrik Bengtsson hb at maths.lth.se
Wed Jul 28 17:04:57 CEST 2004


> -----Original Message-----
> From: r-help-bounces at stat.math.ethz.ch 
> [mailto:r-help-bounces at stat.math.ethz.ch] On Behalf Of 
> Federico Gherardini
> Sent: Wednesday, July 28, 2004 5:26 PM
> To: r-help at stat.math.ethz.ch
> Subject: Re: [R] Another big data size problem
> 
> 
> On Wed, 28 Jul 2004 13:28:20 +0100
> Ernesto Jardim <ernesto at ipimar.pt> wrote:
> 
> 
> > Hi,
> > 
> > When you're writing a table to MySQL you have to be careful if the
> > table is created by RMySQL. The field definitions may not be the most
> > adequate and there will be no indexes in your table, which makes the
> > queries _very_ slow.
> > 
> So, if I understood correctly, if you want to use SQL you'll have to load
> the table into MySQL directly, without using R at all, and then use RMySQL
> to read the elements into R?
> 
> Uwe Ligges <ligges at statistik.uni-dortmund.de> wrote:
> 
> > Note that it is better to initialize the object to full size before
> > inserting -- rather than using rbind() and friends, which is indeed slow
> > since it needs to re-allocate much memory at each step.
> 
> Do you mean something like this?
> 
> tab <- matrix(rep(0, 1227 * 20000), 1227, 20000, byrow = TRUE)
> 
> for(i in 0:num.lines)
> 	tab[i + 1,] <- scan(file=fh, nlines=1, what="PS", skip = i)

It is better to open a file connection, keep it open during the loop, and
then close it afterwards. Something like

  # Pre-allocate the full matrix once; matrix(0, ...) is enough, no rep() needed
  tab <- matrix(0, nrow=1227, ncol=20000)
  # Open the connection once, read one line per iteration, then close it
  fh <- file(filename, open="r")
  for (i in 0:num.lines)
    tab[i + 1, ] <- scan(file=fh, nlines=1)
  close(fh)

As you have done it, the file is re-opened in each iteration of the loop:
scan() starts reading from the beginning, parses all lines to skip 'i' lines,
and only then reads one line. This is done num.lines+1 times!
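
If all the values are numeric you may not even need the loop: a single scan()
call can read the whole file, and the result can be reshaped into a matrix.
A rough sketch, assuming 'filename' holds 1227 lines of 20000 whitespace-
separated numbers:

  # Read all 1227 * 20000 numbers in one pass over the file
  x <- scan(filename)
  # One file line per matrix row, hence byrow=TRUE
  tab <- matrix(x, nrow=1227, ncol=20000, byrow=TRUE)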

Anyway, I think you should also read the help page for scan(). What do you
want with the argument what="PS"? 'what' does not specify the name of a
field/column to be read; it specifies, by an example value, the type of data
to read, so what="PS" just tells scan() to read everything as character.
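
Roughly, 'what' takes an example value of the type you want to read.
A small sketch, assuming 'fh' is an open connection as above:

  x <- scan(fh, what=double(), nlines=1)     # read one line as numbers
  s <- scan(fh, what=character(), nlines=1)  # read one line as strings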

> The above doesn't get very far either... it seems that, once 
> it has created the table, it becomes so slow that it's 
> unusable. I'll have to try this with more RAM by the way.

My suggestion is that you try read.table() with the data types of the columns
specified via the vector argument 'colClasses'. This way you can help R by
specifying that, say, column 3 is an integer (which takes half the memory of
a double) and that columns 6-10 are doubles. Unfortunately, you cannot tell
read.table() to skip the columns you are not interested in, which in your
case would help a lot. To do that, you have to use scan(), which read.table()
uses internally. In scan(), 'what' plays a role similar to 'colClasses' *and*
if you specify 'what' as a list you can tell scan() to skip columns by
setting the corresponding list elements to NULL, e.g. what=list(integer(0),
integer(0), NULL, double(0), character(0)). I think you can get pretty far
doing this!
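
For example, with a hypothetical five-column file (the column types here are
made up just for illustration):

  # read.table(): one class per column via 'colClasses'
  df <- read.table(filename,
                   colClasses=c("integer", "integer", "character",
                                "numeric", "character"))

  # scan(): one example value per column; NULL skips the third field.
  # The result 'cols' is a list with one component per entry of 'what'.
  cols <- scan(filename, what=list(integer(0), integer(0), NULL,
                                   double(0), character(0)))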
 
> Cheers,
> 
> fede

Good luck!

Henrik Bengtsson



