[BioC] Fastest way to read CSV files

Fri Aug 20 15:26:37 CEST 2010

Thanks Misha, that's very instructive.
I'd like to add that this can be parametrized: the dimensions of the object
can be written to the file and read back as well. In fact, by writing some kind
of 'cookie' number first, code could recognize what *type* of data it needs to
read. In the example below, however, only the dimensions are written to the
file first and then read back. When reading, the dimensions are no longer
hardcoded but read from the same connection.

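   # 20 x 17000 matrix of whole numbers, stored as doubles
   x <- matrix(floor(runif(1.7e4 * 20)*1000), nrow=20)
   cn <- file("test.bin","wb")
   writeBin(dim(x), cn)          # header: two integers (rows, columns)
   writeBin(as.vector(x), cn)    # payload: the values as doubles
   close(cn)

   cn <- file("test.bin", "rb")
   dims <- readBin(cn, integer(), 2)
   x2 <- matrix(readBin(cn, numeric(), dims[1] * dims[2]), nrow=dims[1], ncol=dims[2])
   close(cn)

   sum(x != x2)                  # number of mismatches; 0 if the round trip is exact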
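
The 'cookie' idea mentioned above could be sketched roughly as follows. This
is only an illustration under assumptions: the file layout, the MAGIC value,
the type codes and the function names write_matrix_bin() and
read_matrix_bin() are all made up here and are not part of R or of any
standard format.

```r
# Hypothetical file layout (an assumption for illustration, not a standard):
#   1 integer   magic 'cookie' identifying the format
#   1 integer   type code: 1 = integer payload, 2 = double payload
#   2 integers  dimensions (rows, columns)
#   payload     the matrix values in column-major order
MAGIC <- 20100820L   # arbitrary cookie value, chosen for this sketch

write_matrix_bin <- function(x, path) {
  stopifnot(is.matrix(x), is.numeric(x))
  type_code <- if (is.integer(x)) 1L else 2L
  cn <- file(path, "wb")
  on.exit(close(cn))
  writeBin(c(MAGIC, type_code, dim(x)), cn)   # four-integer header
  writeBin(as.vector(x), cn)                  # payload in its own storage type
  invisible(path)
}

read_matrix_bin <- function(path) {
  cn <- file(path, "rb")
  on.exit(close(cn))
  header <- readBin(cn, integer(), 4)
  if (length(header) != 4 || header[1] != MAGIC)
    stop("not a file written by write_matrix_bin(): ", path)
  # The type code decides what kind of values to read next.
  what <- switch(as.character(header[2]),
                 "1" = integer(),
                 "2" = numeric(),
                 stop("unknown type code: ", header[2]))
  dims <- header[3:4]
  matrix(readBin(cn, what, prod(dims)), nrow = dims[1], ncol = dims[2])
}

# Round trip for an integer matrix and a double matrix.
path <- tempfile(fileext = ".bin")
xi <- matrix(1:6, nrow = 2)
xd <- matrix(c(0.5, 1.5, 2.5, 3.5), nrow = 2)
write_matrix_bin(xi, path); yi <- read_matrix_bin(path)
write_matrix_bin(xd, path); yd <- read_matrix_bin(path)
```

With a type code in the header, integer data can be stored in 4 bytes per
value instead of 8, and the reader does not need to be told which it is.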

A hex dump of the file test.bin gives this for the first line:

        <----integer 1 ---> <--- integer 2 --->
0000000 0014 0000 4268 0000 0000 0000 c000 4070

Indeed, hexadecimal 0x14 == 20 and hexadecimal 0x4268 == 17000;
this is on a little-endian machine.
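
Byte order matters if the file is moved between machines. Both writeBin()
and readBin() take an 'endian' argument, so the order can be fixed explicitly
rather than left at the platform default. The small sketch below works on raw
vectors instead of a file; the variable names are just for illustration. It
also decodes the remaining 8 bytes of the dump line above as a double,
assuming the dump shows the bytes grouped into byte-swapped two-byte words
(so that '0014' stands for the bytes 14 00).

```r
# Encode the two dimensions explicitly as little-endian 4-byte integers.
header <- writeBin(c(20L, 17000L), raw(), size = 4, endian = "little")
print(header)   # the raw bytes, least significant byte first

# Decoding with the matching byte order recovers the dimensions.
dims_le <- readBin(header, integer(), n = 2, size = 4, endian = "little")

# Decoding with the wrong byte order silently gives different numbers.
dims_be <- readBin(header, integer(), n = 2, size = 4, endian = "big")

# The last 8 bytes of the dump line, read as one little-endian double.
first_bytes <- as.raw(c(0x00, 0x00, 0x00, 0x00, 0x00, 0xc0, 0x70, 0x40))
first_value <- readBin(first_bytes, numeric(), n = 1, size = 8,
                       endian = "little")
```

Writing with a fixed endian= value and reading with the same value should
make the file readable on either kind of machine.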

best,
Stijn

On Fri, Aug 20, 2010 at 09:45:14AM +0100, Misha Kapushesky wrote:
> Hi,
> 
> If you did do this in binary, we'd see the following:
> 
> >x <- matrix(floor(runif(1.7e6 * 20)*1000),nrow=20)
> >cn <- file("test.bin","wb"); writeBin(as.vector(x),cn); close(cn)
> 
> >system.time({zz <- readBin("test.bin",numeric(),20*1700000); 
> >dim(zz) <- c(20,1700000)})
>    user  system elapsed
>   0.171   0.574   0.751
> 
> So, less than a second to read this in.
> 
> If you were working in, say, Perl, you could write data like this as 
> follows:
> 
> open M, ">test2.bin";
> for($i=0; $i<20*1700000; $i++) {
>   print M pack('i',$i);
> }
> close M;
> 
> and read that file into R as:
> 
> >system.time({e <- readBin("test2.bin",integer(),20*1700000,size=4); 
> dim(e) <- c(20,1700000)})
>    user  system elapsed
>   0.093   0.273   0.370
> 
> Even faster, specifying explicitly the int size.
> 
> --Misha
> 
> On Thu, 19 Aug 2010, Sean Davis wrote:
> 
> >On Thu, Aug 19, 2010 at 7:31 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
> >
> >>
> >>This piqued my interest, as for really large datasets it can in general
> >>speed
> >>up things greatly to use binary formats (1.5 million does not sound *that*
> >>big
> >>to me). I have no experience with this in R, but a little search brought 
> >>up
> >>e.g. readBin(). So it might be possible, especially if your data is quite
> >>simple (all integers), to first convert your data externally to a binary
> >>format (using perl or python or ..) and then read it with readBin().
> >>
> >>Disclaimer: Quite likely a random thought from an ill-informed bystander.
> >>
> >>
> >Binary is always a good thought, but reading into another language to write
> >binary to load into R is probably not going to be a big time saver over
> >using R's capabilities.
> >
> >>x=matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
> >>dim(x)
> >[1]      20 1700000
> >>write.table(x,file='abc.txt',sep="\t",col.names=FALSE,row.names=FALSE)
> >>system.time((y = matrix(scan('abc.txt',what='integer'),nr=20)))
> >Read 34000000 items
> >  user  system elapsed
> >17.555   0.685  18.258
> >>dim(y)
> >[1]      20 1700000
> >
> >So, a 1.7 million column by 20 row table of integers can be read in about 
> >18
> >seconds using scan, just to give a rough sketch of profiling results.  You
> >might be able to get close using read.table and setting column classes
> >appropriately, also.
> >
> >Sean
> >
> >
> >>best,
> >>Stijn
> >>
> >>
> >>
> >>
> >>On Thu, Aug 19, 2010 at 05:43:22PM -0400, Sean Davis wrote:
> >>>Try using scan and then rearrange the resulting vector.
> >>>
> >>>Sean
> >>>
> >>>On Aug 19, 2010 5:32 PM, "Gaston Fiore" <gaston.fiore at gmail.com> wrote:
> >>>
> >>>Hello everyone,
> >>>
> >>>Is there a faster method to read CSV files than the read.csv function?
> >>I've
> >>>CSV files containing a rectangular array with about 17 rows and 1.5
> >>million
> >>>columns with integer entries, and read.csv is being too slow for my
> >>needs.
> >>>
> >>>Thanks for your help,
> >>>
> >>>-Gaston
> >>>
> >>>_______________________________________________
> >>>Bioconductor mailing list
> >>>Bioconductor at stat.math.ethz.ch
> >>>https://stat.ethz.ch/mailman/listinfo/bioconductor
> >>>Search the archives:
> >>>http://news.gmane.org/gmane.science.biology.informatics.conductor
> >>
> >>--
> >>Stijn van Dongen         >8<        -o)   O<  forename pronunciation:
> >>[Stan]
> >>EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
> >>Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn
> >>

-- 
Stijn van Dongen         >8<        -o)   O<  forename pronunciation: [Stan]
EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn