[BioC] Fastest way to read CSV files

Misha Kapushesky ostolop at ebi.ac.uk
Fri Aug 20 16:01:37 CEST 2010


Hi,

Martin is absolutely right. For our data analysis needs here we use NetCDF 
extensively. It's about as fast as direct binary reads and is portable, 
without the headache of worrying about the many nitty-gritty details.

--Misha

> data) and robust. Also SQL, NetCDF and friends which will be portable /
> interoperable.
>
> Depending on use case, it can be tricky to get good timings on these
> operations -- your OS has probably cached those values when written, so
> input seems very fast, whereas when they've been removed from cache the
> first access could be considerably slower (order of magnitude is my
> casual impression).
>
> Martin
>>
>>
>> best,
>> Stijn
>>
>>
>> On Fri, Aug 20, 2010 at 09:45:14AM +0100, Misha Kapushesky wrote:
>>> Hi,
>>>
>>> If you did do this in binary, we'd see the following:
>>>
>>>> x <- matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
>>>> con <- file("test.bin","wb"); writeBin(as.vector(x),con); close(con)
>>>
>>>> system.time({zz <- readBin("test.bin",numeric(),20*1700000);
>>>> dim(zz) <- c(20,1700000)})
>>>    user  system elapsed
>>>   0.171   0.574   0.751
>>>
>>> So, less than a second to read this in.
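The binary round trip described above can be sketched in a self-contained form. This is only an illustration under stated assumptions: the matrix is much smaller than the 20 x 1.7e6 one in the thread, the file goes to a temporary path, and the names `nr`, `nc`, `path`, `roundtrip_ok` and `bytes_on_disk` are made up for the example. The connection is closed explicitly after writing so the data is flushed before it is read back.

```r
# Small stand-in for the 20 x 1.7e6 matrix used in the thread (size is an assumption)
nr <- 20L
nc <- 1000L
x <- matrix(floor(runif(nr * nc) * 1000), nrow = nr)

path <- tempfile(fileext = ".bin")

# Write the values (column-major order) as 8-byte doubles, then close
# the connection so the data is flushed to disk
con <- file(path, "wb")
writeBin(as.vector(x), con)
close(con)

# Read the same number of doubles back and restore the dimensions
con <- file(path, "rb")
zz <- readBin(con, what = numeric(), n = nr * nc)
close(con)
dim(zz) <- c(nr, nc)

roundtrip_ok <- identical(zz, x)   # should be TRUE if the round trip is exact
bytes_on_disk <- file.size(path)   # 8 bytes per value for doubles
```

Wrapping the `readBin()` part in `system.time()` on a full-size matrix gives timings comparable to the ones quoted here, subject to the OS caching caveat Martin mentions.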
>>>
>>> If you were working in, say, Perl, you could write data like this as
>>> follows:
>>>
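>>> open(M, ">", "test2.bin") or die "cannot open test2.bin: $!";
>>> binmode M;   # raw bytes, no newline translation
>>> for($i=0; $i<20*1700000; $i++) {
>>>   print M pack('l',$i);   # 'l' = signed 32-bit int, native byte order
>>> }
>>> close M;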
>>>
>>> and read that file into R as:
>>>
>>>> system.time({e <- readBin("test2.bin",integer(),20*1700000,size=4);
>>>> dim(e) <- c(20,1700000)})
>>>    user  system elapsed
>>>   0.093   0.273   0.370
>>>
>>> Even faster, presumably because 4-byte integers make the file half the
>>> size of the 8-byte doubles version.
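When the file comes from another program, it helps to pin down the element size and byte order rather than rely on defaults. The sketch below assumes 4-byte little-endian integers (typical for a native-order `pack` on x86, but an assumption all the same). `read_int_matrix` is a hypothetical helper, not part of R, and the input file is generated in R here as a stand-in for the externally written one.

```r
# Hypothetical helper: read an nrow x ncol matrix of 4-byte integers
read_int_matrix <- function(path, nrow, ncol, endian = "little") {
  n <- nrow * ncol
  # Cheap sanity check: the file should hold exactly n 4-byte values
  stopifnot(file.size(path) == 4 * n)
  con <- file(path, "rb")
  on.exit(close(con))
  v <- readBin(con, what = integer(), n = n, size = 4L, endian = endian)
  matrix(v, nrow = nrow, ncol = ncol)
}

# Stand-in for the externally written file: 0, 1, 2, ... as little-endian int32
path <- tempfile(fileext = ".bin")
con <- file(path, "wb")
writeBin(0:(20L * 1000L - 1L), con, size = 4L, endian = "little")
close(con)

e <- read_int_matrix(path, nrow = 20L, ncol = 1000L)
```

The file-size check is a cheap way to catch a mismatch between the writer's element size and what the reader expects before any values are misinterpreted.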
>>>
>>> --Misha
>>>
>>> On Thu, 19 Aug 2010, Sean Davis wrote:
>>>
>>>> On Thu, Aug 19, 2010 at 7:31 PM, Stijn van Dongen <stijn at ebi.ac.uk> wrote:
>>>>
>>>>>
>>>>> This piqued my interest, as for really large datasets it can in general
>>>>> speed up things greatly to use binary formats (1.5 million does not sound
>>>>> *that* big to me). I have no experience with this in R, but a little
>>>>> search brought up e.g. readBin(). So it might be possible, especially if
>>>>> your data is quite simple (all integers), to first convert your data
>>>>> externally to a binary format (using perl or python or ..) and then read
>>>>> it with readBin().
>>>>>
>>>>> Disclaimer: Quite likely a random thought from an ill-informed bystander.
>>>>>
>>>>>
>>>> Binary is always a good thought, but reading into another language to write
>>>> binary to load into R is probably not going to be a big time saver over
>>>> using R's capabilities.
>>>>
>>>>> x=matrix(floor(runif(1.7e6 * 20)*1000),nr=20)
>>>>> dim(x)
>>>> [1]      20 1700000
>>>>> write.table(x,file='abc.txt',sep="\t",col.names=FALSE,row.names=FALSE)
>>>>> system.time((y = matrix(scan('abc.txt',what=integer()),nr=20,byrow=TRUE)))
>>>> Read 34000000 items
>>>>  user  system elapsed
>>>> 17.555   0.685  18.258
>>>>> dim(y)
>>>> [1]      20 1700000
>>>>
>>>> So, a 1.7 million column by 20 row table of integers can be read in about
>>>> 18 seconds using scan, just to give a rough sketch of profiling results.
>>>> You might be able to get close using read.table and setting column classes
>>>> appropriately, also.
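The two text-based approaches mentioned here, scan and read.table with column classes, can be compared in a small sketch. The sizes, the temporary file and the variable names are assumptions for the example. Two details matter: `what = integer()` asks `scan()` for integers (a quoted type name would make it return character strings), and `scan()` returns values in file order, row by row, so `byrow = TRUE` is needed to recover the original layout.

```r
# Small stand-in for the 20 x 1.7e6 table in the thread (size is an assumption)
nr <- 20L
nc <- 1000L
x <- matrix(as.integer(floor(runif(nr * nc) * 1000)), nrow = nr)

path <- tempfile(fileext = ".txt")
write.table(x, file = path, sep = "\t", col.names = FALSE, row.names = FALSE)

# scan() returns values in file order (row by row), hence byrow = TRUE
y_scan <- matrix(scan(path, what = integer(), sep = "\t", quiet = TRUE),
                 nrow = nr, byrow = TRUE)

# read.table() with every column declared as integer (colClasses is recycled)
y_tab <- unname(as.matrix(read.table(path, sep = "\t", colClasses = "integer")))

scan_matches <- all(y_scan == x)
table_matches <- all(y_tab == x)
```

Wrapping either read in `system.time()` on the full-size file shows which is faster on a given machine.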
>>>>
>>>> Sean
>>>>
>>>>
>>>>> best,
>>>>> Stijn
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On Thu, Aug 19, 2010 at 05:43:22PM -0400, Sean Davis wrote:
>>>>>> Try using scan and then rearrange the resulting vector.
>>>>>>
>>>>>> Sean
>>>>>>
>>>>>> On Aug 19, 2010 5:32 PM, "Gaston Fiore" <gaston.fiore at gmail.com> wrote:
>>>>>>
>>>>>> Hello everyone,
>>>>>>
>>>>>> Is there a faster method to read CSV files than the read.csv function?
>>>>>> I've CSV files containing a rectangular array with about 17 rows and 1.5
>>>>>> million columns with integer entries, and read.csv is being too slow for
>>>>>> my needs.
>>>>>>
>>>>>> Thanks for your help,
>>>>>>
>>>>>> -Gaston
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at stat.math.ethz.ch
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>> --
>>>>> Stijn van Dongen         >8<        -o)   O<  forename pronunciation:
>>>>> [Stan]
>>>>> EMBL-EBI                            /\\   Tel: +44-(0)1223-492675
>>>>> Hinxton, Cambridge, CB10 1SD, UK   _\_/   http://micans.org/stijn
>>>>>
>>>>
>>
>
>
> -- 
> Martin Morgan
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
>
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793
>