[R] How to save very large matrix?

Hervé Pagès hpages at fhcrc.org
Wed Oct 30 00:14:13 CET 2013


Hi Petar,

If you're going to share this matrix across R sessions, save()/load() is
probably one of your best options.
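For reference, a minimal sketch of the save()/load() round trip, using a small stand-in matrix and a temporary file (both are assumptions made for illustration, not your real data):

```r
# Small stand-in for the real matrix (illustrative size only).
m <- matrix(runif(20), nrow = 4)

# Hypothetical location; use a real path to share across sessions.
f <- tempfile(fileext = ".RData")

save(m, file = f)   # writes the object 'm' in R's binary format
rm(m)               # simulate starting a fresh session

load(f)             # restores the object under its original name, 'm'
dim(m)
```

Note that load() restores the object under the name it was saved with, so it will overwrite any existing object called 'm' in the workspace.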

Otherwise, you could try the rhdf5 package from Bioconductor:

1. Install the package with:

      source("http://bioconductor.org/biocLite.R")
      biocLite("rhdf5")

2. Then:

      library(rhdf5)

      h5createFile("my_big_matrix.h5")

      # write a matrix
      my_big_matrix <- matrix(runif(5000*10000), nrow=5000)
      attr(my_big_matrix, "scale") <- "liter"
      h5write(my_big_matrix, "my_big_matrix.h5", "my_big_matrix")  # takes 1 min.
      # file size on disk is 248M

      # read a matrix
      my_big_matrix <- h5read("my_big_matrix.h5", "my_big_matrix")  # takes 7.4 sec.

Multiply the above numbers (obtained on a laptop with a traditional
hard drive) by 100 for your monster matrix, or less if you have super
fast I/O.
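As a rough back-of-the-envelope version of that scaling (a sketch only: it assumes time and file size grow linearly with the number of elements, which may not hold on your hardware):

```r
# Figures measured above for the 5000 x 10000 test matrix.
write_sec <- 60     # h5write: about 1 min
read_sec  <- 7.4    # h5read
file_mb   <- 248    # size on disk

scale_factor <- 100 # factor suggested above for the full matrix

# Scaled estimates: minutes to write, minutes to read, GB on disk.
est <- c(write_min = write_sec * scale_factor / 60,
         read_min  = read_sec  * scale_factor / 60,
         file_gb   = file_mb   * scale_factor / 1024)
print(round(est, 1))
```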

Two advantages of using the HDF5 format: (1) it should not be too hard
to use the HDF5 C library in the C code you're going to use to read the
matrix, and (2) my understanding is that HDF5 is good at letting you
access arbitrary slices of the data, so chunk-processing should be easy
and efficient:

   http://www.hdfgroup.org/HDF5/
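On the R side, slice access could look something like the sketch below. It uses the 'index' argument of h5read() to pull blocks of rows from the file one at a time. The small matrix, the temporary file, the dataset name "m" and the chunk size are all assumptions chosen so the example runs quickly:

```r
library(rhdf5)

# Small stand-in matrix written to a temporary HDF5 file (hypothetical names).
h5file <- tempfile(fileext = ".h5")
m <- matrix(runif(20 * 6), nrow = 20)
h5createFile(h5file)
h5write(m, h5file, "m")

# Process the dataset in blocks of 5 rows without loading it all at once.
chunk_size <- 5
starts <- seq(1, nrow(m), by = chunk_size)
col_sums <- numeric(ncol(m))
for (s in starts) {
  rows <- s:min(s + chunk_size - 1, nrow(m))
  # index = list(rows, NULL): the given rows, all columns
  block <- h5read(h5file, "m", index = list(rows, NULL))
  col_sums <- col_sums + colSums(block)
}
```

Here the per-block work is just a running column sum; for the real matrix the loop body could instead hand each block to whatever processing is needed, so that only one block is in memory at a time.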

Cheers,
H.


On 10/29/2013 02:34 PM, Petar Milin wrote:
> Hello,
>
> On Oct 29, 2013, at 10:16 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>
>> On 29/10/2013 20:42, Rui Barradas wrote:
>>> Hello,
>>>
>>> You can use the argument to write.csv or write.table  append = TRUE to
>>> write the matrix in chunks. Something like the following.
>>
>> That was going to be my suggestion. But the reason long vectors have not been implemented is that is rather implausible to be useful.   A text file with the values of such a numeric matrix is likely to be 100GB. What are you going to do with such a file?  For transfer to another program I would seriously consider a binary format (e.g. use writeBin), as it is the conversion to and from text that is time consuming.
>
> I need to submit it to a cluster analysis (k-means). From an independent source I have been advised to use means algorithm written in C which is very fast and efficient. It asks for a txt file as an input.
>
> I tried few options in R, where I am more comfortable, but solution never came, even after too many hours.
>
> Thanks!
> Best,
> PM

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319


