[R] How to save very large matrix?
hpages at fhcrc.org
Wed Oct 30 00:14:13 CET 2013
If you're going to share this matrix across R sessions, save()/load() is
probably one of your best options.
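As a minimal sketch of that save()/load() round trip (the small matrix `m` and the temporary file path are stand-ins chosen for illustration, not taken from the original question):

```r
# Small stand-in matrix; the real one would be far larger.
m  <- matrix(runif(20), nrow = 4)
m0 <- m                                # keep a copy for comparison

path <- tempfile(fileext = ".RData")   # hypothetical file location
save(m, file = path)                   # stores the object under its name "m"
rm(m)

restored <- load(path)                 # recreates "m", returns the restored names
```

Note that load() puts the object back under its original name rather than returning it, so the value of `restored` is only the character vector of names.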
Otherwise, you could try the rhdf5 package from Bioconductor:
1. Install the package with:
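A sketch of the installation step, assuming a current R setup where Bioconductor packages are installed through the BiocManager package (the command in use when this was posted may have differed):

```r
# Assumption: BiocManager is the installer for Bioconductor packages
# on this R version; install it from CRAN first if it is missing.
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("rhdf5")
```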
    library(rhdf5)

    # write a matrix
    my_big_matrix <- matrix(runif(5000*10000), nrow=5000)
    attr(my_big_matrix, "scale") <- "liter"
    h5write(my_big_matrix, "my_big_matrix.h5", "my_big_matrix")  # takes 1 min.
    # file size on disk is 248M

    # read a matrix
    my_big_matrix <- h5read("my_big_matrix.h5", "my_big_matrix")  # takes 7.4 sec.
Multiply the above numbers (obtained on a laptop with a traditional
hard drive) by 100 for your monster matrix, or less if you have much
faster hardware.
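The scaling can be worked through as a rough back-of-the-envelope sketch. It assumes that time and file size grow linearly with the number of elements and that "248M" means MiB; both are assumptions, and the variable names are made up for illustration:

```r
n_elements   <- 5000 * 10000           # elements in the test matrix
mem_mib      <- n_elements * 8 / 2^20  # doubles take 8 bytes each
scale_factor <- 100                    # factor suggested for the big matrix

est_file_gib  <- 248 * scale_factor / 1024  # scaled from 248M on disk
est_write_min <- 1 * scale_factor           # scaled from 1 min
est_read_min  <- 7.4 * scale_factor / 60    # scaled from 7.4 sec

# The scaled element count no longer fits in a 32-bit integer index.
needs_long_vector <- n_elements * scale_factor > .Machine$integer.max
```

Under those assumptions the full matrix would take on the order of tens of GiB on disk and well over an hour to write on comparable hardware.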
Two advantages of using the HDF5 format: (1) it should not be too hard
to use the HDF5 C library in the C code you're going to use to read the
matrix, and (2) my understanding is that HDF5 is good at letting you
access arbitrary slices of the data, so chunk-processing should be easy.
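Slice access from R can be sketched with the `index` argument of h5read(), where a NULL entry selects a whole dimension. The example below uses a small stand-in matrix, a temporary file, and a made-up dataset name "m" so that it runs quickly; the chunk size is an arbitrary choice:

```r
library(rhdf5)

# Small stand-in for the big matrix, written to a temporary HDF5 file.
m    <- matrix(runif(80), nrow = 20)
path <- tempfile(fileext = ".h5")      # hypothetical file location
h5createFile(path)
h5write(m, path, "m")

# Process the dataset in blocks of rows without loading it all at once.
chunk_size <- 5
total      <- 0
for (start in seq(1, nrow(m), by = chunk_size)) {
    rows  <- start:min(start + chunk_size - 1, nrow(m))
    block <- h5read(path, "m", index = list(rows, NULL))  # NULL = all columns
    total <- total + sum(block)
}
```

For a real matrix the chunk size would be picked so that one block fits comfortably in memory.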
On 10/29/2013 02:34 PM, Petar Milin wrote:
> On Oct 29, 2013, at 10:16 PM, Prof Brian Ripley <ripley at stats.ox.ac.uk> wrote:
>> On 29/10/2013 20:42, Rui Barradas wrote:
>>> You can use the argument append = TRUE of write.table to
>>> write the matrix in chunks. Something like the following.
>> That was going to be my suggestion. But the reason long vectors have not been implemented is that it is rather implausible to be useful. A text file with the values of such a numeric matrix is likely to be 100GB. What are you going to do with such a file? For transfer to another program I would seriously consider a binary format (e.g. use writeBin), as it is the conversion to and from text that is time-consuming.
> I need to submit it to a cluster analysis (k-means). From an independent source I have been advised to use a k-means algorithm written in C, which is very fast and efficient. It asks for a txt file as input.
> I tried a few options in R, where I am more comfortable, but a solution never came, even after many hours.
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319