[BioC] rhdf5 and factors

Wolfgang Huber whuber at embl.de
Thu Jan 31 14:06:46 CET 2013


Dear Martin

thank you for digging into this. I agree that it should not be hard to use (r)hdf5 as a storage backend for any R object by recursively saving the object in terms of its simple-type components. My question is: what would be the use case for that, given that we already have 'save' and 'load' in the base package?

The use cases that we have been thinking about here, so far, involve:
(i) efficient read/write access to subarrays (hyperslabs), in particular for very large arrays that do not reside in memory
(ii) inter-language exchange of data
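
Use case (i) might look roughly like the following. This is only a sketch: the temporary file, the dataset name "m" and the small matrix are made up for illustration, and it relies on the index argument of h5read/h5write taking a list with one entry per dimension.

```r
library(rhdf5)

# Illustrative file and data; nothing here comes from a real data set.
fl <- tempfile(fileext = ".h5")
h5createFile(fl)

m <- matrix(seq_len(200), nrow = 20, ncol = 10)
h5write(m, fl, "m")

# Read only rows 1-5 of columns 2-3; the rest of the array is not loaded.
sub <- h5read(fl, "m", index = list(1:5, 2:3))

# Write a modified block back into the same region of the file.
h5write(sub * 10L, fl, "m", index = list(1:5, 2:3))
```

With a genuinely large array only the requested block has to pass through memory, which is the point of the hyperslab interface.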

For both of these, it is clearly useful to deal with basic array types, and perhaps less so with more complex (non-sequential or R-idiosyncratic) objects.
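
A factor fits this picture once it is split into two basic arrays, the integer codes and the character levels, as Bernd suggests in the message quoted below. A minimal sketch of the round trip, reusing his dataset names objCODES and objLEVELS; the example vector and the reconstruction step are assumptions for illustration, not something rhdf5 does for you.

```r
library(rhdf5)

fl <- tempfile(fileext = ".h5")
h5createFile(fl)

obj <- factor(c("M", "F", "M"))   # made-up example data, no missing values

# Store the two basic components as separate datasets.
h5write(as.integer(obj), file = fl, name = "objCODES")
h5write(levels(obj),     file = fl, name = "objLEVELS")

# Read them back; as.integer()/as.character() drop the 1-d array
# dimension that h5read attaches to the result.
codes <- as.integer(h5read(fl, "objCODES"))
lev   <- as.character(h5read(fl, "objLEVELS"))

# Rebuild the factor from its codes and levels.
restored <- factor(lev[codes], levels = lev)
```

Any other HDF5 reader simply sees an integer dataset and a string dataset, which suits inter-language exchange; the codes are 1-based, as in R.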

Or are you thinking of
(iii) creating a full-fledged alternative to base::save and base::load?

	Best wishes
	Wolfgang




On Jan 27, 2013, at 9:40 PM, Martin Morgan <mtmorgan at fhcrc.org> wrote:

> On 01/27/2013 09:42 AM, Bernd Fischer wrote:
>> Dear Moritz!
>> 
>> An easy solution for you would be to separately write the factor-values (the integers)
>> and the levels:
>> 
>>> h5write(as.integer(obj), file=file, name="objCODES")
>>> h5write(levels(obj), file=file, name="objLEVELS")
> 
> I was thinking this would work
> 
> f = factor(c("M", "F"))
> h5createFile(fl <- tempfile())
> res = h5write(f, fl, write.attributes=TRUE, name="f")
> 
> but the last line fails ('no applicable method for 'h5writeDataset' applied to an object of class "factor"'), so I then tried
> 
> res = h5write(unclass(f), fl, write.attributes=TRUE, name="f")
> 
> which doesn't fail, but doesn't seem to work either (the levels are missing when read back):
> 
>> dput(h5read(fl, "f", read.attributes=TRUE))
> structure(c(2L, 1L), .Dim = 2L)
>> dput(unclass(f))
> structure(c(2L, 1L), .Label = c("F", "M"))
> 
> I initially went down this line thinking that, since a factor (like many other R entities) is just a basic type plus attributes, it would be easy to support serializing a broad range of R data types (read/write.attributes=TRUE would be a better default if the objective were to provide a transparent way to use hdf5 as a storage back-end, which I think would be cool). But maybe there's no intention, getting back to the original poster's question, to support this kind of high-level functionality in this package? Or maybe there's scope for an elegant (because one just has to recurse through an R object to save it) additional package that extends rhdf5?
> 
> Martin
> 
> 
>> 
>> Best,
>> 
>> Bernd
>> 
>> 
>> 
>> --
>> Bernd Fischer
>> EMBL Heidelberg
>> Meyerhofstraße 1
>> 69117 Heidelberg
>> Tel: +49 [0] 6221 387-8131
>> E-Mail: bernd.fischer at embl.de
>> Homepage: http://www-huber.embl.de/users/befische/
>> 
>> 
>> 
>> 
>> 
>> 
>> On 23.01.2013, at 16:05, Moritz Emanuel Beber <moritz.beber at gmail.com> wrote:
>> 
>>> Dear all,
>>> 
>>> I sent a message to Bernd Fischer, the maintainer of rhdf5, directly but got no response from him. My qualm lies with the writing and re-reading of factor vectors using rhdf5. In the current release they are simply written as integers, and upon reading the HDF5 files the levels are lost.
>>> 
>>> Of course, I could convert the factors to character vectors before writing but I wanted to ask whether there is a plan to implement better factor support or if it's feasible to contribute code to facilitate such support.
>>> 
>>> TIA,
>>> Moritz
>>> 
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> 
> 
> -- 
> Computational Biology / Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N.
> PO Box 19024 Seattle, WA 98109
> 
> Location: Arnold Building M1 B861
> Phone: (206) 667-2793


