[R] trouble with read.table and colClasses='raw'

Henrik Bengtsson hb at stat.berkeley.edu
Fri Feb 12 15:53:56 CET 2010


The OP may be interested in using the low-level readBin() and writeBin()
instead.  One can then assign dimension attributes to the object to
access the data as a matrix/an array.  Note that assigning dimension
attributes will probably(?) allocate a copy.  If that is not wanted,
it is not that hard to translate matrix/array indices into 1D-vector
indices and vice versa, cf. arrayIndex() of R.utils.
readBin()/writeBin() would also allow you to use random access to any
part of the data [which read.table() won't].
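
A minimal sketch of the above, using only base R.  The file name, the
5x3 size and the variable names are made up for illustration, and the
index arithmetic is written out by hand here rather than calling
arrayIndex():

```r
# Sketch: store a raw matrix with writeBin() and read it back with readBin().
# File name and dimensions are illustrative assumptions.
path <- tempfile(fileext = ".bin")
nr <- 5L; nc <- 3L
x <- matrix(as.raw(1:15), nrow = nr, ncol = nc)

# writeBin() wants a plain vector; the bytes land in column-major order
writeBin(as.vector(x), con = path)

# Read everything back and restore the dimensions
y <- readBin(path, what = "raw", n = nr * nc)
dim(y) <- c(nr, nc)
stopifnot(identical(x, y))

# Without dimensions: element (i, j) sits at 1D index (j - 1) * nr + i
v <- readBin(path, what = "raw", n = nr * nc)
i <- 2L; j <- 3L
idx <- (j - 1L) * nr + i
stopifnot(identical(v[idx], x[i, j]))

# Random access: seek() to the byte offset of column j and read only that column
con <- file(path, open = "rb")
seek(con, where = (j - 1L) * nr, origin = "start")
col_j <- readBin(con, what = "raw", n = nr)
close(con)
stopifnot(identical(col_j, x[, j]))
unlink(path)
```

Since one raw element is one byte, the byte offset in the file is simply
the zero-based 1D index, which is what makes the seek() step so short.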

To me it sounds a bit odd to use read.table() for "raw" data; it is
intended to be used with (ASCII only?) text files that do not have a
fixed number of symbols per entry and row, but instead rely on
separators and newlines to identify the rectangular structure.
readBin()/writeBin() would certainly provide a more compact file
format.  Also, I do not think read.table() would allow you to use NUL
(\000), because that is reserved as the end-of-string symbol.
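
To illustrate the NUL point, here is a small sketch (the three byte
values are arbitrary, chosen only for the example).  It shows a zero
byte surviving a writeBin()/readBin() round trip, while turning the
same bytes into a single character string is refused:

```r
# as.raw(0) is a perfectly valid raw value ...
bytes <- as.raw(c(72, 0, 105))   # "H", NUL, "i"

# ... and it round-trips through writeBin()/readBin()
path <- tempfile(fileext = ".bin")
writeBin(bytes, path)
back <- readBin(path, what = "raw", n = length(bytes))
stopifnot(identical(bytes, back))
unlink(path)

# Converting the same bytes to one character string fails,
# because an R string cannot hold an embedded NUL
res <- tryCatch(rawToChar(bytes), error = function(e) conditionMessage(e))
print(res)
```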

My $.02

/Henrik



On Fri, Feb 12, 2010 at 1:33 AM, jim holtman <jholtman at gmail.com> wrote:
> What you might consider is to use save/load for storing the data in a
> format that is easily accessible in R, and then using write.table for
> creating a character based output for other external programs.  For
> the size of files you are working with, this is the easiest and fastest
> way of doing it.
>
> On Thu, Feb 11, 2010 at 4:08 PM, Johan Jackson
> <johan.h.jackson at gmail.com> wrote:
>> Apologies for my sarcastic/defensive reply email Peter.
>>
>> The issue is that I need this matrix to be read into other programs - not
>> just R, so save() won't work. I like 'raw' mode because it saves so much
>> space, but it's difficult to work with. This read/write issue is but one
>> example; another is that R will try to convert the raw matrix to, e.g.,
>> double, if you forget and assign any element of it to be double (personally,
>> I'd prefer there to be an option, set in options(), for R to downcast the
>> variable to raw and give you a warning).
>>
>> Anyway, I've been working with R a bit, but I've come to the conclusion that
>> it is just not user-friendly when it comes to large datasets. I've tried
>> some of the large data packages but at least all that I've tried have their
>> own sets of issues. As much as it pains me to say it, I may go back to SAS
>> when working on such projects...
>>
>> Best,
>>
>> JJ
>>
>>
>>
>>
>>
>> On Thu, Feb 11, 2010 at 1:19 PM, Peter Ehlers <ehlers at ucalgary.ca> wrote:
>>
>>> Johan,
>>>
>>> My apologies if you took my comments to be sarcastic; they were
>>> certainly not meant to be. I have no desire to put you or anyone
>>> down.
>>>
>>> I see now that you want to somehow store data more 'efficiently',
>>> presumably in order to be able to handle larger objects in RAM.
>>>
>>> I doubt that storage.mode raw will help. Your post implied that
>>> you had saved an object and couldn't read it back into the same
>>> format in which you think it was saved. So, did you have a 16Gb
>>> object to save? And why wouldn't you use save()? It's just a
>>> guess, but I think you may have a file of _character_ data that
>>> you want to read into R where its storage mode should be 'raw'.
>>> I don't know how to do that.
>>>
>>> If the main purpose is to circumvent R's memory requirements,
>>> then there have been plenty of posts on that issue.
>>>
>>>  -Peter Ehlers
>>>
>>>
>>> Johan Jackson wrote:
>>>
>>>> "I suspect that you really don't know what 'raw' type means and haven't
>>>> bothered to check ?raw. It's also pretty clear that you haven't read the
>>>> colClasses description in ?read.table very carefully."
>>>>
>>>> Gee, thanks Peter (this is what I love about the R help boards: people
>>>> whose sole goal is to put others down as wittily as possible for asking
>>>> *stupid stupid* questions). Gives me warm fuzzies :)
>>>>
>>>> Although I admit to not being the brightest of folks around, or knowing R
>>>> backwards and forwards, I did read ?read.table and ?raw. But your
>>>> suggestion is not at all helpful Peter:
>>>>
>>>> dat <- read.table(file="data", header=TRUE, colClasses="character")
>>>> # wow! it works on a 5x3 matrix! amazing!! (sarcasm)
>>>>
>>>> dat2 <- as.matrix(dat)
>>>> storage.mode(dat2) <- 'raw'
>>>>
>>>> if I had wanted 'character' data, I would have put that into my question.
>>>> Any newbie can do what you did; the issue is that object.size(dat) is
>>>> about 8 times larger than object.size(dat2) with any large dataset.
>>>> That's why I want to store it as 'raw' - because the raw one takes about
>>>> 2 Gb RAM and the other about 16Gb! Perhaps you need to understand the raw
>>>> mode a bit better, Peter, because I thought the reason for wanting the
>>>> data in 'raw' was quite obvious, but I guess not.
>>>>
>>>> Peter, here's what I want you to do. Use R to make a vector with 2^31 - 5
>>>> elements in it. Hey, make it of mode 'character' while you're at it! Write
>>>> it out. Read it back in. Having problems? Then come talk to me...
>>>>
>>>> JJ
>>>>
>>>>  [....]
>>>
>>> --
>>> Peter Ehlers
>>> University of Calgary
>>>
>>>
>>
>>        [[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>
>
>
> --
> Jim Holtman
> Cincinnati, OH
> +1 513 646 9390
>
> What is the problem that you are trying to solve?


