[R] filehash error in the colbycol method for as.data.frame from a large object

andrewH ahoerner at rprogress.org
Sun Nov 24 04:38:38 CET 2013


Dear Folks-- 
I have a 14 GB .csv file with 731 columns. I read it into a colbycol
object using the code below, which ran overnight (about 16 hours) and
produced no warnings or error messages. The object, CPS62_12, is 49 GB.
After the read, summary() produced the output below and colnames()
successfully returned the names of all 731 columns.


> CPS62_12 <- cbc.read.table(
+ "C:\\R_PROJ\\INEQ_TRENDS\\TESTS\\monofile_ALLVARS\\cps_00078.csv",
+ header = T, sep = "," )
> summary(CPS62_12)
Object of class colbycol with 8093281 rows and 731 columns.
Data for the object is stored at
C:\DOCUME~1\ADMINI~2\LOCALS~1\Temp\RtmpMj3LRP\dir1a6d82a1df37.
> nrow(CPS62_12)
[1] 8093281
> ncol(CPS62_12)
[1] 731
> colnames(CPS62_12)
  [1] "RECTYPE"     "YEAR"        "SERIAL"      "MISH"  <etc.>

I then ran as.data.frame() (code below) and got the following error and
warning message:

> income_HH_CPS <-as.data.frame(CPS62_12,
+ c(YEAR, STATEFIP, RECTYPE, SERIAL, HWTSUPP, HHINCOME, NUMPREC))
Error in readSingleKey(con, map, key) : 
  unable to obtain value for key 'RECTYPE'
In addition: Warning message:
In readKeyMap(filecon) : NAs introduced by coercion

I tried the command with a number of column name combinations and from then
on always got the error message without the warning. The error is always on
RECTYPE, which is the name of the first column in the csv file.
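
For context, the warning text itself is the standard base R message emitted
when a character value that does not look like a number is converted to
numeric. The sketch below only reproduces that message. It assumes nothing
about what readKeyMap actually parses, and the input strings are made up.

```r
# Hypothetical input: one clean number and one string that is not a number.
raw_fields <- c("1024", "20x48")

# Capture the warning text instead of letting it print to the console.
msg <- NULL
res <- withCallingHandlers(
  as.numeric(raw_fields),
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

print(res)  # the second element is NA
print(msg)  # the captured warning text
```

If something similar happens while the key map is being read, a key could end
up without a usable position, but that is a guess on my part.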

I have not yet been able to reproduce this error on a smaller object. I read
the first 10 lines of my file into an object using a connection with
readLines. I printed the object in the console, pasted the result into
Notepad, and saved it. Then I manually sliced off all but the first 15
variables. The resulting file sailed through the code above and produced a
data frame faultlessly. This undercut my leading theory, which was that the
backslash double-quotes (\") that bracketed the column names were causing the
problem.
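
The same kind of reduced test file can be built without going through
Notepad, which avoids any console escaping creeping into the copy. This is a
minimal sketch that uses a small made-up CSV as a stand-in for the real file.
The names src and out and the 30-column toy data are hypothetical.

```r
# Build a small stand-in CSV (the real file has 731 columns).
src <- tempfile(fileext = ".csv")
full <- as.data.frame(matrix(seq_len(20 * 30), nrow = 20))
write.csv(full, src, row.names = FALSE)

# Read only the first 10 lines (header plus 9 data rows) through a connection.
con <- file(src, "r")
head_lines <- readLines(con, n = 10)
close(con)

# Parse those lines and keep the first 15 variables.
small <- read.csv(text = head_lines, header = TRUE)[, 1:15]

# Write the reduced file so it can be passed to the same reading code.
out <- tempfile(fileext = ".csv")
write.csv(small, out, row.names = FALSE)

print(dim(small))
```

Raising n and the column range step by step would show whether the error
depends on the number of rows or the number of columns.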

I tried running cbc.get.col on the second variable in the file, YEAR. These
two commands:

yearCPS <-cbc.get.col(CPS62_12, YEAR)
yearCPS <-cbc.get.col(CPS62_12, 2)

both resulted in the following error message:

Error in readSingleKey(con, map, key) : 
  unable to obtain value for key 'YEAR'

Note that numerical indexing still returned an error on the variable name,
YEAR. I got the same result for several other variables, with each error
naming the variable in question.

I tracked the error message back to the following function in the filehash
package:

readSingleKey <- function(con, map, key) {
        start <- map[[key]]
        if(is.null(start))
                stop(gettextf("unable to obtain value for key '%s'", key))
        seek(con, start, rw = "read")
        unserialize(con)
}

Now I am at a loss. I see that map[[key]] comes back NULL, that any call to
as.data.frame() uses RECTYPE as the key, and that any call to cbc.get.col()
uses the variable name as the key, even when the call passes only a column
number. But I don’t know much of anything about file hashing, and I have run
out of ideas.
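
To make the failure mode concrete, here is a toy stand-in for that lookup in
base R. It is a sketch under my own assumptions, not filehash's actual
storage code: values are serialized into one binary file, a named list
records each key's byte offset, and a key that is absent from the list gives
NULL from map[[key]], which triggers the same stop() as above. The names
lookup, values and path are hypothetical.

```r
# Toy store: serialized values appended to one binary file.
path <- tempfile()
values <- list(RECTYPE = c("H", "P"), YEAR = c(1962L, 1963L))

# "map" records the byte offset at which each key's value starts.
map <- list()
offset <- 0
con <- file(path, "wb")
for (key in names(values)) {
  bytes <- serialize(values[[key]], NULL)
  map[[key]] <- offset
  writeBin(bytes, con)
  offset <- offset + length(bytes)
}
close(con)

# Same shape as readSingleKey: look up the offset, seek, unserialize.
lookup <- function(con, map, key) {
  start <- map[[key]]
  if (is.null(start))
    stop(gettextf("unable to obtain value for key '%s'", key))
  seek(con, start, rw = "read")
  unserialize(con)
}

con <- file(path, "rb")
year <- lookup(con, map, "YEAR")  # key present: value is read back
missing_msg <- tryCatch(
  lookup(con, map, "SERIAL"),     # key absent: map[["SERIAL"]] is NULL
  error = function(e) conditionMessage(e)
)
close(con)

print(year)
print(missing_msg)
```

In this toy version the error only means that the key is missing from the
map. It says nothing about whether the data for that key was written to the
file.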

Can anyone tell me what I am doing wrong, or whether there is a particular
problem with my file that is likely to be causing this problem, or what my
next diagnostic step should be? Please be aware that I can only do things
that run in 3 GB of RAM.

I am running R under RStudio 0.97.551, on a Windows XP machine with Service
Pack 3.

Sincerely, andrewH






