[R] Errors in data frames from read.table
docpat2511 at yahoo.com
Mon Jul 16 16:13:09 CEST 2007
I am working on a project with a large (~350 MB, about 5800 rows) insurance claims dataset. It was supplied in a tilde (~)-delimited format. I imported it into a data frame in R by setting memory.limit to the maximum for my computer (4 GB) and using read.table.
The resulting data frame had 10 bad rows. The errors appear to be due to read.table missing delimiter characters: several data values are imported into the same cell, and the remainder of that row and the next row are run together and garbled because of the resulting frame shift. For example, a single cell might contain <datum>~ ~ <datum> ~<datum>, after which all the cells of that row and the next are wrong.
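A minimal sketch of one way to locate such rows, assuming (the post does not confirm this) that the cause is a quote or comment character inside the data. By default count.fields, like read.table, treats both " and ' as quote characters and # as a comment character. Comparing the per-line field counts with and without that handling flags the lines where it changes the result. The sample file and the names path, n_default, n_raw and suspect are hypothetical stand-ins for the real data.

```r
# Hypothetical sample standing in for the real file; the apostrophes in
# lines 3 and 4 are treated as quote characters under the defaults.
path <- tempfile(fileext = ".txt")
writeLines(c("id~name~amount",
             "1~Smith~100",
             "2~O'Brien~250",
             "3~D'Arcy~75"), path)

# Fields per line with the default quote and comment handling
n_default <- count.fields(path, sep = "~")

# Fields per line with quote and comment handling switched off
n_raw <- count.fields(path, sep = "~", quote = "", comment.char = "")

# Lines where the two counts disagree, or where the default count is NA
# (the line starts a quoted string that runs onto the next line)
suspect <- which(is.na(n_default) | n_default != n_raw)

# A table of the raw counts shows whether the file itself is consistent
table(n_raw)
```

If n_raw is the same for every line but suspect is not empty, the delimiters in the file are intact and the problem lies in how quotes or comments are interpreted on import.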
To replicate, I tried the same import procedure on a smaller demographics data set from the same supplier (only about 1 MB) and got the same kinds of errors (5 bad rows in about 3500). I also imported as much of the file as Excel would hold and cross-checked; Excel did not produce the same errors, but it can't handle the entire file. I have used read.table on a number of other formats (mainly csv and tab-delimited) without such problems. So far it appears there is something different about these files that produces the errors, but I can't see what it would be.
Does anyone have any thoughts about what is going wrong? And is there a way, short of manual correction, to fix it?
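A hedged sketch of one thing to try, again assuming the cause is a stray quote or comment character in the data, which is not established here. read.table defaults to quote = "\"'" and comment.char = "#", so an apostrophe or a # inside a field can swallow delimiters. Setting both to "" makes read.table treat those characters as ordinary text. The sample file and the names path and claims are hypothetical.

```r
# Hypothetical sample with apostrophes and a '#' inside the fields
path <- tempfile(fileext = ".txt")
writeLines(c("id~name~note",
             "1~Smith~paid",
             "2~O'Brien~claim #2 pending",
             "3~D'Arcy~paid"), path)

claims <- read.table(path, header = TRUE, sep = "~",
                     quote = "",          # treat ' and " as ordinary characters
                     comment.char = "",   # treat # as an ordinary character
                     stringsAsFactors = FALSE)

# Check that the dimensions match what the file should contain
dim(claims)
```

If the real file does use quotes to protect embedded delimiters, disabling quoting would split those fields, so the row and column counts are worth checking against the source after the import.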
Thanks for all help,