[R] Diagnostic and helper functions for defective & hard-to-import files
Richard M. Heiberger
rmh at temple.edu
Thu Jan 30 06:59:21 CET 2014
For Census data specifically you might want to look at the acs package.
I learned about it from
Ray DiGiacomo, Jr. rayd at liondatasystems.com
I am doing an R-oriented US Census webinar on August 29.
See this site for the recording of the webinar.
I think you will find this webinar interesting as it will debut the
new "acs" R package.
Ray DiGiacomo, Jr.
Master R Trainer
Healthcare Predictive Analytics Specialist
President, Lion Data Systems LLC
rayd at liondatasystems.com
Irvine, California USA
On Wed, Jan 29, 2014 at 2:12 AM, andrewH <ahoerner at rprogress.org> wrote:
> On Jan 28, 2014 at 8:56pm, David Winsemius wrote:
> On Jan 28, 2014, at 8:43 PM, andrewH wrote:
>> Hi Folks!
>> I have been writing a small set of utilities for dealing with files that
>> are hard to open correctly for one reason or another, especially because they
>> are too big for memory, non-rectangular, or contain odd characters or
>> unexpected codings, or all of these things together. Today it suddenly hit
>> me that this has probably been done, done better, and upgraded to package
>> form a dozen times already. There were pointers to a couple functions
>> in this regard in the Core Import/Export document. But my effort to come up
>> with search terms that were productive of such packages was unsuccessful.
> I don't know of a package to do that. You know the quote from that Russian
> author whose name I am forgetting (in "Anna Karenina" perhaps) about happy
> families being all the same but unhappy families being impossible to
> classify. I think it applies to datasets as well. There are too many
> different dataset pathologies to allow a neat packaging approach.
> My approach has been to study the options in read.table very carefully and
> if that is insufficient look at either readLines or scan as options. It is
> very useful to be able to use `count.fields` with different settings of the
> `quote` and `comment.char` arguments. Wrapping it in table() can deliver a
> very compact, useful result.
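
A minimal sketch of the `count.fields` plus table() approach described above. The sample file and the variable names (`tmp`, `n_quoted`, `n_raw`, `n_comment`, `bad_lines`) are made up for illustration and are not taken from the thread; only base R functions are used.

```r
# Made-up sample file with three different problems:
# a quoted comma, an extra field, and a '#' inside a line.
tmp <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,note",
  "1,Ann,ok",
  "2,\"Smith, Bob\",ok",
  "3,Cy,x,y",
  "4,Di # see note,ok"
), tmp)

# Number of fields per line under three parameter settings.
n_quoted  <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "")
n_raw     <- count.fields(tmp, sep = ",", quote = "",   comment.char = "")
n_comment <- count.fields(tmp, sep = ",", quote = "\"", comment.char = "#")

# Compact summary: how many lines have how many fields?
print(table(n_quoted))
print(table(n_raw))
print(table(n_comment))

# Line numbers whose field count differs from the header line's count.
bad_lines <- which(n_quoted != n_quoted[1])
print(bad_lines)
```

Comparing the tables across settings suggests which lines are affected by quoting and which by comment characters; `bad_lines` then points at the specific lines to inspect with readLines.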
> And don't forget to search the Archives if you have a regular but
> non-rectangular arrangement.
> David Winsemius
> Alameda, CA, USA
> Thanks, David!
> You have quickly summarized a set of techniques that it took me a long time
> to learn (much of it spent disentangling the truth from various
> misconceptions about the data-reading process). I don't think I have very
> much to add to your list, but as always, the effectiveness depends on
> correct implementation, and I have made a _lot_ of mistakes in trying to
> implement these in the past. Moreover, all these things become much more
> complicated if the file is too big to just read into a data frame. I am
> working with Census records right now, and my primary data file is a 14 gig
> csv that had me tearing my hair out trying to read it and pull out the
> variables I have needed at any given moment.
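
One common way to pull a few variables out of a csv that is too big for memory is to read it in chunks from an open connection and drop unwanted columns as they are read. The sketch below is an illustration under assumptions, not the method used in the thread: the function name `read_csv_subset` and its arguments `keep` and `chunk_size` are hypothetical, the file is assumed to have a header line and comma separators, and the demonstration file is made up.

```r
# Hypothetical helper: read only the columns named in `keep`,
# `chunk_size` rows at a time, from a csv with a header line.
read_csv_subset <- function(path, keep, chunk_size = 10000L) {
  con <- file(path, open = "r")
  on.exit(close(con))

  # The first line holds the column names.
  cols <- scan(text = readLines(con, n = 1L), what = "", sep = ",",
               quiet = TRUE)
  # "NULL" tells read.csv to skip a column; NA means "guess the type".
  classes <- ifelse(cols %in% keep, NA, "NULL")

  pieces <- list()
  repeat {
    chunk <- tryCatch(
      read.csv(con, header = FALSE, nrows = chunk_size,
               col.names = cols, colClasses = classes,
               stringsAsFactors = FALSE),
      # read.csv raises an error when no lines are left; for brevity
      # this sketch treats any error as end of input.
      error = function(e) NULL
    )
    if (is.null(chunk) || nrow(chunk) == 0L) break
    pieces[[length(pieces) + 1L]] <- chunk
    if (nrow(chunk) < chunk_size) break
  }
  do.call(rbind, pieces)
}

# Made-up demonstration file: 25 rows, three columns.
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(a = 1:25, b = letters[1:25], c = (1:25) * 2L),
          tmp, row.names = FALSE)

sub <- read_csv_subset(tmp, keep = c("a", "c"), chunk_size = 10L)
str(sub)
```

Because column types are guessed per chunk, a real file may need explicit type names in `colClasses` so that all chunks agree. For a file of many gigabytes, each chunk would normally be filtered or aggregated inside the loop rather than accumulated in a list.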
> I finally did get it read and the right subset extracted, but it was a
> pretty empirical process - I would just keep trying things that didn't work
> until I found something that did, often not quite understanding why my
> previous efforts had failed. I know that if I have to do this again six
> months from now I will have no idea how I did it. So I wanted to reduce the
> things that worked to functions and set up a sort of decision tree that I
> could work through to find and correct at least the more common problems.
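
As one possible first step in such a decision tree, a small function can flag lines containing non-ASCII bytes or bytes that are not valid UTF-8, which covers the "odd characters or unexpected codings" case. This is only a sketch: the name `check_encoding` and its return structure are hypothetical, the sample file is made up, and `validUTF8` requires R 3.3.0 or later.

```r
# Hypothetical helper: inspect the first `n` lines of a file for
# bytes above 127 and for byte sequences that are not valid UTF-8.
check_encoding <- function(path, n = 1000L) {
  x <- readLines(path, n = n, warn = FALSE)
  has_high_byte <- vapply(
    x,
    function(s) any(as.integer(charToRaw(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
  list(
    lines_read   = length(x),
    non_ascii    = which(has_high_byte),   # line numbers with bytes > 127
    invalid_utf8 = which(!validUTF8(x))    # line numbers that are not UTF-8
  )
}

# Made-up sample: line 3 has a Latin-1 e-acute (byte E9),
# line 4 has a UTF-8 e-diaeresis (bytes C3 AB).
tmp <- tempfile(fileext = ".csv")
writeBin(c(charToRaw("id,name\n1,Ann\n2,Jos"), as.raw(0xE9),
           charToRaw("\n3,Zo"), as.raw(c(0xC3, 0xAB)),
           charToRaw("\n")), tmp)

res <- check_encoding(tmp)
str(res)
```

Lines listed under `invalid_utf8` suggest the file may be in a single-byte encoding such as Latin-1, in which case passing a `fileEncoding` argument to read.table or read.csv is one thing to try next.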
> But I was hoping -- am still hoping, actually -- to find that someone else
> has already done this so I could get back to my real work. It seems like the
> sort of thing that could easily be buried in the 100+ pages of documentation
> of one of the big utility packages like Hmisc, MASS or car.
> I have often wished there was a data manipulation and import/export task
> view, with a purview to cover things like what I am talking about here, the
> contents of Phil Spector's book, and packages like Hadley Wickham's plyr.
> Warmest regards, andrewH
> View this message in context: http://r.789695.n4.nabble.com/Diagnostic-and-helper-functions-for-defective-hard-to-import-files-tp4684357p4684364.html