[R] Diagnostic and helper functions for defective & hard-to-import files

Richard M. Heiberger rmh at temple.edu
Thu Jan 30 06:59:21 CET 2014


For Census data specifically you might want to look at the acs package.

I learned about it from

Ray DiGiacomo, Jr. rayd at liondatasystems.com

I am doing an R-oriented US Census webinar on August 29.
See this site for the recording of the webinar.
http://liondatasystems.com/rug.html

I think you will find this webinar interesting as it will debut the
new "acs" R package.

Best Regards,

Ray DiGiacomo, Jr.
Master R Trainer
Healthcare Predictive Analytics Specialist
President, Lion Data Systems LLC
rayd at liondatasystems.com
949-374-2289
Irvine, California USA


On Wed, Jan 29, 2014 at 2:12 AM, andrewH <ahoerner at rprogress.org> wrote:
> On Jan 28, 2014 at 8:56pm, David Winsemius wrote:
>
> On Jan 28, 2014, at 8:43 PM, andrewH wrote:
>
>> Hi Folks!
>> I have been writing a small set of utilities for dealing with files that
>> are hard to open correctly for one reason or another, especially because
>> they are too big for memory, non-rectangular, or contain odd characters or
>> unexpected codings, or all of these things together. Today it suddenly hit
>> me that this has probably been done, done better, and upgraded to package
>> form a dozen times already. There were pointers to a couple functions
>> useful in this regard in the Core Import/Export document. But my effort to
>> come up with search terms that were productive of such packages was
>> unsuccessful.
>
> I don't know of a package to do that. You know the quote from that Russian
> author whose name I am forgetting (in "Anna Karenina" perhaps) about happy
> families being all the same but unhappy families being impossible to
> classify. I think it applies to datasets as well. There are too many
> different dataset pathologies to allow a neat packaging approach.
>
> My approach has been to study the options in read.table very carefully and
> if that is insufficient look at either readLines or scan as options. It is
> very useful to be able to use `count.fields` with different parameter
> settings of "quote" and "comment.char". Wrapping it in table() can deliver a
> very compact, useful result.
>
> And don't forget to search the Archives if you have a regular but
> non-rectangular arrangement.
>
> David Winsemius
> Alameda, CA, USA
>
> Thanks, David!
>
> You have quickly summarized a set of techniques that it took me a long time
> to learn (much of it spent disentangling the truth from various
> misconceptions about the data-reading process). I don't think I have very
> much to add to your list, but as always, the effectiveness depends on
> correct implementation, and I have made a _lot_ of mistakes in trying to
> implement these in the past. Moreover, all these things become much more
> complicated if the file is too big to just read into a data frame. I am
> working with Census records right now, and my primary data file is a 14 gig
> csv that had me tearing my hair out trying to read it and pull out the
> variables I have needed at any given moment.
>
> I finally did get it read and the right subset extracted, but it was a
> pretty empirical process - I would just keep trying things that didn't work
> until I found something that did, often not quite understanding why my
> previous efforts had failed. I know that if I have to do this again six
> months from now I will have no idea how I did it. So I wanted to reduce the
> things that worked to functions and set up a sort of decision tree that I
> could work through to find and correct at least the more common problems.
> But I was hoping -- am still hoping, actually -- to find that someone else
> has already done this so I could get back to my real work. It seems like the
> sort of thing that could easily be buried in the 100+ pages of documentation
> of one of the big utility packages like Hmisc, MASS or car.
>
> I have often wished there was a data manipulation and import/export task
> view, with a purview to cover things like what I am talking about here, the
> contents of Phil Spector's book, and packages like Hadley Wickham's plyr.
>
> Warmest regards, andrewH
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
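
For readers following the thread, here is a minimal sketch of the
count.fields() approach David Winsemius describes above: tabulate the number
of fields per line under different "quote" and "comment.char" settings, then
look at the odd lines with readLines(). The sample file, its contents, and
all variable names are made up for illustration; this is not code from any
of the posters.

```r
# Hypothetical sample file with three typical problems:
# a quoted comma, a '#' inside a field, and a short row.
path <- tempfile(fileext = ".csv")
writeLines(c(
  "id,name,comment",
  "1,Ann,ok",
  "2,Bob,\"likes a, b\"",
  "3,Cy,it's fine",
  "4,Dee",
  "5,Eve #2,ok"
), path)

# Fields per line with double quotes honoured and comments switched off
n_quoted <- count.fields(path, sep = ",", quote = "\"", comment.char = "")

# Same file with quoting switched off: the quoted comma now splits a field
n_raw <- count.fields(path, sep = ",", quote = "", comment.char = "")

# Same file with '#' treated as a comment character (the count.fields default)
n_hash <- count.fields(path, sep = ",", quote = "\"", comment.char = "#")

# Wrapping each result in table() gives a compact summary per setting
print(table(n_quoted))
print(table(n_raw))
print(table(n_hash))

# Find the lines whose field count differs from the most common count,
# then pull the raw text of those lines for inspection
expected  <- as.integer(names(which.max(table(n_quoted))))
bad_lines <- which(n_quoted != expected)
bad_text  <- readLines(path)[bad_lines]
print(data.frame(line = bad_lines, text = bad_text))
```

The positions in bad_lines match file line numbers here only because the
sample has no blank lines and no quoted fields spanning several lines. For a
file too large to read in one go, readLines() can be called on an open
connection with its n argument to pull a limited number of lines at a time.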



