[R] Getting information encoded in a SAS, SPSS or Stata command file into R.

David Winsemius dwinsemius at comcast.net
Tue Nov 13 20:41:14 CET 2012


On Nov 13, 2012, at 7:20 AM, Anthony Damico wrote:

> Hi Andrew, to work with the Current Population Survey in R, your best
> bet is to use a variant of my SAScii package that works with a SQLite
> database (and therefore doesn't overload RAM).
> 
> I have written obsessively-documented code about how to work with the CPS
> in R here..
> 
> http://usgsd.blogspot.com/search/label/current%20population%20survey%20%28cps%29
> 
> ..but each example only loads one year of data at a time.  The function
> read.SAScii.sqlite() used in that code can be run on a 51-year data set
> just the same.
> 
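> For the flavor of it, a minimal sketch -- the argument names here are
> illustrative, so check the actual script for the exact interface:
> 
>     library(SAScii)
>     library(RSQLite)
> 
>     # open (or create) an on-disk SQLite database
>     db <- dbConnect( SQLite() , "cps.db" )
> 
>     # pour the fixed-width file into a database table, guided by the
>     # SAS command file, without ever holding it all in RAM
>     read.SAScii.sqlite(
>         fn = "cps_fwf.dat" ,     # the flat data file
>         sas_ri = "cps.sas" ,     # the SAS command file
>         tablename = "cps" ,
>         conn = db
>     )
> 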
> If you need to generate standard errors, confidence intervals, or
> variances, I don't recommend using ffdf for complex sample surveys -- in my
> experience it doesn't work well with R's survey package.
> 
> These scripts use the Census Bureau version of the CPS, but you can make
> some slight changes and get it working on IPUMS files too..  Let me know if
> you run into any trouble.  :)

I'd like to take this opportunity to thank Anthony for his work on this dataset as well as on several others. The ones I am most interested in are the NHANES-III and Continuous NHANES datasets, and he has the 2009-2010 set from the Continuous NHANES series represented in his examples. Scraping the list of datasets from his website:

available data

	• area resource file (arf) (1)
	• consumer expenditure survey (ce) (1)
	• current population survey (cps) (1)
	• general social survey (gss) (1)
	• national health and nutrition examination survey (nhanes) (1)
	• national health interview survey (nhis) (1)
	• national study of drug use and health (nsduh) (1)

And thanks to you for this question, andrewH; 

... it prompted a response from Jan van der Laan pointing to a package of his, which led (via a reverse-Depends citation) to the SEERabomb package by Tomas Radivoyevitch that provides examples of handling the SEER datasets, at least the Hematologic tumors dataset. My experience with SEER data in the past has been entirely mediated through SEER*Stat, which is a (somewhat) user-friendly Windows package for working with the SEER fixed-field formats, but it will be exciting to see another accessible avenue through R.

Thanks, Anthony, Jan, and andrewH, and further thanks to Thomas Lumley, on whose work I believe Anthony's package Depends because of the need for proper handling of the sampling weights.
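For reference, Thomas Lumley's survey package can point a design object straight at such a database table instead of an in-memory data frame. A rough sketch; the weight, strata, and analysis variable names here are placeholders rather than the real CPS names:

    library(survey)
    library(RSQLite)

    # a database-backed design: variables are pulled from SQLite
    # only as each analysis needs them
    cps.design <-
        svydesign(
            ids = ~ 1 ,
            strata = ~ strata_var ,     # placeholder name
            weights = ~ weight_var ,    # placeholder name
            data = "cps" ,              # table name in the database
            dbname = "cps.db" ,
            dbtype = "SQLite"
        )

    svymean( ~ some_variable , cps.design )    # placeholder name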

> 
> Anthony
> 
> 
> 
> On Mon, Nov 12, 2012 at 11:23 PM, andrewH <ahoerner at rprogress.org> wrote:
> 
>> Dear folks,
>> I have a large (26 gig) ASCII flat file in fixed-width format with about 10
>> million observations of roughly 400 variables.  (It is 51 years of Current
>> Population Survey micro data from IPUMS, roughly half the fields for each
>> record).  The file was produced by an automatic process in response to a
>> data request of mine.
>> 
>> The file is not accompanied by a human-readable file giving the field
>> names and starting positions for each field.  Instead it comes with three
>> command files that describe the file, one each for SAS, SPSS, and Stata.
>> I do not have ready access to any of these programs.  I understand that
>> these files also include the equivalent of the levels attribute for the
>> coded data.  I might be able to hand-extract the information I need from
>> the command files, but this would involve days of tedious work that I am
>> hoping to avoid.
>> 
>> I have read through the R Data Import/Export manual and the foreign
>> package documentation, and I do not see anything that would allow me to
>> extract the necessary information from these command files.  Does anyone
>> know of any R package or other non-proprietary tools that would allow me
>> to get this data set from its current form into any of the following
>> formats:
>> 
>>   - SAS, SPSS, or Stata binary files read by R
>>   - a MySQL database
>>   - an ffdf object readable using the ff package
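As Anthony notes above, his SAScii package automates exactly this extraction from the SAS command file, so no hand-transcription should be needed. A minimal sketch (the file name is a placeholder for wherever that command file lives):

    library(SAScii)

    # parse the SAS INPUT statements into a data.frame describing
    # each field: its name, width, and whether it is character
    dict <- parse.SAScii( "cps.sas" )
    head( dict )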
>> 
>> My ultimate goal is to get the data into an ffdf object so that I can
>> manipulate it in R, perhaps by way of a database.  In any given analysis I
>> will probably be using no more than 20 variables at a time, probably a bit
>> under a gig.  I am working on a machine with three gigs of RAM.
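And once the data are sitting in SQLite, pulling a 20-variable slice back out into an ordinary data frame is just a query, well within a three-gig machine (the table, column, and file names below are placeholders):

    library(RSQLite)

    db <- dbConnect( SQLite() , "cps.db" )

    # select only the needed columns and rows; the result is a
    # plain data.frame small enough to fit in RAM
    slice <- dbGetQuery( db ,
        "SELECT year , age , incwage FROM cps WHERE year = 2010" )

    dbDisconnect( db )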
>> 
>> (I have seen some suggestions that data.table also provides a
>> memory-efficient way of performing database-like operations, but I am
>> unsure whether it would let me cope with an object of this size.)
>> 
>> Any help or suggestions anyone could offer would be very much appreciated.
>> 
>> Warmest regards, andrewH
>> 
> 


David Winsemius, MD
Alameda, CA, USA



