[R] help with loading National Comorbidity Survey

Thomas Lumley tlumley at u.washington.edu
Tue Oct 4 19:30:46 CEST 2005


On Sat, 1 Oct 2005, Jim Hurd wrote:
>
> The site provides data in DTA (Stata), XPT (SAS), and POR (SPSS) formats,
> all of which I have tried to read with the foreign package, but I am not
> able to load any of them. I have 2 GB of RAM, but R crashes when memory
> use gets just over 1 GB. I am using R version 2.1.1 on Windows. The DTA
> file is 48 MB; the XPT file is 188 MB.
>

If you mean the NCS 1 data file from that link (da06694-0001.dta) then I 
don't have this problem.

I have been able to load the .dta file under Windows on a computer with 
1 GB of RAM.  Peak memory use was about 350 MB.  It was very slow -- 
about half an hour.  The reason is that read.dta processes missing values 
and factor levels very inefficiently for very wide data frames: it calls 
[.data.frame, [<-.data.frame, and so on once per column, so the time is 
probably quadratic in the number of columns.
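To see why the per-column dispatch dominates, here is a toy timing (not read.dta itself, just an illustration): replacing columns of a data frame one at a time goes through [[<-.data.frame, which copies the whole frame on each call, whereas the same loop over a plain list touches only one column per call.

```r
## Toy illustration: per-column assignment on a wide data frame vs. a list.
## The data frame loop copies the entire frame on every iteration, so its
## total cost grows roughly quadratically with the number of columns.
p <- 2000
df <- as.data.frame(matrix(0, nrow = 10, ncol = p))
system.time(for (j in 1:p) df[[j]] <- df[[j]] + 1)   # slow: whole-frame copies

lst <- as.list(df)
system.time(for (j in 1:p) lst[[j]] <- lst[[j]] + 1) # fast: one column per call
```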

The call to .External that does the actual reading took less than 1% of 
the time. If you only want a hundred or so of the 3000 variables, it may 
be worth using that .External() call directly to read the data, subsetting 
the result, and then working out how to apply the factor levels and so on.
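A sketch of that approach follows. The name of the internal entry point is an assumption -- inspect the body of read.dta in your installed version of foreign to find the actual call -- and the variable names are hypothetical placeholders.

```r
library(foreign)
body(read.dta)   # locate the .External(...) call your version actually uses

## Assumed entry-point name; the raw result is a list of columns, with the
## Stata value-label and missing-value information stored in attributes.
raw <- .External("do_readStata", "da06694-0001.dta", PACKAGE = "foreign")

wanted <- c("CASEID", "V101", "V102")   # hypothetical variable names
small  <- as.data.frame(raw[wanted])    # subset before any factor processing
## ...then apply the relevant value-label attributes to these columns only.
```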

read.dta clearly needs a different algorithm to handle very wide data sets 
efficiently.

 	-thomas



