[R] Memory Problems with CSV and Survey Objects

tlumley at u.washington.edu
Fri Oct 23 19:24:23 CEST 2009



Yes, a 350MB data frame is a bit big for 32-bit R to handle conveniently.

As you note, the survey package doesn't yet do database-backed replicate-weight designs. You can get the same effect yourself without too much work.

First, put the data into a database, such as SQLite. If you already have the data frame read in, dbWriteTable() will do it.
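
A minimal sketch of that first step, assuming the DBI and RSQLite packages are installed. The data frame here is a small synthetic stand-in, and its column names (id, age) and the temporary database file are made up for illustration.

```r
library(DBI)
library(RSQLite)

# Small synthetic stand-in for the real data frame (column names are made up).
set.seed(1)
data08 <- data.frame(
  id      = 1:100,
  mainwgt = runif(100, 50, 150),
  age     = sample(18:90, 100, replace = TRUE)
)

# A file-backed SQLite database; tempfile() is used only so the example
# cleans up after itself. A real analysis would use a permanent path.
dbfile <- tempfile(fileext = ".sqlite")
con <- dbConnect(SQLite(), dbname = dbfile)

# Copy the whole data frame into a table called "data08".
dbWriteTable(con, "data08", data08, row.names = FALSE)

# Check what landed in the database.
fields <- dbListFields(con, "data08")
nrows  <- dbGetQuery(con, "SELECT COUNT(*) AS n FROM data08")$n

dbDisconnect(con)
```

Once the table is written, the full data frame can be removed from the workspace and individual columns pulled back with dbGetQuery() as needed.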

Now drop most of the variables from the data frame in memory, keeping only the sampling weights, the replicate weights, and a couple of other variables.

Create a svrepdesign() with the reduced data set.

When you want to do an analysis, use dbGetQuery() to load the variables you need, in the same row order as the weights, and put them in the $variables component of the svrepdesign object.

That's exactly what the database-backed functions do for svydesign objects.
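
The steps above might look roughly like the following. This is only a sketch under stated assumptions: the data are synthetic, the replicate weights are random perturbations of the main weight rather than a real BRR scheme, and the id, income, and age columns are hypothetical. An explicit id column with ORDER BY is used so that the rows pulled for an analysis line up with the weights already in the design.

```r
library(DBI)
library(RSQLite)
library(survey)

# Synthetic stand-in for the real file: an id, a main weight, four made-up
# replicate weights, and two analysis variables (all names hypothetical
# except mainwgt and the repwgt prefix, which follow the original post).
set.seed(42)
n <- 200
data08 <- data.frame(
  id      = seq_len(n),
  mainwgt = runif(n, 50, 150),
  income  = rlnorm(n, 10, 1),
  age     = sample(18:90, n, replace = TRUE)
)
for (r in 1:4) {
  data08[[paste0("repwgt", r)]] <-
    data08$mainwgt * sample(c(0.5, 1.5), n, replace = TRUE)
}

# 1. Put everything in SQLite.
con <- dbConnect(SQLite(), dbname = ":memory:")
dbWriteTable(con, "data08", data08, row.names = FALSE)

# 2. Keep only the id and the weights in memory; drop the full data frame.
slim <- dbGetQuery(con,
  "SELECT id, mainwgt, repwgt1, repwgt2, repwgt3, repwgt4
     FROM data08 ORDER BY id")
rm(data08)

# 3. Build the replicate-weight design from the reduced data.
repcols <- grep("^repwgt", colnames(slim))
brr.dsgn <- svrepdesign(
  data             = slim[, "id", drop = FALSE],
  repweights       = slim[, repcols],
  weights          = slim$mainwgt,
  type             = "BRR",
  combined.weights = TRUE
)

# 4. For an analysis, load just the needed columns in the same row order
#    and swap them into the $variables component of the design.
vars <- dbGetQuery(con, "SELECT id, income, age FROM data08 ORDER BY id")
stopifnot(identical(vars$id, brr.dsgn$variables$id))
brr.dsgn$variables <- vars

est <- svymean(~income, brr.dsgn)
print(est)

dbDisconnect(con)
```

Only the weights and the handful of columns needed for the current analysis are ever in memory at once; repeating step 4 with a different SELECT reuses the same design for another analysis.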

[If you only ever want to use a small subset of the variables, it's even easier: drop all the extraneous variables and create a svrepdesign with just the variables you want.]
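
One way to drop the extraneous variables without ever loading them is to skip them at read time. The sketch below assumes the weights follow the mainwgt / repwgt naming from the original post; the CSV contents and the age, unused1, and unused2 columns are made up so the example runs on its own.

```r
# Write a tiny CSV so the example is self-contained; the columns are made up.
csvfile <- tempfile(fileext = ".csv")
write.csv(
  data.frame(mainwgt = c(10, 20), repwgt1 = c(5, 30), repwgt2 = c(15, 10),
             age = c(34, 61), unused1 = c("a", "b"), unused2 = c(1.5, 2.5)),
  csvfile, row.names = FALSE
)

# Read only the first data row to learn the column names cheaply.
cols <- colnames(read.csv(csvfile, header = TRUE, nrows = 1))

# Keep the weights plus the analysis variables; skip everything else.
keep <- grepl("^repwgt", cols) | cols %in% c("mainwgt", "age")

# In colClasses, "NULL" tells read.csv to skip a column and NA lets it
# guess the type as usual.
classes <- ifelse(keep, NA, "NULL")

small <- read.csv(csvfile, header = TRUE, colClasses = classes)
```

The resulting data frame holds only the kept columns and can be passed to svrepdesign() in the usual way.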

        -thomas

On Fri, 23 Oct 2009, Anthony Damico wrote:

> I'm working with a 350MB CSV file on a server that has 3GB of RAM, yet I'm
> hitting a memory error when I try to store the data frame into a survey
> design object, the R object that stores data for complex sample survey data.
>
> When I launch R, I execute the following line from Windows:
> "C:\Program Files\R\R-2.9.1\bin\Rgui.exe" --max-mem-size=2047M
> Anything higher, and I get an error message saying the maximum has been set
> to 2047M.
>
> Here are the commands:
>> library(survey)
>
> #this step takes more than five minutes
>> data08<-read.csv("data08.csv",header=TRUE,nrows=210437)
>
>> object.size(data08)
> #329877112 bytes
>
> #Looking at Windows Task Manager, Mem Usage for Rgui.exe is already 659,632K
>
>> brr.dsgn <-svrepdesign( data = data08 , repweights = data08[, grep(
> "^repwgt" , colnames( data08)) ], type = "BRR" , combined.weights = TRUE ,
> weights = data08$mainwgt )
> #Error: cannot allocate vector of size 254.5 Mb
>
> #The survey design object does not get created.
>
> #This also causes Windows Task Manager, Mem Usage to spike to 1,748,136K
>
> #And here are some memory diagnostics
>> memory.limit()
> [1] 2047
>> memory.size()
> [1] 1449.06
>> gc()
>           used  (Mb) gc trigger   (Mb)  max used   (Mb)
> Ncells   131148   3.6     593642   15.9  15680924  418.8
> Vcells 45479988 347.0  173526492 1324.0 220358611 1681.3
>
> A description of the survey package can be found here:
> http://faculty.washington.edu/tlumley/survey/
>
> I tried creating a work-around by using the database-backed survey objects
> (DB SO), included in the survey package to conserve memory on larger
> datasets like this one.  Unfortunately, I don't think the survey package
> supports database connections for replicate weight designs yet, since I've
> only been able to get a database connection working after creating a
> svydesign object and not a svrepdesign object - and also because neither the
> DB SO website nor the svrepdesign help page makes any mention of those
> parameters.
>
> The DB SOs are described in detail here:
> http://faculty.washington.edu/tlumley/survey/svy-dbi.html
>
> Any advice would be truly appreciated.
>
> Thanks,
> Anthony Damico

Thomas Lumley			Assoc. Professor, Biostatistics
tlumley at u.washington.edu	University of Washington, Seattle



