[R] handling a lot of data

Paul Bivand paul.bivand at gmail.com
Mon Jan 30 17:54:39 CET 2012


If you do not need all the variables in the SPSS files, use the 'memisc'
package. spss.system.file() and its subset() method allow you to load
just the variables you need.

You will then need to convert the result to a data.frame, because the
memisc data.set retains the SPSS attributes, user-defined missing values, etc.
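
A minimal sketch of that workflow, assuming the memisc package is installed.
The file name and the AGE variable are made up for illustration; DISTRICT is
taken from the original question. As far as I know, the importer object holds
only the variable descriptions, and the data are read when subset() is called.

```r
# Hypothetical path to one SPSS system (.sav) file; replace with a real one.
sav_path <- "data1993_q1.sav"

# Stays NULL unless memisc is installed and the file actually exists.
district_df <- NULL

if (requireNamespace("memisc", quietly = TRUE) && file.exists(sav_path)) {
  library(memisc)

  # Create an importer: describes the variables without loading the data.
  importer <- spss.system.file(sav_path)

  # Load only the selected variables into a memisc data.set.
  # DISTRICT comes from the question; AGE is an assumed second variable.
  ds <- subset(importer, select = c(DISTRICT, AGE))

  # Convert to a plain data.frame for use with the rest of R.
  district_df <- as.data.frame(ds)
}
```

Repeating this per file and keeping only the resulting data.frames should
take far less memory than reading every variable with read.spss().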

Paul Bivand
Centre for Economic and Social Inclusion
London

On 30 January 2012 16:02, R. Michael Weylandt
<michael.weylandt at gmail.com> wrote:
> This won't help with large memory issues, but just a pointer:
>
> When you start to construct data_all with these commands
>
> data_all = vector("list", 17);
> data_all[[1993]] = data1993;
>
> The first pre-allocates a list of length 17, but the second adds the
> data to the 1993rd slot requiring a complete reallocation. Look at
> length(data_all). You'd be better off in general with something like
> this:
>
> data_all <- vector("list", 18)
> names(data_all) <- 1993:2010
> data_all[["1993"]] <- data1993
> etc.
>
> which creates a list of length 18 (1993 to 2010 is 18 years, not 17)
> with components named after the years.
>
> If you want to automate that last bit over each year, this would work:
>
> for (yr in 1993:2010) {
>    data_all[[as.character(yr)]] <- get(paste("data", yr, sep = ""))
> }
>
> It's also been pointed out to me that the Oarray package allows one to
> start indexing at an arbitrary point (e.g., 1993 for the first slot)
> which might be helpful for managing your data_all object.
>
> Michael
>
> On Mon, Jan 30, 2012 at 3:54 AM, Petr Kurtin <kurtin at avast.com> wrote:
>> Hi,
>>
>> I have a lot of SPSS data for the years 1993-2010. I load all the data into
>> lists so I can easily index the values over the years. Unfortunately, the
>> loaded data occupy quite a lot of memory (10 GB), so my question is: what's
>> the best approach to working with big data files? Can R get a value from a
>> data file without loading the whole file into memory? How can a slower
>> computer without enough memory work with such data?
>>
>> I use the following commands:
>>
>> data1993 = vector("list", 4);
>> data1993[[1]] = read.spss(...)  # first trimester
>> data1993[[2]] = read.spss(...)  # second trimester
>> ...
>> data_all = vector("list", 17);
>> data_all[[1993]] = data1993;
>> ...
>>
>> and indexing, e.g.: data_all[[1993]][[1]]$DISTRICT, etc.
>>
>> Thanks,
>> Petr Kurtin
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.


