[R] Importing random subsets of a data file

David Winsemius dwinsemius at comcast.net
Wed Jul 23 18:55:26 CEST 2014


I think an external program like awk (or gawk) would be better. You can call it with the R system() function if needed.

http://stackoverflow.com/questions/7514896/select-random-3000-lines-from-a-file-with-awk-codes
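
For what it's worth, here is a minimal sketch of the awk route, assuming a Unix-like shell with awk on the PATH and a file with a single header line. The function names, the file name and the probability are made up for illustration. Note that this keeps each row independently with probability p, so the number of rows returned varies from call to call; it is a simpler scheme than a fixed-size draw.

```r
# Build an awk command that always keeps the header (line 1) and keeps each
# later line independently with probability p. Hypothetical helper.
awk_sample_cmd <- function(path, p, seed = 1L) {
  stopifnot(p > 0, p <= 1)
  sprintf("awk -v p=%s -v seed=%d 'BEGIN { srand(seed) } NR == 1 || rand() < p' %s",
          format(p, scientific = FALSE), as.integer(seed),
          shQuote(path, type = "sh"))
}

# Read only the sampled rows; pipe() runs the command and streams its output,
# so the full file never has to sit in R's memory.
read_awk_sample <- function(path, p, seed = 1L, ...) {
  read.csv(pipe(awk_sample_cmd(path, p, seed)), ...)
}

# Example call ("big.csv" is a placeholder file name):
# dat <- read_awk_sample("big.csv", p = 0.001, seed = 42)
```

Passing a different seed on each call matters here, because awk's srand() with a fixed seed reproduces the same sequence and hence the same rows.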

You might want to sample once and then break the result into sequential subsets rather than calling the external program 10,000 times.
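
A rough sketch of what I mean by sampling once, assuming the subsets are allowed to be disjoint (no row reused across subsets). The helper name and the sizes are only illustrative. With 1 million rows and 10,000 subsets, disjoint subsets can hold at most 100 rows each.

```r
# Draw all row numbers in a single call, then cut them into consecutive
# blocks. Hypothetical helper; rows are not reused across subsets.
split_sample <- function(n_rows, n_subsets, subset_size) {
  total <- n_subsets * subset_size
  stopifnot(total <= n_rows)
  idx <- sample.int(n_rows, total)                  # one draw, without replacement
  block <- rep(seq_len(n_subsets), each = subset_size)
  unname(split(idx, block))                         # list of index vectors
}

set.seed(1)
subsets <- split_sample(n_rows = 1e6, n_subsets = 10000, subset_size = 100)
length(subsets)            # one element per subset
head(subsets[[1]])         # row numbers belonging to the first subset
```

The row numbers can then be handed to whatever does the actual reading, so the file is scanned once rather than once per subset.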

You have not said whether this is to be done with or without replacement, but if the rows have a unique identifier there is always the `duplicated` function.
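
As a small illustration of the `duplicated` idea, on a toy data frame standing in for rows that were drawn with replacement (the column name "id" is an assumption about your data):

```r
# Toy stand-in for a sample drawn with replacement; "id" is assumed to be
# the unique row identifier in the original file.
set.seed(1)
rows <- data.frame(id = sample(1:5, 10, replace = TRUE), x = rnorm(10))

# duplicated() flags every repeat of an id after its first occurrence,
# so negating it keeps exactly one row per id.
unique_rows <- rows[!duplicated(rows$id), ]
```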

-- 
David.

On Jul 23, 2014, at 8:37 AM, Sarah Goslee wrote:

> Hi,
> 
> You can use scan() with the nlines and skip arguments to read in a
> single line from anywhere in a file.
> 
> Sarah
> 
> On Wed, Jul 23, 2014 at 11:33 AM, Khurram Nadeem
> <khurram.nadee at gmail.com> wrote:
>> Hi R folks,
>> 
>> Here is my problem.
>> 
>> *1.* I have a large data file (say, in .csv or .txt format) containing 1
>> million rows and 500 variables (columns).
>> 
>> *2.* My statistical algorithm does not require the entire dataset but just
>> a small random sample from the original 1 million rows.
>> 
>> *3.* This algorithm needs to be applied 10000 times, each time generating a
>> different random sample from the 'big' file as described in (2) above.
>> 
>> Is there a way to 'import' only a (random) subset of rows from the .csv
>> file without importing the entire dataset? A quick search on various R
>> forums suggests that read.table() does not have this functionality.
>> Obviously, I want to avoid importing the whole file because of memory
>> issues. Looking forward to your help.
>> 
>> Thanks,
>> Khurram
> 
> 
> -- 
> Sarah Goslee
> http://www.functionaldiversity.org
> 

David Winsemius
Alameda, CA, USA


