[R] Large Data Set Help

Charles C. Berry cberry at tajo.ucsd.edu
Mon Aug 25 23:29:34 CEST 2008


On Mon, 25 Aug 2008, Roland Rau wrote:

> Hi,
>
> Jason Thibodeau wrote:
>>  I am attempting to perform some simple data manipulation on a large data
>>  set. I have a snippet of the whole data set, and my small snippet is 2GB
>>  in CSV.
>>
>>  Is there a way I can read my csv, select a few columns, and write it to an
>>  output file in real time? This is what I do right now to a small test
>>  file:
>>
>>  data <- read.csv('data.csv', header = FALSE)
>>
>>  data_filter <- data[c(1,3,4)]
>>
>>  write.table(data_filter, file = "filter_data.csv", sep = ",", row.names =
>>  FALSE, col.names = FALSE)
>
> In this case, I think R is not the best tool for the job. I would rather
> suggest using an implementation of the awk language (e.g. gawk).
> I just tried the following on WinXP, piping a zipped file (87MB zipped,
> 1.2GB unzipped) into gawk
> unzip -p myzipfile.zip | gawk '{print $1, $3, $4}' > myfiltereddata.txt

Or

unzip -p myzipfile.zip | cut -d, -f1,3,4 > myfiltereddata.txt

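But beware that both this and Roland's gawk solution (with FS set to ",")
split on every comma, including those inside quoted fields, so they will
return

 	a,c",d

rather than a,d,e for an input line consisting of

 	a,"b,c",d,e,f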
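
A quote-aware filter is one way around this. The following is only a
minimal sketch, assuming a POSIX awk and no line breaks inside quoted
fields; the function name csv_cut is made up for illustration. It walks
each line character by character and treats a comma as a separator only
when it is outside double quotes:

```shell
# csv_cut (hypothetical helper): print fields 1, 3 and 4 of comma-separated
# input on stdin, ignoring commas that sit inside double quotes.
# Assumes no quoted field spans more than one line.
csv_cut() {
  awk '
  {
    split("", f)                  # clear fields left over from the previous line
    n = 0; field = ""; inq = 0
    len = length($0)
    for (i = 1; i <= len; i++) {
      c = substr($0, i, 1)
      if (c == "\"") {
        inq = !inq                # entering or leaving a quoted section
        field = field c           # keep the quote so the output stays valid CSV
      } else if (c == "," && !inq) {
        f[++n] = field            # unquoted comma: the field ends here
        field = ""
      } else {
        field = field c
      }
    }
    f[++n] = field                # last field has no trailing comma
    print f[1] "," f[3] "," f[4]
  }'
}
```

Used as unzip -p myzipfile.zip | csv_cut > myfiltereddata.txt, the
example line above comes out as a,d,e. It will likely be noticeably
slower than cut on a file of this size.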

HTH,

Chuck

> and it took about 90 seconds.
>
> Please note that you might need to specify your delimiter, i.e. the field
> separator (FS) and the output field separator (OFS), in a BEGIN block so
> that they already apply to the first line:
> gawk 'BEGIN {FS=","; OFS=","} {print $1, $3, $4}' data.csv > filter_data.csv
>
> I hope this helps (despite not encouraging the usage of R),
> Roland

Charles C. Berry                            (858) 534-2098
                                             Dept of Family/Preventive Medicine
E mailto:cberry at tajo.ucsd.edu	            UC San Diego
http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901


