[R] Big Data reading subsample csv

Greg Snow 538280 at gmail.com
Thu Aug 16 18:43:20 CEST 2012


The read.csv.sql function in the sqldf package may make this approach
quite simple.
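
A minimal sketch of how that could look, on a toy file rather than the real
30GB one. The column names (obs_id, x, y) and the sample size are made up for
illustration; obs_id stands in for whatever column identifies which rows
belong to the same multinomial observation. The idea is to sample observation
ids, then let SQL pull every row for the sampled ids so each observation
stays complete.

```r
library(sqldf)

# Toy stand-in for the big csv; obs_id, x and y are hypothetical columns.
# Observations have different numbers of rows, as in the original question.
toy <- data.frame(
  obs_id = rep(1:5, times = c(2, 3, 1, 4, 2)),
  x      = seq_len(12),
  y      = rep(c("a", "b", "c"), length.out = 12)
)
csv_path <- tempfile(fileext = ".csv")
write.table(toy, csv_path, sep = ",", quote = FALSE, row.names = FALSE)

# Pass 1: bring only the distinct observation ids into R.
# Inside the SQL statement the csv is referred to as the table "file".
ids <- read.csv.sql(csv_path, sql = "select distinct obs_id from file")$obs_id

# Draw a bootstrap sample of ids (with replacement).
set.seed(1)
picked <- sample(ids, size = 3, replace = TRUE)

# Pass 2: read every row belonging to the sampled ids.
sql <- sprintf("select * from file where obs_id in (%s)",
               paste(unique(picked), collapse = ", "))
sub <- read.csv.sql(csv_path, sql = sql)

# Repeat the rows of any id that was drawn more than once.
boot <- do.call(rbind, lapply(picked, function(id) sub[sub$obs_id == id, ]))
```

Each read.csv.sql call loads the csv into a temporary SQLite database before
running the query, so with a file this size it may be cheaper to build the
database once and query it repeatedly, as Jim suggests below.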

On Thu, Aug 16, 2012 at 10:12 AM, jim holtman <jholtman at gmail.com> wrote:
> Why not put this into a database? Then you can easily extract the
> records you want by specifying the record numbers.  You pay the one-time
> expense of creating the database, but then have much faster access to
> the data as you make subsequent runs.
>
> On Thu, Aug 16, 2012 at 9:44 AM, Tudor Medallion
> <tudormedallion at googlemail.com> wrote:
>> Hello,
>>
>> I'm most grateful for your time to read this.
>>
>> I have an uber-size 30GB file of 6 million records and 3000 (mostly
>> categorical) columns in csv format. I want to bootstrap subsamples for
>> multinomial regression, but it's proving difficult: even with 64GB of RAM
>> in my machine and a swap file twice that size, the process becomes super
>> slow and halts.
>>
>> I'm thinking about generating subsample indices in R and feeding them into
>> a system command using sed or awk, but I don't know how to do this. If
>> someone knows of a clean way to do this using just R commands, I would be
>> really grateful.
>>
>> One problem is that I need to pick complete observations for each subsample,
>> that is, all the rows of a particular multinomial observation, and the
>> number of rows differs from observation to observation. I plan to
>> use glmnet and then some fancy transforms to get an approximation to the
>> multinomial case. One other point is that I don't know how to choose a sample
>> size that fits within memory limits.
>>
>> Appreciate your thoughts greatly.
>>
>>
>>> R.version
>>
>> platform       x86_64-pc-linux-gnu
>> arch           x86_64
>> os             linux-gnu
>> system         x86_64, linux-gnu
>> status
>> major          2
>> minor          15.1
>> year           2012
>> month          06
>> day            22
>> svn rev        59600
>> language       R
>> version.string R version 2.15.1 (2012-06-22)
>> nickname       Roasted Marshmallows
>>
>>
>> tags: read.csv(), system(), awk, sed, sample(), glmnet, multinomial, MASS.
>>
>> Yoda
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Jim Holtman
> Data Munger Guru
>
> What is the problem that you are trying to solve?
> Tell me what you want to do, not how you want to do it.



-- 
Gregory (Greg) L. Snow Ph.D.
538280 at gmail.com