[R] Efficient way to subset rows in R for dataset with 10^7 columns

Jeff Newmiller jdnewmil using dcn.davis.ca.us
Sat Apr 14 04:07:17 CEST 2018


Oh, there are ways, but the constraining issue here is moving data (memory bandwidth), and data.table is probably already the fastest mechanism for doing that. If you have a computer with four or more real cores, you can try subsetting a different block of columns in each task and cbind-ing the results afterward, but that will be hard to accomplish without making extra copies of the data. You are probably already using virtual memory, which is swapped to and from hard disk storage as needed.
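
To make the column-block idea concrete, here is a rough, untested-at-scale sketch. It assumes a Unix-like machine (mclapply() forks, so it falls back to one core on Windows), and it uses a small made-up table and a sample() call as stand-ins for the real df and the caret::createDataPartition() result. The names nWorkers, chunks and parts are only for illustration. Whether this beats a plain df[trainIndex] on a 100 by 10^7 table is something you would have to measure, since every block is copied back from the worker and then copied again by cbind.

```r
library(data.table)
library(parallel)

# Small stand-in for the 100 x 10^7 table (hypothetical data).
set.seed(1)
df <- as.data.table(matrix(rnorm(100 * 1000), nrow = 100))
trainIndex <- sort(sample(100, 90))  # stand-in for createDataPartition()

# One contiguous block of column positions per worker.
nWorkers <- 4L
chunks <- splitIndices(ncol(df), nWorkers)

# Each task subsets the rows of its own block of columns.
# Forking is unavailable on Windows, so use a single core there.
cores <- if (.Platform$OS.type == "windows") 1L else nWorkers
parts <- mclapply(
  chunks,
  function(cols) df[trainIndex, cols, with = FALSE],
  mc.cores = cores
)

# Reassemble the blocks in their original column order.
# Note that this cbind makes yet another copy of the data.
outerTrain <- do.call(cbind, parts)
```

The same pattern with df[-trainIndex, cols, with = FALSE] would give outerTest.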

Working in Spark with a distributed file system like Hadoop might solve some of these problems... but I haven't done real work with such tools.

On April 13, 2018 6:31:32 PM PDT, Jack Arnestad <jackarnestad using gmail.com> wrote:
>Yes unfortunately. The goal of the "outer" is to do feature selection
>before fitting it to a model.
>
>Is there a way it could be parallelized?
>
>Thanks!
>
>On Fri, Apr 13, 2018 at 9:08 PM, Jeff Newmiller
><jdnewmil using dcn.davis.ca.us>
>wrote:
>
>> You have 10^7 columns? That process is bound to be slow.
>>
>> On April 13, 2018 5:31:32 PM PDT, Jack Arnestad
><jackarnestad using gmail.com>
>> wrote:
>> >I have a data.table with dimensions 100 by 10^7.
>> >
>> >When I do
>> >
>> >    trainIndex <-
>> >      caret::createDataPartition(
>> >        df$status,
>> >        p = .9,
>> >        list = FALSE,
>> >        times = 1
>> >      )
>> >    outerTrain <- df[trainIndex]
>> >    outerTest  <- df[-trainIndex]
>> >
>> >Subsetting the rows of df takes over 20 minutes.
>> >
>> >What is the best way to efficiently subset this?
>> >
>> >Thanks!
>> >
>>

-- 
Sent from my phone. Please excuse my brevity.



