[R] Memory Utilization on R

R. Michael Weylandt michael.weylandt at gmail.com
Tue Mar 27 18:40:28 CEST 2012


It's not really good etiquette to thread-jack, but generally, the
more you can tell read.table up front (particularly through the
colClasses, nrows, as.is, and stringsAsFactors arguments), the faster
it will read, since it can skip checks it would otherwise have to make.
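
As a rough illustration of the advice above, here is a minimal sketch
that writes a small made-up file and reads it back with the column
types and row count declared up front. The file, the column names, and
the sizes are hypothetical stand-ins (the real training.txt is not
shown in the thread), and comment.char = "" is an extra documented
read.table option that the post itself does not mention.

```r
# Hypothetical toy file standing in for the real training data.
tmp <- tempfile(fileext = ".txt")
set.seed(1)
toy <- data.frame(
  id    = 1:1000,
  x     = rnorm(1000),
  label = sample(c("a", "b"), 1000, replace = TRUE)
)
write.table(toy, tmp, row.names = FALSE, quote = FALSE)

train <- read.table(
  tmp,
  header           = TRUE,
  colClasses       = c("integer", "numeric", "character"), # no type guessing
  nrows            = 1000,   # known (or mildly over-estimated) row count
  comment.char     = "",     # assumption: the file contains no comments
  stringsAsFactors = FALSE   # keep any strings as character, not factor
)
str(train)
```

On a file the size discussed here, the column classes could be worked
out first by reading only a few hundred rows and inspecting the result.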

Michael

On Tue, Mar 27, 2012 at 12:07 PM, Alekseiy Beloshitskiy
<abeloshitskiy at velti.com> wrote:
> Guys, let me add my 5 coins into your interesting discussion.
>
> I have a ~10 GB text file with training data for my model. It has about 150 million rows of 12 variables.
> When I load it into memory (running just one line!):
>
> train<-read.table(file="/training.txt")
>
> While loading, it takes ~28 GB of RAM (and about 2 hours to finish), and once the data are loaded, rsession takes ~14 GB.
> I can't even imagine how much it will take when I run svm training on this data set. Is there any optimization to decrease the time required to load the data into memory?
> I use an x64 box with 32 GB of RAM.
>
> Thank you,
> -Alex
>
> ________________________________________
> From: r-help-bounces at r-project.org [r-help-bounces at r-project.org] on behalf of Kurinji Pandiyan [kurinji.pandiyan at gmail.com]
> Sent: 27 March 2012 18:14
> To: R. Michael Weylandt
> Cc: r-help at r-project.org
> Subject: Re: [R] Memory Utilization on R
>
> Thank you for the modified script! I have now tried on different datasets
> and it works very well and is dramatically faster than my original script!
>
> I really appreciate the help.
> Kurinji
>
> On Fri, Mar 23, 2012 at 1:33 PM, R. Michael Weylandt <
> michael.weylandt at gmail.com> wrote:
>
>> Taking a look at your script, there are some potential optimizations
>> you can make:
>>
>>  # Fine
>> poi <- as.character(top.GSM396290) #5000 characters
>> x.data <- h1[,c(1,7:9)] # 485577 obs of 4 variables
>>
>> # Pre-allocate the space
>> x <- vector("list", 485577) # x <- list()
>>
>> # Do the "a" stuff once outside the loop so you aren't doing it
>> # 485577 times
>> a <- strsplit(as.character(x.data[, "UCSC_REFGENE_NAME"]), ";")
>>
>> # Let's use an apply statement instead of a for loop
>> # vapply is the fastest since we prespecify the return type.
>> x.data[vapply(a, function(x) any(poi %in% x), logical(1)), ]
>>
>> I think this will do what you wanted (and hopefully much faster)
>>
>> Note that you could probably tune this further but I think this
>> strikes a good balance between clarity and performance (for now)
>>
>> Hope this helps,
>>
>> Michael
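
The vectorised filter quoted above can be tried on a small scale. The
following is only a sketch with made-up data: h1 and top.GSM396290 are
not shown in the thread, so the data frame, the id column, and the gene
names below are hypothetical stand-ins.

```r
# Made-up stand-ins for the objects used in the thread.
x.data <- data.frame(
  id                = c("cg01", "cg02", "cg03", "cg04"),
  UCSC_REFGENE_NAME = c("TP53;MDM2", "BRCA1", "", "EGFR;TP53"),
  stringsAsFactors  = FALSE
)
poi <- c("TP53", "KRAS")

# Split every gene string once, outside any loop.
genes <- strsplit(x.data[, "UCSC_REFGENE_NAME"], ";")

# One logical per row: does the row mention any gene of interest?
keep <- vapply(genes, function(g) any(poi %in% g), logical(1))

# A single logical subset replaces the loop plus do.call(rbind, ...).
hits <- x.data[keep, ]
print(hits)
```

The original loop builds up to 485577 one-row data frames and then
binds them together, whereas the logical index selects all matching
rows in one subsetting step.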
>>
>> On Fri, Mar 23, 2012 at 11:52 AM, Kurinji Pandiyan
>> <kurinji.pandiyan at gmail.com> wrote:
>> >
>> > Thank you for the input.
>> >
>> > As it were, I realized that my script is utilizing a lot more memory
>> > than I claimed - it was initially using 3 GB but has gone up to 20.24
>> > active but 29.63 assigned to the R session.
>> >
>> > The script has run overnight and now I don't think it is active
>> > anymore since I keep getting the error message that I am out of
>> > startup disk space for application memory.
>> >
>> > I am attaching screen shots of my RAM usage distribution (given
>> > that there is no fluctuation in the usage by the R session I believe
>> > it is not running anymore) and of my available HD.
>> >
>> > Here is my script -
>> >
>> > poi <- as.character(top.GSM396290) #5000 characters
>> > x.data <- h1[,c(1,7:9)] # 485577 obs of 4 variables
>> > head(x.data)
>> >
>> > x <- list()
>> >
>> > for(i in 1:485577){
>> >  a <- as.character(x.data[i, "UCSC_REFGENE_NAME"])
>> >  a <- unlist(strsplit(a, ";"))
>> >  if(any(poi %in% a) == TRUE) {x[[i]] <- x.data[i,]}
>> >   }
>> >
>> >  # this step completed in a few hours
>> >
>> > # this step has been running overnight and is still stuck
>> > x <- do.call(rbind, x)
>> >
>> > Thanks, I really appreciate the help.
>> > Kurinji
>> >
>> > On Thu, Mar 22, 2012 at 10:44 PM, R. Michael Weylandt
>> > <michael.weylandt at gmail.com> wrote:
>> >>
>> >> Well... what makes you think you are hitting memory constraints then?
>> >> If you have significantly less than 3GB of data, it shouldn't surprise
>> >> you if R never needs more than 3GB of memory.
>> >>
>> >> You could just be running your scripts inefficiently...it's an extreme
>> >> example, but all the memory and gigaflopping in the world can't speed
>> >> this up (by much):
>> >>
>> >> for(i in seq_len(1e6)) Sys.sleep(10)
>> >>
>> >> Perhaps you should look into profiling tools or parallel
>> >> computation...if you can post a representative example of your
>> >> scripts, we might be able to give performance pointers.
>> >>
>> >> Michael
>> >>
>> >> On Fri, Mar 23, 2012 at 1:33 AM, Kurinji Pandiyan
>> >> <kurinji.pandiyan at gmail.com> wrote:
>> >> > Yes, I am.
>> >> >
>> >> > Thank you,
>> >> > Kurinji
>> >> >
>> >> > On Mar 22, 2012, at 10:27 PM, "R. Michael Weylandt"
>> >> > <michael.weylandt at gmail.com> wrote:
>> >> >
>> >> >> Use 64bit R?
>> >> >>
>> >> >> Michael
>> >> >>
>> >> >> On Thu, Mar 22, 2012 at 5:22 PM, Kurinji Pandiyan
>> >> >> <kurinji.pandiyan at gmail.com> wrote:
>> >> >>> Hello,
>> >> >>>
>> >> >>> I have a 32 GB RAM Mac Pro with a 2*2.4 GHz quad core processor
>> >> >>> and 2TB storage. Despite this having so much memory, I am not
>> >> >>> able to get R to utilize much more than 3 GBs. Some of my scripts
>> >> >>> take hours to run but I would think they would be much faster if
>> >> >>> more memory is utilized. How do I optimize the memory usage on R
>> >> >>> by my Mac Pro?
>> >> >>>
>> >> >>> Thank you!
>> >> >>> Kurinji
>> >> >>>
>> >> >>>        [[alternative HTML version deleted]]
>> >> >>>
>> >> >>> ______________________________________________
>> >> >>> R-help at r-project.org mailing list
>> >> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>> PLEASE do read the posting guide
>> >> >>> http://www.R-project.org/posting-guide.html
>> >> >>> and provide commented, minimal, self-contained, reproducible code.
>> >
>> >
>>
>


