[R] Memory usage in R grows considerably while calculating word frequencies

Rainer M Krug r.m.krug at gmail.com
Wed Sep 26 09:58:50 CEST 2012


On 25/09/12 01:29, mcelis wrote:
> I am working with some large text files (up to 16 GBytes).  I am interested in extracting the 
> words and counting each time each word appears in the text. I have written a very simple R 
> program by following some suggestions and examples I found online.

Just an idea (I have no experience with what you want to do, so it might not work):

What about putting the text in a database (SQLite comes to mind) where each word is one entry?
Then you could use SQL to query the database, which should need much less memory.

In addition, it should make further processing much easier.
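
A minimal sketch of this idea, assuming the DBI and RSQLite packages are installed. It has not been tried on files of this size, and the helper names load_words and word_frequencies, the table name words, and the chunk size are made up for illustration. The file is read in chunks so that only a few thousand lines are held in memory at a time, and the counting is left to SQLite:

```r
# Sketch only: assumes the DBI and RSQLite packages are installed.
library(DBI)

# Hypothetical helper: stream a text file into an SQLite table,
# one row per word occurrence, reading chunk_lines lines at a time.
load_words <- function(con, path, chunk_lines = 10000L) {
  dbExecute(con, "CREATE TABLE IF NOT EXISTS words (word TEXT)")
  input <- file(path, open = "r")
  on.exit(close(input))
  repeat {
    lines <- readLines(input, n = chunk_lines, warn = FALSE)
    if (length(lines) == 0L) break
    # Same word pattern as in the quoted program
    found <- unlist(regmatches(lines, gregexpr("\\b[A-Za-z]+\\b", lines)))
    if (length(found) > 0L) {
      dbWriteTable(con, "words",
                   data.frame(word = tolower(found), stringsAsFactors = FALSE),
                   append = TRUE)
    }
  }
  invisible(NULL)
}

# Hypothetical helper: let SQLite count and sort, most frequent word first.
word_frequencies <- function(con) {
  dbGetQuery(con, paste(
    "SELECT word, COUNT(*) AS freq FROM words",
    "GROUP BY word ORDER BY freq DESC, word"))
}
```

With a connection from dbConnect(RSQLite::SQLite(), "words.sqlite"), one would call load_words(con, "text_file") and then word_frequencies(con), and write the resulting data frame out with write.table(). Note that the table holds one row per word occurrence, so the database file on disk can end up larger than the input text.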

Cheers,

Rainer

> 
> If my input file is 1 GByte, I see that R uses up to 11 GBytes of memory when executing the 
> program on a 64-bit system running CentOS 6.3. Why is R using so much memory? Is there a
> better way to do this that will minimize memory usage?
> 
> I am very new to R, so I would appreciate some tips on how to improve my program or a better 
> way to do it.
> 
> R program:
> 
> # Read in the entire file and convert all words in text to lower case
> words.txt<-tolower(scan("text_file","character",sep="\n"))
> 
> # Extract words
> pattern <- "(\\b[A-Za-z]+\\b)"
> match <- gregexpr(pattern,words.txt)
> words.txt <- regmatches(words.txt,match)
> 
> # Create a vector from the list of words
> words.txt<-unlist(words.txt)
> 
> # Calculate word frequencies
> words.txt<-table(words.txt,dnn="words")
> 
> # Sort by frequency, not alphabetically
> words.txt<-sort(words.txt,decreasing=TRUE)
> 
> # Put into some readable form, "Name of word" and "Number of times it occurs"
> words.txt<-paste(names(words.txt),words.txt,sep="\t")
> 
> # Results to a file
> cat("Word\tFREQ",words.txt,file="frequencies",sep="\n")