[R] loop for a large database

Petr Savicky savicky at cs.cas.cz
Mon Feb 27 09:01:38 CET 2012


On Sun, Feb 26, 2012 at 11:39:01AM -0800, mari681 wrote:
> SORRY!
> 
> The data in MyTable are tagsets of photos,  like this:
> 
>       V1         V2       V3      V4      V5       V6        V7   V8
> 230    green nailpolish   barrym       0       0        0         0    0
> 231       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
> 232    green       saul  lecture       0       0        0         0    0
> 233    green     colors    cores  market colores marakesh   mercado malu
> 234       ny      green brooklyn cleanup   clean  gowanus volunteer  gcc
> 235    green       saul  lecture       0       0        0         0    0
> 236 portrait        pet    white   green     cat    canon    square  eos
> 
>                          V9   V10  V11      V12 V13 V14 V15
> 230                       0     0    0        0   0   0   0
> 231 gowanuscanalconservancy     0    0        0   0   0   0
> 232                       0     0    0        0   0   0   0
> 233               malugreen maroc souk marrocos   0   0   0
> 234 gowanuscanalconservancy     0    0        0   0   0   0
> 235                       0     0    0        0   0   0   0
> 236                      is  eyes mark   taiwan  ii mk2  5d
> 
> 
> while data of MyVector is a list of tags (none of the columns in particular)
> whose frequency in MyTable has to be computed. Like this:
> 
> [1] "life"  "wood"  "pink"  "house" "green" "fall" 

Hi.

Just to be sure: in all the previous solutions, "malugreen" is not counted
as an occurrence of "green". Is this correct?
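
To make the distinction concrete, here is a minimal sketch on a made-up
vector (the name "tags" and its contents are only an illustration, loosely
modelled on the rows quoted above, not the real data). It compares exact
matching with substring matching:

```r
# Hypothetical toy data; "0" stands for an empty tag slot as in the quoted table
tags <- c("green", "malugreen", "saul", "green", "0")

# Exact matching: only whole tags equal to "green" are counted
exact_count <- sum(tags == "green")

# Substring matching: "malugreen" is counted as well
substr_count <- sum(grepl("green", tags, fixed = TRUE))

exact_count    # 2
substr_count   # 3
```

If the exact count is the intended one, comparisons with == or %in% are
enough and no regular expressions are needed.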

> MyTable has 21 millions rows and 15 columns, and the data is "character",
> they are words.

Do you use the argument stringsAsFactors=FALSE when reading the data
from a file? Otherwise, character columns are converted to factors.
The discussed solutions work in both cases. However, if we prepare
simplified data for testing efficiency, we should use the same column
class as in the real situation.
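
As a small illustration of the difference, the following sketch reads the
same two made-up lines twice, once with each setting. The text in "txt"
and the object names are assumptions for the example only; a real call
would pass a file name instead of the text= argument.

```r
# Hypothetical two-row sample in the spirit of the quoted table
txt <- "green nailpolish barrym
ny green brooklyn"

# Columns stay character
as_char <- read.table(text = txt, stringsAsFactors = FALSE)

# Columns become factors
as_fact <- read.table(text = txt, stringsAsFactors = TRUE)

# Inspect the class of every column in both data frames
sapply(as_char, class)
sapply(as_fact, class)

# An exact comparison gives the same count for both column classes
sum(as_char$V1 == "green")
sum(as_fact$V1 == "green")
```

Calling sapply(MyTable, class) on the real data shows which of the two
cases applies.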

> When I tried the loop my computer crashed in the meaning that it freezed
> (froze?) and didn't allow me to do anything. The morning after I forced it
> off and rebooted.

This does not seem to be a consequence of a computation that simply takes
too long. A possible cause is that the memory requirements are too large.
How much memory does the R process use after loading the data? Try the
gc() command after loading the data and compare the result with the amount
of memory available. On a Linux machine, the memory usage can also be seen
with the "top" command, in the row where R is reported.
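
A minimal sketch of such a check follows. The matrix "x" is only a small
hypothetical stand-in for the real table, and the names "mem" and "used_mb"
are introduced here for illustration.

```r
# Hypothetical stand-in: 100000 rows and 15 columns of character data
x <- matrix("green", nrow = 1e5, ncol = 15)

# Size of this single object
print(object.size(x), units = "Mb")

# Memory used by the whole R session; gc() returns a matrix with
# rows Ncells and Vcells, whose second column is the used memory in Mb
mem <- gc()
mem

# Total memory currently in use, in Mb
used_mb <- sum(mem[, 2])
used_mb
```

If the number reported for the real data is close to the physical memory
of the machine, any solution that makes further copies of the table is
likely to force the system into swapping.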

Petr.


