[R] Handling huge data of 17GB in R

Ajay Ramaseshan ajay_ramaseshan at hotmail.com
Fri Nov 27 12:03:05 CET 2015


Hello,


I am trying the DBSCAN clustering algorithm on a huge distance matrix (26000 x 26000). I don't have the original data points, just the distance matrix. It comes to 17 GB on disk, and it needs to be loaded into R to use the DBSCAN implementation in the fpc package. So I tried read.csv, but R crashed.


After running for about 10 minutes, the process gets killed with the message 'Killed':


 dist <- read.csv('dist.csv', header = FALSE)
Killed
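
For what it's worth, a rough back-of-the-envelope check of the in-memory size (assuming doubles) suggests the dense matrix alone needs about 5 GB, before any parsing overhead:

 # rough in-memory size of a 26000 x 26000 matrix of doubles
 26000 * 26000 * 8 / 1024^3   # ~5 GB

read.csv's intermediate copies during parsing can easily push peak usage to a multiple of that, which would explain the process being killed (presumably by the Linux OOM killer).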

So I checked whether there is any R package that handles big data like this, and came across the bigmemory package. I installed it and ran the command below, but this does not work either; R exits with a bus error.

> dist <- read.big.matrix('dist.csv', sep = ',', header = FALSE)

 *** caught bus error ***
address 0x7fbc4faba000, cause 'non-existent physical address'

Traceback:
 1: .Call("bigmemory_CreateSharedMatrix", PACKAGE = "bigmemory",     row, col, colnames, rownames, typeLength, ini, separated)
 2: CreateSharedMatrix(as.double(nrow), as.double(ncol), as.character(colnames),     as.character(rownames), as.integer(typeVal), as.double(init),     as.logical(separated))
 3: big.matrix(nrow = numRows, ncol = createCols, type = type, dimnames = list(rowNames,     colNames), init = NULL, separated = separated, backingfile = backingfile,     backingpath = backingpath, descriptorfile = descriptorfile,     binarydescriptor = binarydescriptor, shared = TRUE)
 4: read.big.matrix("dist.csv", sep = ",", header = FALSE)
 5: read.big.matrix("dist.csv", sep = ",", header = FALSE)

Possible actions:
1: abort (with core dump, if enabled)
2: normal R exit
3: exit R without saving workspace
4: exit R saving workspace
Selection: 2
Save workspace image? [y/n/c]: n
Warning message:
In read.big.matrix("dist.csv", sep = ",", header = FALSE) :
  Because type was not specified, we chose double based on the first line of data.
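
What I was planning to try next (untested; adapted from the bigmemory documentation, with filenames of my own choosing) is a file-backed matrix, which as I understand it keeps the data on disk instead of in shared memory:

 library(bigmemory)
 # file-backed: the data live in 'dist.bin' on disk, with a small
 # descriptor file, so the full matrix never has to fit in RAM
 dist.bm <- read.big.matrix('dist.csv', sep = ',', header = FALSE,
                            type = 'double',
                            backingfile = 'dist.bin',
                            descriptorfile = 'dist.desc')

Would that avoid the shared-memory allocation that seems to be failing in CreateSharedMatrix above?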


So how do I handle data this large in R for DBSCAN? Or is there a DBSCAN implementation in another programming language that can handle a 17 GB distance matrix?
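
For reference, this is roughly how I intend to call the fpc implementation once the matrix is in memory (the eps and MinPts values below are placeholders, not tuned):

 library(fpc)
 # d is the 26000 x 26000 distance matrix; method = 'dist' tells
 # dbscan() to treat d as distances rather than raw coordinates
 fit <- dbscan(d, eps = 0.5, MinPts = 5, method = "dist")
 table(fit$cluster)   # cluster sizes; cluster 0 is noise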



Regards,

Ajay
