[R] Calculating distance matrix for large dataset

Uwe Ligges ligges at statistik.tu-dortmund.de
Fri May 3 20:18:26 CEST 2013



On 03.05.2013 15:36, David Carlson wrote:
> Here's the result on R 3.0.0 64 bit under Windows 8:
>
>> A<-matrix(1:365000*144,nrow=365000,ncol=144)
>> dim(A)
> [1] 365000    144
>> d <- dist(mydata_nor, method = "euclidean")
> Error in as.matrix(x) : object 'mydata_nor' not found
>> d <- dist(A, method = "euclidean")
> Error: cannot allocate vector of size 496.3 Gb
> In addition: Warning messages:
> 1: In dist(A, method = "euclidean") :
>    Reached total allocation of 8078Mb: see help(memory.size)
> 2: In dist(A, method = "euclidean") :
>    Reached total allocation of 8078Mb: see help(memory.size)
> 3: In dist(A, method = "euclidean") :
>    Reached total allocation of 8078Mb: see help(memory.size)
> 4: In dist(A, method = "euclidean") :
>    Reached total allocation of 8078Mb: see help(memory.size)
>
> Your message suggests that your system could not accurately compute the
> requirements. Unless you have access to a computer with 500 gigabytes of
> memory, you need to consider alternative approaches, such as aggregating the
> data into longer time blocks or using kmeans.


To show where that figure comes from: you need to calculate
365000 * (365000 - 1) / 2 = 66612317500 distances at 8 bytes each, hence
you need 66612317500 * 8 = 532898540000 bytes
= 532898540000 / 1024^3 GB ~= 496.3 Gb to store them in memory.
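
The same arithmetic as a small R sketch, in case you want to check other
sizes. The helper name `dist_size` is made up for this example. The comparison
with `.Machine$integer.max` is only there to show that the number of distances
is beyond the 32-bit integer range, which was the limit on vector length
before long vectors arrived in R 3.0.0.

```r
# Hypothetical helper: number of pairwise distances for n rows and the
# memory needed to store them as 8-byte doubles.
dist_size <- function(n) {
  n_dist <- n * (n - 1) / 2      # n is a double here, so no integer overflow
  bytes  <- n_dist * 8
  list(n_dist = n_dist, bytes = bytes, gib = bytes / 1024^3)
}

s <- dist_size(365000)
format(s$n_dist, big.mark = ",")   # "66,612,317,500"
round(s$gib, 1)                    # 496.3

# Far more elements than a 32-bit integer can count:
s$n_dist > .Machine$integer.max    # TRUE
```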

Best,
Uwe Ligges




>
> -------------------------------------
> David L Carlson
> Associate Professor of Anthropology
> Texas A&M University
> College Station, TX 77840-4352
>
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On
> Behalf Of HJ YAN
> Sent: Thursday, May 2, 2013 6:02 PM
> To: r-help at r-project.org
> Subject: [R] Calculating distance matrix for large dataset
>
> Dear R users
>
>
> I wondered if any of you have ever tried to calculate a distance matrix for
> a very large data set, and if anyone can confirm that the error message I got
> actually means that my data is too large for this task:
>
> negative length vectors are not allowed
>
>
> My data size and code used
>
>> dim(mydata_nor)
> [1] 365000    144
>> d <- dist(mydata_nor, method = "euclidean")
>
>
>
> My data has 1000 samples, each with one year of daily data observed at
> 10-minute intervals, so the size is (365 * 1000) * 144.
>
>
> I checked the manual for the function 'dist' but cannot find an upper limit
> on the size allowed. I bet there is one, so any hints are appreciated.
>
>
> I would also be grateful if anyone could suggest another method for
> calculating a distance matrix for a large dataset.
>
>
>
> I appreciate reproducible code should be provided for your advice, so try
> below if needed:
>
>> A<-matrix(1:365000*144,nrow=365000,ncol=144)
>> dim(A)
> [1] 365000    144
>> d1<-dist(A,method="euclidean")
> Error in dist(A, method = "euclidean") :
>    negative length vectors are not allowed
>
>
>
>
> Many thanks in advance!
>
> HJ
>
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


