[R] Any better way of optimizing time for calculating distances in the mentioned scenario??
Stefan Evert
stefanML at collocations.de
Fri Oct 12 14:47:25 CEST 2012
On 12 Oct 2012, at 09:46, Purna chander wrote:
> 4) scenario4:
>> x<-read.table("query.vec")
>> v<-read.table("query.vec2")
>> v<-as.matrix(v)
>> d<-dist(rbind(v,x),method="manhattan")
>> m<-as.matrix(d)
>> m2<-m[1:nrow(v),(nrow(v)+1):nrow(x)]
>> print(m2[1,1:10])
>
> time taken for running the code:
> real 0m0.445s
> user 0m0.401s
> sys 0m0.041s
> 1) Though scenario 4 is optimal, it failed when matrix 'v' had
> more rows. An error occurred while converting the distance
> object 'd' to a matrix 'm'.
> E.g.: > m<-as.matrix(d)
> The above command resulted in the error: "Error: cannot allocate
> vector of size 922.7 MB".
That's because you're calculating a full distance matrix with (10000+100) * (10000+100) entries and then extracting the much smaller number of distance values (10000 * 100) that you actually need.
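To make the point concrete, here is a minimal base-R sketch that computes only the cross-distances and never builds the full square matrix. The data are random stand-ins for the poster's files, and cross_manhattan() is a hypothetical helper written for this illustration, not part of any package.

```r
# Hypothetical stand-ins for the poster's data: 100 query rows, 10000 target rows
set.seed(42)
v <- matrix(runif(100 * 20), nrow = 100)
x <- matrix(runif(10000 * 20), nrow = 10000)

# Manhattan distance from every row of a to every row of b.
# The result is accumulated one column (dimension) at a time, so the
# largest object ever held is nrow(a) x nrow(b), not the square matrix
# over all rows that dist(rbind(a, b)) would produce.
cross_manhattan <- function(a, b) {
  stopifnot(ncol(a) == ncol(b))
  d <- matrix(0, nrow(a), nrow(b))
  for (j in seq_len(ncol(a))) {
    d <- d + abs(outer(a[, j], b[, j], "-"))
  }
  d
}

m2 <- cross_manhattan(v, x)
print(dim(m2))
```

The result already has one row per row of v and one column per row of x, so no sub-indexing is needed afterwards. (With the dist(rbind(v,x)) approach, the matching column range would be (nrow(v)+1):(nrow(v)+nrow(x)).)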
I have a use case with similar requirements, so ...
> 3) Any other ideas for optimizing the problem I'm facing?
... my experimental "wordspace" package includes a function dist.matrix() for calculating such cross-distance matrices. The function is written in C and doesn't handle NAs and NaNs properly, but it's considerably faster than the current implementation of dist().
I haven't uploaded the package to CRAN yet, but you should be able to install it with
install.packages("wordspace", repos="http://R-Forge.R-project.org")
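Once the package is installed, a call along the following lines should give the cross-distance matrix directly. This is only a sketch: it assumes the dist.matrix(M, M2, method = ...) interface found in later wordspace releases, which may differ from the experimental version discussed here, and it uses random stand-in data with hypothetical row names.

```r
# Random stand-ins for the poster's data (hypothetical sizes and names)
set.seed(1)
v <- matrix(runif(100 * 20), nrow = 100,
            dimnames = list(paste0("q", 1:100), NULL))
x <- matrix(runif(10000 * 20), nrow = 10000,
            dimnames = list(paste0("t", 1:10000), NULL))

m2 <- NULL
# Only run if wordspace is available; the argument names below are an
# assumption based on later releases of the package.
if (requireNamespace("wordspace", quietly = TRUE)) {
  m2 <- wordspace::dist.matrix(v, x, method = "manhattan")
  print(dim(m2))
}
```

If the package is not available, m2 simply stays NULL and nothing is computed.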
Best,
Stefan
PS: Glad to see that daily builds on R-Forge work again -- that's an extremely useful feature to get beta testers for experimental package versions. :-)