RES: [R] AGREP

Fri Feb 13 02:45:33 CET 2004

"Marcos Sanches" <marcos.sanches at ipsos-opinion.com.br> wrote:
	Ls1<-length(s1)
	Ls2<-length(s2) 
	for ( p in 1:ls1){
	   for (q in 1:ls2){
	     t1<-levenshteinFast(s1[p],s2[q])
...

	Ls1=42000
	Ls2=70000

	I think I will wait for months untill this program ends. Do you have any
	sugestion to increase the speed?

The first suggestion has to be "search HARD in the on-line literature;
as others are bound to have had similar problems."

My second suggestion is to consider sorting.
The cost of distance(x,y) is proportional to |x|*|y|.
Now suppose y and z have a common prefix of length n.
Then the computation of distance(x,z) can be done in |x|*(|z|-n) time
by reusing the first n+1 columns of the matrix developed for y.
The simplest way to arrange strings so that strings with a common
prefix are adjacent is to sort them.

Depending on what your strings are like, this might make a big
improvement, or it might be a waste of time,

There are various approximate algorithms out there; the bioinformatics
literature is vast.  But the use of the simple insert=1 delete=1 change=1
cost function is *also* an approximation.  If strings are typed, some
keys are closer than others.  If they are transcribed from speech, some
sounds are more similar than others and some deletions (such as reduced
vowels) are likelier than others.  So don't be afraid of approximations.