[R] Comparison of two very large strings

David Winsemius dwinsemius at comcast.net
Tue Jul 13 01:16:56 CEST 2010


On Jul 12, 2010, at 6:46 PM, David Winsemius wrote:

>
> On Jul 12, 2010, at 6:03 PM, harsh yadav wrote:
>
>> Hi,
>>
>> I have a function in R that compares two very large strings for  
>> about 1
>> million records.
>>
>> The strings are very large URLs like:-
>>
>>
>> http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 
>> .
>> ..
>>
>> or of larger lengths.
>>
>> The data-frame looks like:-
>>
>> id url
>> 1
>> http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 
>> .
>> ..
>> 2   http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse
>> 3
>> http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N 
>> .
>> ..
>> 4
>> http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N
>> 5
>> http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt
>>
>> and so on for about 1 million records.
>>
>> Here is the function that I am using to compare the two strings:-
>>
>> stringCompare <- function(currentURL, currentId){
>> j <- currentId - 1
>> while(j>=1)
>> previousURL <-
>> previousURLLength <- nchar(previousURL)
>> #Compare smaller with bigger
>> if(nchar(currentURL) <= previousURLLength){
>> matchPhrase <- substr(previousURL,1,nchar(currentURL))
>> if(matchPhrase == currentURL){
>> return(TRUE)
>> }
>> }else{
>> matchPhrase <- substr(currentURL,1,previousURLLength)
>> if(matchPhrase == previousURL){
>> return(TRUE)
>> }
>> }
>> j <- j -1
>> }
>> return(FALSE)
>> }
>
> Couldn't you just store the "url" vector after running  through  
> nchar and then do the comparison in a vectorized manner?
>
> test <- rd.txt('id url
> 1 "http://query.nytimes.com/gst/sitesearch_selector.html?query=US+Visa+Laws&type=nyt&x=25&y=8 
> "
> 2 "http://query.nytimes.com/search/sitesearch?query=US+Visa+Laws&srchst=cse 
> "
> 3 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N 
> "
> 4 "http://www.google.com/search?hl=en&q=us+student+visa+changes+9/11+washington+post&start=10&sa=N 
> "
> 5 "http://www.google.com/url?sa=U&start=11&q=http://app1.chinadaily.com.cn/star/2004/0610/fo4-1.html&ei=uUKwSe7XN9CCt 
> "', stringsAsFactors=FALSE)
>
> copyUrls <- test[,"url"]
> sizeUrls <- nchar(copyUrls)
> lengU <- length(sizeUrls)
> sizidx <- pmax(sizeUrls[1:(lengU-1)], sizeUrls[2:(lengU)])
> substr(copyUrls[2:lengU], 1, sizidx) == substr(copyUrls[1: 
> (lengU-1)], 1, sizidx)
>
> #[1] FALSE FALSE  TRUE FALSE
>

Let me hasten to admit that when I tried to fix what I thought was an  
error in that program, I go the same result. It seemed as though I  
should have been getting errors by choosing the maximum string  
length.  Changing the pmax to pmin did not alter the results ... to my  
puzzlement ... until I further noticed that urls #3 and #4 were of the  
same length. When I extend the lengths, then only the version using  
pmin works properly.


-- 
David.
>
>> Here, I compare the URL at a given row with all the previous URLs  
>> in the
>> data-frame. I compare the smaller of the two given URls with the  
>> larger one
>> (upto the length of the smaller).
>>
>> When I run the above function for about 1 million records, the  
>> execution
>> becomes really slow, which otherwise is fast if I remove the
>> string comparison step.
>>
>> Any ideas how it can be implemented in a fast and efficient way.
>>
>> Thanks and Regards,
>> Harsh Yadav
>>
>> 	[[alternative HTML version deleted]]
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
> David Winsemius, MD
> West Hartford, CT
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
West Hartford, CT



More information about the R-help mailing list