[R] Comparing/diffing strings

Martin Morgan mtmorgan at fhcrc.org
Tue Aug 24 18:25:01 CEST 2010


On 08/24/2010 07:27 AM, Doran, Harold wrote:
> There is the stringMatch function in the MiscPsycho package. 
> 
>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'no')
> [1] 8
>> stringMatch('Hadley', 'Hadley Wickham', normalize = 'yes')
> [1] 0.4285714
> 
> It uses Levenshtein distance to tell you how much the two strings differ, either normalized or not. The first call above says the first string differs from the second by 8 insertions/deletions/substitutions. The second call normalizes the comparison so that 1 denotes perfect agreement and 0 denotes complete disagreement.
> 
> Examples of an exact match are below.
> 
>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'yes')
> [1] 1
>> stringMatch('Hadley Wickham', 'Hadley Wickham', normalize = 'no')
> [1] 0
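
For readers without MiscPsycho, the numbers above can be reproduced
approximately in base R. The sketch below is not MiscPsycho's code: it uses
utils::adist() for the Levenshtein distance and assumes the normalization is
1 - distance / (length of the longer string), which is consistent with the
0.4285714 shown above (1 - 8/14). The helper name lev_similarity is
hypothetical.

```r
# Levenshtein distance via base R's utils::adist(); not the MiscPsycho implementation.
# lev_similarity() is a hypothetical helper name used only for this sketch.
lev_similarity <- function(a, b, normalize = TRUE) {
  # Number of insertions/deletions/substitutions needed to turn a into b
  d <- drop(utils::adist(a, b))
  if (!normalize) return(d)
  longest <- max(nchar(a), nchar(b))
  if (longest == 0) return(1)   # two empty strings agree perfectly
  # Assumed normalization: 1 = identical, 0 = maximally different
  1 - d / longest
}

lev_similarity("Hadley", "Hadley Wickham", normalize = FALSE)  # 8
lev_similarity("Hadley", "Hadley Wickham")                     # 0.4285714
lev_similarity("Hadley Wickham", "Hadley Wickham")             # 1
```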

You're probably looking for something lighter weight, but the Bioconductor
package Biostrings has pairwiseAlignment().

> library(Biostrings)
> pairwiseAlignment("Hadley Wickham", "Hadley Hamwick")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] Hadley W---ick
subject: [1] Hadley Hamwick
score: 29.5102

> pairwiseAlignment("Hadley Hamwick", "Hadley Wickham")
Global PairwiseAlignedFixedSubject (1 of 1)
pattern: [1] Hadley Hamwick
subject: [1] Hadley W---ick
score: 29.5102

> aln <- pairwiseAlignment("Hadley Hamwick", "Haderley Hamwich")
> consensusMatrix(aln)["-",]
 [1] 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
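
If installing Biostrings is more than is wanted, a rough base-R sketch of the
same idea (locating where two strings differ) is possible with utils::adist()
called with counts = TRUE. This is only an illustration under that
assumption, not a replacement for pairwiseAlignment(); the variable names are
made up for the example.

```r
a <- "Hadley Hamwick"
b <- "Haderley Hamwich"

# counts = TRUE attaches the edit operations behind the distance as attributes
d <- utils::adist(a, b, counts = TRUE)

# Total edit distance, with the extra attributes stripped
dist <- as.vector(d)
dist  # 3

# One letter per alignment column:
# M = match, I = insertion, D = deletion, S = substitution
trafo <- attr(d, "trafos")[1, 1]
trafo

# Alignment columns that are not matches, i.e. where the strings differ
diff_cols <- which(strsplit(trafo, "")[[1]] != "M")
diff_cols
```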

Martin

> 
> -----Original Message-----
> From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Hadley Wickham
> Sent: Tuesday, August 24, 2010 10:17 AM
> To: R-help
> Subject: [R] Comparing/diffing strings
> 
> Hi all,
> 
> all.equal is generally very useful when you want to find the
> differences between two objects.  It breaks down, however, when you
> have two long strings to compare:
> 
>> all.equal(a, b)
> [1] "1 string mismatch"
> 
> Does anyone know of any good text diffing tools implemented in R?
> 
> Thanks,
> 
> Hadley
> 


-- 
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793


