Similar String algorithm


Question

I'm looking for an algorithm, or at least theory of operation on how you would find similar text in two or more different strings...

Much like the question posed here: Algorithm to find articles with similar text, the difference being that my text strings will only ever be a handful of words.

Like say I have a string: "Into the clear blue sky" and I'm doing a compare with the following two strings: "The color is sky blue" and "In the blue clear sky"

I'm looking for an algorithm that can be used to match the text in the two, and decide on how close they match. In my case, spelling, and punctuation are going to be important. I don't want them to affect the ability to discover the real text. In the above example, if the color reference is stored as "'sky-blue'", I want it to still be able to match. However, the 3rd string listed should be a BETTER match over the second, etc.

I'm sure places like Google probably use something similar with the "Did you mean:" feature...

* EDIT *
In talking with a friend, he worked with a guy who wrote a paper on this topic. I thought I might share it with everyone reading this, as there are some really good methods and processes described in it...

Here's the link to his paper, I hope it is helpful to those reading this question, and on the topic of similar string algorithms.

1
20
5/23/2017 12:17:51 PM

Accepted Answer

I can't mark two answers here, so I'm going to answer and mark my own. The Levenshtein distance appears to be the correct method in most cases for this. But, it is worth mentioning j_random_hackers answer as well. I have used an implementation of LZMA to test his theory, and it proves to be a sound solution. In my original question I was looking for a method for short strings (2 to 200 chars), where the Levenshtein Distance algorithm will work. But, not mentioned in the question was the need to compare two (larger) strings (in this case, text files of moderate size) and to perform a quick check to see how similar the two are. I believe that this compression technique will work well but I have yet to study it to find at which point one becomes better than the other, in terms of the size of the sample data and the speed/cost of the operation in question. I think a lot of the answers given to this question are valuable, and worth mentioning, for anyone looking to solve a similar string ordeal like I'm doing here. Thank you all for your great answers, and I hope they can be used to serve others well too.

2
5/23/2017 12:10:02 PM

Levenshtein distance will not completely work, because you want to allow rearrangements. I think your best bet is going to be to find best rearrangement with levenstein distance as cost for each word.

To find the cost of rearrangement, kinda like the pancake sorting problem. So, you can permute every combination of words (filtering out exact matches), with every combination of other string, trying to minimize a combination of permute distance and Levenshtein distance on each word pair.

edit: Now that I have a second I can post a quick example (all 'best' guesses are on inspection and not actually running the algorithms):

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |    Into the c_lear blue sky 
The color is sky blue        |    is__ the colo_r blue sky

R_dist = dist( 3 1 2 5 4 ) --> 3 1 2 *4 5* --> *2 1 3* 4 5 --> *1 2* 3 4 5 = 3  
L_dist = (2D+S) + (I+D+S) (Total Subsitutions: 2, deletions: 3, insertion: 1)  

(notice all the flips include all elements in the range, and I use ranges where Xi - Xj = +/- 1)

Other example

original strings             | best rearrangement w/ lev distance per word
Into the clear blue sky      |   Into the clear blue sky 
In the blue clear sky        |   In__ the clear blue sky

R_dist = dist( 1 2 4 3 5 ) -->  1 2 *3 4* 5  = 1
L_dist = (2D) (Total Subsitutions: 0, deletions: 2, insertion: 0)

And to show all possible combinations of the three...

The color is sky blue         |    The colo_r is sky blue
In the blue clear sky         |    the c_lear in sky blue

R_dist = dist( 2 4 1 3 5 ) --> *2 3 1 4* 5 --> *1 3 2* 4 5 --> 1 *2 3* 4 5 = 3
L_dist = (D+I+S) + (S) (Total Subsitutions: 2, deletions: 1, insertion: 1)

Anyway you make the cost function the second choice will be lowest cost, which is what you expected!


Licensed under: CC-BY-SA with attribution
Not affiliated with: Stack Overflow
Icon