# Forum

General discussion on using the Kognitio Analytical Platform.

### Edit Distance - Kognitio Feature of the Month

Contributor
Offline
Posts: 39
Joined: Mon Jan 06, 2014 10:36 am

### Edit Distance - Kognitio Feature of the Month

Some recent work with clients has resulted in Kognitio implementing a function for calculating the Levenstein Edit Distance between 2 strings. This function is designed to give a measure of string similarity by finding the minimum number of basic edits (insertion, deletion or substitution) to convert one string into another.

For more details on the Kognitio implementation on some performance testing check out our July 2014 Feature of the Month article. The algorithm implemented by Kognitio is here. There are also acompanying SQL and kog scripts.

This function is now generally available in Kognitio release 8.1.0rel140516b. Below is a walk through of a simple example of the SQL implementation. The function call is a row based function and works for ASCII/ LATIN-1 strings:

Code: Select all

``````SELECT string1
,string2
,edit_distance(string1,string2,[threshold])
FROM word_list``````
Suppose you have a table containing a list of words you wish to compare, each with a unique ID:

Code: Select all

``SELECT * FROM word_list;``
yields the results:
 WORD ID The 1 the 2 of 3 V 4 TV 5 s 6 and 7 de 8 Show 9 a 10
Then we can calculate the edit_distance between these by doing a Cartesian join(cross join):

Code: Select all

``````select a.word word1
, b.word word2
, edit_distance(a.word,b.word) edit_dist_limit_none
, edit_distance(a.word,b.word,1) edit_dist_limit_1
from
sharon.word_list a,
sharon.word_list b
where a.id<b.id
order by a.id,b.id;``````
Full results are given below. There are 2 forms of the function without and with a threshold. The version with a threshold (specified by the optional 3rd parameter in the function) simply makes the function exit early if the specified threshold is exceeded.

A few things to note:
1. Differing Case is considered the same as any other edit. See first result row. One edit is required to change “The” into “the”. Use in conjunction with SQL command UPPER() or LOWER() to negate case.
2. When using the function with a threshold included (1 in the example code) the result returned for a string comparison where the edit distance is greater than the limit is equal to limit + 1, see result row 2. This is useful if you want to obtain results with a certain value of the edit distance. The EDIT_DISTANCE function with the threshold can be applied in the WHERE predicate and computation stops as soon as the limit is exceeded.
3. If the difference in the string lengths is greater than the threshold then this difference is returned from the EDIT_DISTANCE function call, see comparison of “s” and “Show” result. This is because the edit distance is always at least equal to difference in string lengths and this difference is used as the initial value of edit distance in the algorithm.
 WORD1 WORD2 EDIT_DIST_LIMIT_NONE EDIT_DIST_LIMIT_1 The the 1 1 The of 3 2 The V 3 2 The TV 2 2 The s 3 2 The and 3 2 The de 2 2 The Show 3 2 The a 3 2 the of 3 2 the V 3 2 the TV 3 2 the s 3 2 the and 3 2 the de 2 2 the Show 3 2 the a 3 2 of V 2 2 of TV 2 2 of s 2 2 of and 3 2 of de 2 2 of Show 3 2 of a 2 2 V TV 1 1 V s 1 1 V and 3 2 V de 2 2 V Show 4 3 V a 1 1 TV s 2 2 TV and 3 2 TV de 2 2 TV Show 4 2 TV a 2 2 s and 3 2 s de 2 2 s Show 4 3 s a 1 1 and de 3 2 and Show 4 2 and a 2 2 de Show 4 2 de a 2 2 Show a 4 3
The algorithm implemented by Kognitio was developed by Berghel and Roach and can be found here.

### Who is online

Users browsing this forum: No registered users and 1 guest