
skkirkham
 Contributor
 Offline
 Posts: 39
 Joined: Mon Jan 06, 2014 10:36 am
Edit Distance  Kognitio Feature of the Month
Some recent work with clients has resulted in Kognitio implementing a function for calculating the Levenstein Edit Distance between 2 strings. This function is designed to give a measure of string similarity by finding the minimum number of basic edits (insertion, deletion or substitution) to convert one string into another.
For more details on the Kognitio implementation on some performance testing check out our July 2014 Feature of the Month article. The algorithm implemented by Kognitio is here. There are also acompanying SQL and kog scripts.
This function is now generally available in Kognitio release 8.1.0rel140516b. Below is a walk through of a simple example of the SQL implementation. The function call is a row based function and works for ASCII/ LATIN1 strings:
Suppose you have a table containing a list of words you wish to compare, each with a unique ID:
yields the results:
Then we can calculate the edit_distance between these by doing a Cartesian join(cross join):
Full results are given below. There are 2 forms of the function without and with a threshold. The version with a threshold (specified by the optional 3rd parameter in the function) simply makes the function exit early if the specified threshold is exceeded.
A few things to note:
The algorithm implemented by Kognitio was developed by Berghel and Roach and can be found here.
For more details on the Kognitio implementation on some performance testing check out our July 2014 Feature of the Month article. The algorithm implemented by Kognitio is here. There are also acompanying SQL and kog scripts.
This function is now generally available in Kognitio release 8.1.0rel140516b. Below is a walk through of a simple example of the SQL implementation. The function call is a row based function and works for ASCII/ LATIN1 strings:
Code: Select all
SELECT string1
,string2
,edit_distance(string1,string2,[threshold])
FROM word_list
Code: Select all
SELECT * FROM word_list;
WORD  ID 
The  1 
the  2 
of  3 
V  4 
TV  5 
s  6 
and  7 
de  8 
Show  9 
a  10 
Code: Select all
select a.word word1
, b.word word2
, edit_distance(a.word,b.word) edit_dist_limit_none
, edit_distance(a.word,b.word,1) edit_dist_limit_1
from
sharon.word_list a,
sharon.word_list b
where a.id<b.id
order by a.id,b.id;
A few things to note:
 Differing Case is considered the same as any other edit. See first result row. One edit is required to change “The” into “the”. Use in conjunction with SQL command UPPER() or LOWER() to negate case.
 When using the function with a threshold included (1 in the example code) the result returned for a string comparison where the edit distance is greater than the limit is equal to limit + 1, see result row 2. This is useful if you want to obtain results with a certain value of the edit distance. The EDIT_DISTANCE function with the threshold can be applied in the WHERE predicate and computation stops as soon as the limit is exceeded.
 If the difference in the string lengths is greater than the threshold then this difference is returned from the EDIT_DISTANCE function call, see comparison of “s” and “Show” result. This is because the edit distance is always at least equal to difference in string lengths and this difference is used as the initial value of edit distance in the algorithm.
WORD1  WORD2  EDIT_DIST_LIMIT_NONE  EDIT_DIST_LIMIT_1 
The  the  1  1 
The  of  3  2 
The  V  3  2 
The  TV  2  2 
The  s  3  2 
The  and  3  2 
The  de  2  2 
The  Show  3  2 
The  a  3  2 
the  of  3  2 
the  V  3  2 
the  TV  3  2 
the  s  3  2 
the  and  3  2 
the  de  2  2 
the  Show  3  2 
the  a  3  2 
of  V  2  2 
of  TV  2  2 
of  s  2  2 
of  and  3  2 
of  de  2  2 
of  Show  3  2 
of  a  2  2 
V  TV  1  1 
V  s  1  1 
V  and  3  2 
V  de  2  2 
V  Show  4  3 
V  a  1  1 
TV  s  2  2 
TV  and  3  2 
TV  de  2  2 
TV  Show  4  2 
TV  a  2  2 
s  and  3  2 
s  de  2  2 
s  Show  4  3 
s  a  1  1 
and  de  3  2 
and  Show  4  2 
and  a  2  2 
de  Show  4  2 
de  a  2  2 
Show  a  4  3 
Who is online
Users browsing this forum: No registered users and 1 guest