Forum

General discussion on using the Kognitio Analytical Platform.
Contributor
Offline
Posts: 38
Joined: Mon Jan 06, 2014 10:36 am

Edit Distance - Kognitio Feature of the Month

by skkirkham » Thu Jun 26, 2014 4:35 pm

Some recent work with clients has resulted in Kognitio implementing a function for calculating the Levenstein Edit Distance between 2 strings. This function is designed to give a measure of string similarity by finding the minimum number of basic edits (insertion, deletion or substitution) to convert one string into another.

For more details on the Kognitio implementation on some performance testing check out our July 2014 Feature of the Month article. The algorithm implemented by Kognitio is here. There are also acompanying SQL and kog scripts.

This function is now generally available in Kognitio release 8.1.0rel140516b. Below is a walk through of a simple example of the SQL implementation. The function call is a row based function and works for ASCII/ LATIN-1 strings:

Code: Select all

SELECT string1
,string2
,edit_distance(string1,string2,[threshold])
FROM word_list
Suppose you have a table containing a list of words you wish to compare, each with a unique ID:

Code: Select all

SELECT * FROM word_list;
yields the results:
WORDID
The1
the2
of3
V4
TV5
s6
and7
de8
Show9
a10
Then we can calculate the edit_distance between these by doing a Cartesian join(cross join):

Code: Select all

select a.word word1
, b.word word2
, edit_distance(a.word,b.word) edit_dist_limit_none
, edit_distance(a.word,b.word,1) edit_dist_limit_1
from
sharon.word_list a,
sharon.word_list b
where a.id<b.id
order by a.id,b.id;
Full results are given below. There are 2 forms of the function without and with a threshold. The version with a threshold (specified by the optional 3rd parameter in the function) simply makes the function exit early if the specified threshold is exceeded.

A few things to note:
  1. Differing Case is considered the same as any other edit. See first result row. One edit is required to change “The” into “the”. Use in conjunction with SQL command UPPER() or LOWER() to negate case.
  2. When using the function with a threshold included (1 in the example code) the result returned for a string comparison where the edit distance is greater than the limit is equal to limit + 1, see result row 2. This is useful if you want to obtain results with a certain value of the edit distance. The EDIT_DISTANCE function with the threshold can be applied in the WHERE predicate and computation stops as soon as the limit is exceeded.
  3. If the difference in the string lengths is greater than the threshold then this difference is returned from the EDIT_DISTANCE function call, see comparison of “s” and “Show” result. This is because the edit distance is always at least equal to difference in string lengths and this difference is used as the initial value of edit distance in the algorithm.
WORD1WORD2EDIT_DIST_LIMIT_NONEEDIT_DIST_LIMIT_1
Thethe11
Theof32
TheV32
TheTV22
Thes32
Theand32
Thede22
TheShow32
Thea32
theof32
theV32
theTV32
thes32
theand32
thede22
theShow32
thea32
ofV22
ofTV22
ofs22
ofand32
ofde22
ofShow32
ofa22
VTV11
Vs11
Vand32
Vde22
VShow43
Va11
TVs22
TVand32
TVde22
TVShow42
TVa22
sand32
sde22
sShow43
sa11
andde32
andShow42
anda22
deShow42
dea22
Showa43
The algorithm implemented by Kognitio was developed by Berghel and Roach and can be found here.
Reply with quote Top

Who is online

Users browsing this forum: No registered users and 2 guests

cron