Implementation of Tao-Deng-Stonebraker approximate string joins with abbreviations for PostgreSQL.
Approximate string joins are implemented as a C-language extension for PostgreSQL.
The extension consists of three components, each of which introduces a user-defined function. Their definitions are given below.
The extension follows a common pipeline for PostgreSQL C-language extensions.
Before installation, make sure you have PostgreSQL server installed. Cluster state (running or stopped) does not matter.
-
Build
- Open command line. Change working directory to where extension files (this git repository) reside;
- Run
make
. You may have to installlibpq-dev
if it was not installed with PostgreSQL; - Run
make install
. You need certain privileges (probablyroot
) for this to work, as PostgreSQL binary files are modified.
-
Install
- Login to PostgreSQL as superuser (
postgres
); - Run
CREATE EXTENSION "mipt-asj";
.
- Login to PostgreSQL as superuser (
To perform JOIN
with approximate comparison of strings, you need a set of rules (full form
<=> abbreviation
).
-- Generate abbreviation rules
CREATE TABLE rules(f VARCHAR, a VARCHAR);
INSERT INTO rules(f, a) (
SELECT f, a FROM mipt_asj.calc_dict(
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'f',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'a'
)
);
The produced table has a simple form: every tuple is a pair (f, a)
. You may freely add, delete or modify the rules. Alternatively, instead of generating rules, you may supply your own ones.
The following code conducts approximate string JOIN:
-- Select the strings to join (may return false-positive results)
CREATE TABLE to_join(s1 VARCHAR, s2 VARCHAR);
INSERT INTO to_join(s1, s2) (
SELECT s1, s2 FROM mipt_asj.calc_pairs(
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'c1',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'MY_TABLE'), 'c2',
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'rules'), 'f', 'a',
0.7
)
);
-- JOIN
SELECT DISTINCT t1.s1, t1.s2
FROM to_join AS t1 INNER JOIN to_join AS t2 ON mipt_asj.cmp(
t1.s1,
t1.s2,
(SELECT oid FROM pg_catalog.pg_class WHERE relname = 'rules'), 'f', 'a',
0.7
);
The extension interface is a few user-defined functions. All functions are placed in schema mipt_asj
.
mipt_asj.calc_dict(full_OID, full_column, abbr_OID, abbr_column)
.
Calculates abbreviation rules.
- Call parameters:
full_OID
. Full forms table OIDfull_column
. Full forms table column nameabbr_OID
. Abbreviations table OIDabbr_column
. Abbreviations table column name
abbr_OID
may be equal to full_OID
.
- Returns: table. Each tuple is an abbreviation rule. Fields:
f
. Full form of abbreviationa
. Abbreviation
The rules are calculated using a trie.
mipt_asj.calc_pairs(1_OID, 1_column, 2_OID, 2_column, rules_OID, rules_full_column, rules_abbr_column, exactness)
.
Selects equal (in terms of Tao-Deng-Stonebraker (pkduck
) metric) string pairs. Some pairs may be falsely considered equal; use cmp
to filter out such pairs.
- Call parameters:
1_OID
. 1st string set table OID1_column
. 1st string set table column name2_OID
. 2nd string set table OID2_column
. 2nd string set table column namerules_OID
. Rules table OIDrules_full_column
. Rule's full forms column namerules_abbr_column
. Rule's abbreviation column nameexactness
. Exactness parameter, a real number in range [0; 1]
Try to set exactness
equal to 0.7
if you are not sure about its value.
1_OID
may be equal to 2_OID
.
- Returns: table. Each tuple is a pair of equal (in terms of the metric mentined above) strings. Fields:
s1
. String from table1_OID
, column1_column
s2
. String from table2_OID
, column2_column
mipt_asj.cmp(string_1, string_2, rules_OID, rules_full_column, rules_abbr_column, exactness)
.
Compares two strings using Tao-Deng-Stonebraker (pkduck
) metric.
- Call parameters:
string_1
. 1st string to comparestring_2
. 2nd string to comparerules_OID
. Rules table OIDrules_full_column
. Rule's full forms column namerules_abbr_column
. Rule's abbreviation column nameexactness
. Exactness parameter, a real number in range [0; 1]
Try to set exactness
equal to 0.7
if you are not sure about its value.
If calc_pairs
was used with some exactness
value, the same should be set as exactness
in call to this function.
- Returns: boolean.
Feel free to open an issue on GitHub!
This is alpha version in development. Both extension API and internals may change. Use cautiously!
The code is available under MIT license.