eulerto/pg_similarity

Needleman-Wunsch crashes the Postgres server

shapiro2 opened this issue · 14 comments

I have a database with a table of 3460 organizations. I found that the Needleman-Wunsch operator and function crash my server.

Here is the query that crashes the server:

select o1.organization_id, o1.organization_name,
       o2.organization_id, o2.organization_name,
        needlemanwunsch(o1.organization_name, o2.organization_name)
 from organizations o1
   left join organizations o2 on o2.organization_id != o1.organization_id 
                             and o1.organization_name ~#~ o2.organization_name
where o1.organization_id in( 4088, 3935, 3700, 3652)

[SELECT - 0 rows, 1.138 secs]  [Code: 0, SQL State: 08006]  An I/O error occurred while sending to the backend.

Here is a query using gram that works fine:

select o1.organization_id, o1.organization_name,
       o2.organization_id, o2.organization_name,
        qgram(o1.organization_name, o2.organization_name)
 from organizations o1
   left join organizations o2 on o2.organization_id != o1.organization_id 
                             and o1.organization_name ~~~ o2.organization_name
where o1.organization_id in( 4088, 3935, 3700, 3652)

organization_id organization_name               organization_id organization_name                qgram              
--------------- ------------------------------- --------------- -------------------------------- ------------------ 
3652            U. of Arizona                   (null)          (null)                           (null)             
3700            Rensselaer Polytechnic Insitute 330             Rensselaer Polytechnic Institute 0.9253731343283582 
3935            Shiv Nadar University           (null)          (null)                           (null)             
4088            University of Pittsbugh         586             University of Pittsburgh         0.9019607843137255 
4088            University of Pittsbugh         2185            University of Pisa               0.7111111111111111 

The server info is:

PostgresSQL 9.3.14 on x86_64-unknonwn-linux-gnu, compiled by gcc (Debian 4.9.2.-10) 4.9.2, 64-bit

Could you get the stack trace? Take a look at [1] for instructions. Basically, it is:

ulimit -c unlimited
pg_ctl start -D /path/to/pgdata
psql -c "my query here" mydb
[... crash ...]
gdb /path/to/postgres /path/to/pgdata/core
(gdb) bt

[1] https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD

I probably won't be able to do this anytime soon.

I could provide you the 3400+ organization_names, if you are able to set up a test for yourself and see if you can replicate the problem.

Here is a csv of the organizations

org.txt

Please send me the file, I will test over the weekend. What is your OS?

%uname -a

Linux mshapiro 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt11-1+deb8u3 (2105-08-04) x86_64 GNU/Linux

To speed-up, can you also share the import code and table creation?
I would like to make sure you used the one of the LATIN* encoding. Can you see in your db entries like 2910,"University of León", with the proper accent?

Anyway, even with the wrong encoding, the query shouldn't have failed.

I can see the accent.

CREATE DATABASE mshapiro
  WITH OWNER = mshapiro
       ENCODING = 'UTF8'
       TABLESPACE = pg_default
       LC_COLLATE = 'en_US.UTF-8'
       LC_CTYPE = 'en_US.UTF-8'
       CONNECTION LIMIT = -1;
CREATE TABLE organizations
(
  organization_id integer not null,
  organization_name character varying(300) NOT NULL,
  CONSTRAINT pk_organizations PRIMARY KEY (organization_id)
  CONSTRAINT org_name_uk UNIQUE (organization_name)
)
WITH (
  OIDS=TRUE
);

@shapiro2 thanks for the test case. I reproduced the bug here. I'll fix it later.

Great. I think I may have hit a similar bug with one or two other measures (same query), but I didn't keep track so I can't be sure which ones. I'll run the query with other measures and let you know if I hit it or not. Might be a few days before I can get back to you on it. I know lev and qgram worked OK.

I ran the same query for all but the hamming and soundex operators.
They all worked except needlemanwunsch (reported above).

But NOTE that the ~?? operator (jacard) makes this operator difficult to use in perl since the perl DBI uses the ? as a placeholder and wants to substitute a bound value for the ?.

@shapiro2 Take a look at [1] to solve your placeholder problem.

[1] http://blog.endpoint.com/2015/01/dbdpg-escaping-placeholders-with.html

Thanks. I saw that. In general, though, I have run into the ? issue with other interfaces, too, not just Perl.
For example, in one tool I have to use ~????. Also, for our system, we don't have the laster version of DBD::Pg installed and we have so many scripts using DBD::Pg in production that upgrading the version of DBD::Pg for this is not something that our admins will do. I personally don't think I will use this operator, but it might be a good idea to offer an alternative (keep ~??, but also have one like ~--)

@shapiro2 There are so many operators and few options to consider. Someone already suggested to deprecate this operator for the same reason. I'll consider it in the next release. You can always replace this operator with another one that has a different name (drop the old one and create another -- look at pg_similarity--1.0.sql).

@shapiro2 Thanks for your report. Fixed in be1a8b0.