datafold/data-diff

Warning during using hashdiff for VARCHAR type values in POSTGRES.

EnCeT opened this issue · 2 comments

EnCeT commented

Despite using a supported data type, while using hashdiff, I receive the warning '[PostgreSQL] Column 'email' of type 'Text(_notes=[])' has no compatibility handling.' I couldn't find information about this in the documentation, only that VARCHAR is supported for PostgreSQL.
documentation

data-diff "postgresql://postgres:Password1@localhost:5432/postgres" test2 "postgresql://postgres:Password1@localhost:5432/postgres" test -k id -c email -v -a hashdiff -d

01:20:09 INFO     [PostgreSQL] Starting a threadpool, size=1.                                                      base.py:1196
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SET TIME ZONE 'UTC'
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM
                  information_schema.columns WHERE table_name = 'test2' AND table_schema = 'public'
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT column_name, data_type, datetime_precision, numeric_precision, numeric_scale FROM
                  information_schema.columns WHERE table_name = 'test' AND table_schema = 'public'
         INFO     [PostgreSQL] Schema = {'id': ('id', 'integer', None, 32, 0), 'first_name': ('first_name',        schema.py:12
                  'character varying', None, None, None), 'second_name': ('second_name', 'character varying',
                  None, None, None), 'last_name': ('last_name', 'character varying', None, None, None), 'emailx':
                  ('emailx', 'character varying', None, None, None), 'email': ('email', 'character varying', None,
                  None, None), 'nulle1': ('nulle1', 'character varying', None, None, None), 'nulle2': ('nulle2',
                  'character varying', None, None, None)}
         INFO     [PostgreSQL] Schema = {'id': ('id', 'integer', None, 32, 0), 'first_name': ('first_name',        schema.py:12
                  'character varying', None, None, None), 'second_name': ('second_name', 'character varying',
                  None, None, None), 'last_name': ('last_name', 'character varying', None, None, None), 'emailx':
                  ('emailx', 'character varying', None, None, None), 'email': ('email', 'character varying', None,
                  None, None), 'nulle1': ('nulle1', 'character varying', None, None, None), 'nulle2': ('nulle2',
                  'character varying', None, None, None)}
         DEBUG    Available mutual columns: {'email', 'id', 'first_name', 'last_name', 'nulle1', 'emailx',      __main__.py:489
                  'nulle2', 'second_name'}
         INFO     Diffing schemas...                                                                             __main__.py:81
         INFO     Diffing using columns: key=('id',) update=None extra=('email',).                              __main__.py:518
         INFO     Using algorithm 'hashdiff'.                                                                   __main__.py:519
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT * FROM (SELECT TRIM("email") FROM "test2") AS LIMITED_SELECT LIMIT 64
         INFO     [PostgreSQL] Schema = {'id': Integer(_notes=[], precision=0, python_type=<class 'int'>),         schema.py:12
                  'email': Text(_notes=[])}
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT * FROM (SELECT TRIM("email") FROM "test") AS LIMITED_SELECT LIMIT 64
         INFO     [PostgreSQL] Schema = {'id': Integer(_notes=[], precision=0, python_type=<class 'int'>),         schema.py:12
                  'email': Text(_notes=[])}
         WARNING  [PostgreSQL] Column 'email' of type 'Text(_notes=[])' has no compatibility handling.   hashdiff_tables.py:133
                  If encoding/formatting differs between databases, it may result in false positives.
         WARNING  [PostgreSQL] Column 'email' of type 'Text(_notes=[])' has no compatibility handling.   hashdiff_tables.py:133
                  If encoding/formatting differs between databases, it may result in false positives.
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT min("id")::varchar, max("id")::varchar FROM "test2"
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT min("id")::varchar, max("id")::varchar FROM "test"
         INFO     Diffing segments at key-range: (1)..(3). size: table1 <= 2, table2 <= 2                    diff_tables.py:294
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT "id"::varchar, "email"::varchar FROM "test2" WHERE ("id" >= 1) AND ("id" < 3)
         DEBUG    Running SQL (PostgreSQL):                                                                         base.py:959
                  SELECT "id"::varchar, "email"::varchar FROM "test" WHERE ("id" >= 1) AND ("id" < 3)
         INFO     Diff found 0 different rows.                                                           hashdiff_tables.py:217
01:20:10 INFO     Duration: 1.38 seconds.                                                                       __main__.py:553

Same result i receive for columns with empty strings and null values:
data-diff "postgresql://postgres:Password1@localhost:5432/postgres" test2 "postgresql://postgres:Password1@localhost:5432/postgres" test -k id -c nulle2 -v -a hashdiff -d
I'm running data-diff version 0.9.17.

My testing tabels in PostgresSQL:

create table test (
	id serial primary key,
	first_name VARCHAR(50) NOT NULL,
	second_name VARCHAR(50) NOT NULL,
	last_name VARCHAR(50)NOT NULL,
	emailx VARCHAR(50),
    email VARCHAR(50),
	Nulle1 VARCHAR(20),
	Nulle2 VARCHAR(20)
);
insert into test (id, first_name, second_name, last_name, emailx, email, Nulle1, Nulle2) 
values
(1,	'Tisha',	'Annalise',	'Casserly',	'acasserly0@i2i.jp*',	'acasserly0@i2i.jp',	null, ''),
(2,	'Joelynn',	'Bette',	'Samuels',	'bsamuels1@networksolutions.com*',	'bsamuels1@networksolutions.com',	' ',	null);

insert into test2 (id, first_name, second_name, last_name, emailx, email, Nulle1, Nulle2) 
values 
(1,	'Tisha',	'Annalise',	'Casserly',	'acasserly0@i2i.jp*',	'acasserly0@i2i.jp',	'',	''),
(2,	'Joelynn',	'Bette',	'Samuels',	'bsamuels1@networksolutions.com*',	'bsamuels1@networksolutions.com',	null,	NULL)

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.