datafold/data-diff

Joindiff on the same table failng - Clickhouse

hamer101 opened this issue · 4 comments

Comparing a table with itself with joindiff in Clickhouse with the command:

data-diff  "clickhouse://clickhouse:Password1@localhost:9000/clickhouse" student student  -k id  -c deathdate  -v

Results in:

10:31:05 INFO     [Clickhouse] Starting a threadpool, size=1.                                                                                                                                                                                                     base.py:1196
         INFO     [Clickhouse] Schema = {'id': ('id', 'UInt32', None, 32, 0), 'first_name': ('first_name', 'String', None, None, None), 'last_name': ('last_name', 'String', None, None, None), 'email': ('email', 'String', None, None, None), 'gender':         schema.py:12
                  ('gender', 'String', None, None, None), 'birthdate': ('birthdate', 'Date', 0, None, None), 'deathdate': ('deathdate', 'Nullable(Date)', 0, None, None)}                                                                                                     
         INFO     [Clickhouse] Schema = {'id': ('id', 'UInt32', None, 32, 0), 'first_name': ('first_name', 'String', None, None, None), 'last_name': ('last_name', 'String', None, None, None), 'email': ('email', 'String', None, None, None), 'gender':         schema.py:12
                  ('gender', 'String', None, None, None), 'birthdate': ('birthdate', 'Date', 0, None, None), 'deathdate': ('deathdate', 'Nullable(Date)', 0, None, None)}                                                                                                     
         INFO     Diffing schemas...                                                                                                                                                                                                                            __main__.py:81
         INFO     Diffing using columns: key=('id',) update=None extra=('deathdate',).                                                                                                                                                                         __main__.py:518
         INFO     Using algorithm 'joindiff'.                                                                                                                                                                                                                  __main__.py:519
         INFO     [Clickhouse] Schema = {'id': Integer(_notes=[], precision=0, python_type=<class 'int'>), 'deathdate': UnknownColType(_notes=[], text='Nullable(Date)')}                                                                                         schema.py:12
         INFO     [Clickhouse] Schema = {'id': Integer(_notes=[], precision=0, python_type=<class 'int'>), 'deathdate': UnknownColType(_notes=[], text='Nullable(Date)')}                                                                                         schema.py:12
         INFO     Validating that the are no duplicate keys in columns: ['id']                                                                                                                                                                          joindiff_tables.py:243
         INFO     Validating that the are no duplicate keys in columns: ['id']                                                                                                                                                                          joindiff_tables.py:243
         INFO     Diffing segments at key-range: (1)..(1001). size: table1 <= 1000, table2 <= 1000                                                                                                                                                          diff_tables.py:294
         INFO     . Diffing segment 1/32, key-range: (1)..(32), size <= None                                                                                                                                                                            joindiff_tables.py:183
         INFO     . Diffing segment 2/32, key-range: (32)..(63), size <= None                                                                                                                                                                           joindiff_tables.py:183
         INFO     . Diffing segment 3/32, key-range: (63)..(94), size <= None                                                                                                                                                                           joindiff_tables.py:183
         INFO     . Diffing segment 4/32, key-range: (94)..(125), size <= None                                                                                                                                                                          joindiff_tables.py:183
         INFO     . Diffing segment 5/32, key-range: (125)..(156), size <= None                                                                                                                                                                         joindiff_tables.py:183
         INFO     . Diffing segment 6/32, key-range: (156)..(187), size <= None                                                                                                                                                                         joindiff_tables.py:183
         INFO     . Diffing segment 7/32, key-range: (187)..(218), size <= None                                                                                                                                                                         joindiff_tables.py:183
         INFO     . Diffing segment 8/32, key-range: (218)..(249), size <= None                                                                                                                                                                         joindiff_tables.py:183
         INFO     . Diffing segment 9/32, key-range: (249)..(280), size <= None                                                                                                                                                                         joindiff_tables.py:183
         INFO     . Diffing segment 10/32, key-range: (280)..(311), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 11/32, key-range: (311)..(342), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 12/32, key-range: (342)..(373), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 13/32, key-range: (373)..(404), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 14/32, key-range: (404)..(435), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 15/32, key-range: (435)..(466), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 16/32, key-range: (466)..(497), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 17/32, key-range: (497)..(528), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 18/32, key-range: (528)..(559), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 19/32, key-range: (559)..(590), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 20/32, key-range: (590)..(621), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 21/32, key-range: (621)..(652), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 22/32, key-range: (652)..(683), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 23/32, key-range: (683)..(714), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 24/32, key-range: (714)..(745), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 25/32, key-range: (745)..(776), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 26/32, key-range: (776)..(807), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 27/32, key-range: (807)..(838), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 28/32, key-range: (838)..(869), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 29/32, key-range: (869)..(900), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 30/32, key-range: (900)..(931), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 31/32, key-range: (931)..(962), size <= None                                                                                                                                                                        joindiff_tables.py:183
         INFO     . Diffing segment 32/32, key-range: (962)..(1001), size <= None
10:31:06 ERROR    Code: 62.                                                                                                                                                                                                                                    __main__.py:348
                  DB::Exception: Syntax error: failed at position 134 ('distinct'): distinct from "tmp2"."id" THEN 1 ELSE 0 END AS "is_diff_id", CASE WHEN "tmp1"."deathdate" is distinct from "tmp2"."deathdate" THEN 1 ELSE 0 END AS "is_diff_deat. Expected                
                  one of: NOT DISTINCT FROM, NULL, NOT NULL, THEN, OR, AND, IS NOT DISTINCT FROM, IS NULL, IS NOT NULL, BETWEEN, NOT BETWEEN, LIKE, ILIKE, NOT LIKE, NOT ILIKE, REGEXP, IN, NOT IN, GLOBAL IN, GLOBAL NOT IN, MOD, DIV. Stack trace:                          
                                                                                                                                                                                                                                                                          
                  0. DB::Exception::Exception(DB::Exception::MessageMasked&&, int, bool) @ 0x000000000cdcf077 in /usr/bin/clickhouse                                                                                                                                          
                  1. DB::Exception::createDeprecated(String const&, int, bool) @ 0x00000000079b6bad in /usr/bin/clickhouse                                                                                                                                                    
                  2. DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, String const&, bool, unsigned long, unsigned long) @ 0x00000000142afedc in /usr/bin/clickhouse                                                                                    
                  3. DB::executeQueryImpl(char const*, char const*, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum, DB::ReadBuffer*) @ 0x0000000012bdd397 in /usr/bin/clickhouse                                                                          
                  4. DB::executeQuery(String const&, std::shared_ptr<DB::Context>, bool, DB::QueryProcessingStage::Enum) @ 0x0000000012bdc65e in /usr/bin/clickhouse                                                                                                          
                  5. DB::TCPHandler::runImpl() @ 0x0000000013aa16a9 in /usr/bin/clickhouse                                                                                                                                                                                    
                  6. DB::TCPHandler::run() @ 0x0000000013ab6059 in /usr/bin/clickhouse                                                                                                                                                                                        
                  7. Poco::Net::TCPServerConnection::start() @ 0x000000001658e334 in /usr/bin/clickhouse                                                                                                                                                                      
                  8. Poco::Net::TCPServerDispatcher::run() @ 0x000000001658f531 in /usr/bin/clickhouse                                                                                                                                                                        
                  9. Poco::PooledThread::run() @ 0x000000001669ac47 in /usr/bin/clickhouse                                                                                                                                                                                    
                  10. Poco::ThreadImpl::runnableEntry(void*) @ 0x0000000016698edc in /usr/bin/clickhouse                                                                                                                                                                      
                  11. ? @ 0x00007ff1583b6609 in ?                                                                                                                                                                                                                             
                  12. ? @ 0x00007ff1582db133 in ?                                                                                                                                                                                                                             

I attach the full trace here.

I'm running data-diff version 0.9.17.

have the same issue, cant diff tables in Clickhouse((
This construction not work in CH:

...
CASE WHEN "tmp1"."id" is distinct from "tmp2"."id" THEN 1 ELSE 0 END         AS "is_diff_id",
...

Is thereis exist way to handle this?

Currently seems just need add to clickhouse dialect (clickhouse.py)

    def is_distinct_from(self, a: str, b: str) -> str:
        return f"(({a} IS NULL) != ({b} IS NULL)) OR ({a}!={b})"

This issue has been marked as stale because it has been open for 60 days with no activity. If you would like the issue to remain open, please comment on the issue and it will be added to the triage queue. Otherwise, it will be closed in 7 days.

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment and it will be reopened for triage.