biolink/kgx

neo_sink.py constraints not created in neo4j v5+

frdougal opened this issue · 0 comments

Describe the bug
Neo4j 5+ changed the syntax for creating uniqueness constraints, which are what create the indices. As a result, the current neo_sink.py code does not create any indices before loading the data, which results in very long load times. You can confirm that the indices are never created by running the SHOW INDEXES; command against the neo4j instance.

To Reproduce
Run through the steps to generate a TSV file containing nodes and edges, then run load_tsv_to_neo4j.py to load the data into a neo4j v5+ instance. The load takes a long time, and each subsequent batch takes longer than the one before it.

Expected behavior
Two things are expected to happen. 1) The load time should be the same as loading data into neo4j v4.x. 2) The neo4j v5+ instance should contain several unique indices for the id property of each node label.

Code snippets
To fix this, change line 287 of neo_sink.py to:
query = f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{category}) REQUIRE n.id IS UNIQUE"

Additional context
You may be able to add a check to the neo_sink.py code to switch the CREATE CONSTRAINT string based on neo4j version by examining the data returned by this query:
call dbms.components() yield name, versions, edition unwind versions as version return name, version, edition;
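
A minimal sketch of such a version check, in case it helps. It assumes the version value returned by the query above is a string such as "5.12.0" or "4.4.9". The function names and the neo4j_version parameter are hypothetical, not taken from neo_sink.py. The pre-5 branch uses the ON ... ASSERT form that Neo4j 5 removed, and it should be checked against whatever line 287 currently contains:

```python
def parse_major_version(version: str) -> int:
    """Return the leading integer of a version string such as '5.12.0'."""
    return int(version.split(".")[0])


def create_constraint_query(category: str, neo4j_version: str) -> str:
    """Build a uniqueness-constraint statement suited to the server version.

    Hypothetical helper: `neo4j_version` is assumed to be the `version`
    value returned by the dbms.components() query in this issue.
    """
    if parse_major_version(neo4j_version) >= 5:
        # Syntax proposed in this issue (FOR ... REQUIRE), needed on Neo4j 5+.
        return (
            f"CREATE CONSTRAINT IF NOT EXISTS FOR (n:{category}) "
            "REQUIRE n.id IS UNIQUE"
        )
    # Assumed pre-5 syntax (ON ... ASSERT); verify against the existing code.
    return (
        f"CREATE CONSTRAINT IF NOT EXISTS ON (n:{category}) "
        "ASSERT n.id IS UNIQUE"
    )
```

The version string would only need to be fetched once per connection and could then be passed in wherever the constraint query is built.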