ICIJ/datashare-extension-neo4j

feat: as an ADMIN of `datashare-extension-neo4j`, I should be able to run bulk imports for documents and named entities


Before merging

  • merge #29
  • test the neo4j-admin import CLI on a Docker image

PR description

Fixes #25

This PR adds the ability for admins to run bulk imports (from an empty DB) using the neo4j-admin import CLI. This CLI is optimized for large imports, but it requires the target DB to be empty.

Admin import flow

Export documents and named entities from ES into Neo4j-formatted CSVs (the URL is quoted so that the shell does not interpret the `?`):

curl -X POST "<datashare>/api/neo4j/admin/neo4j-csvs?project=my-project" -d '{"query": {}}' > export.tar.gz

A query can also be provided to reduce the scope of the export:

curl -X POST "<datashare>/api/neo4j/admin/neo4j-csvs?project=my-project" -d '{"query": {"ids": ["doc-0", "doc-1"]}}' > export.tar.gz
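When the list of document ids comes from a script, the request body can be assembled rather than typed by hand. The following is a minimal sketch, not part of this PR: `build_ids_query` is a hypothetical helper, and it assumes the ids contain no characters that need JSON escaping (quotes, backslashes).

```shell
#!/bin/sh
# Hypothetical helper (not part of the extension): print a JSON body of the
# form {"query": {"ids": [...]}} from the ids given as arguments.
# Assumption: ids need no JSON escaping (no quotes or backslashes).
build_ids_query() {
  ids=""
  for id in "$@"; do
    # Separate entries with ", " once the list is non-empty.
    if [ -n "$ids" ]; then
      ids="$ids, "
    fi
    ids="$ids\"$id\""
  done
  printf '{"query": {"ids": [%s]}}\n' "$ids"
}

# Usage (not run here; <datashare> is the same placeholder as above):
#   curl -X POST "<datashare>/api/neo4j/admin/neo4j-csvs?project=my-project" \
#     -d "$(build_ids_query doc-0 doc-1)" > export.tar.gz
```

Called without arguments, the helper produces an empty `ids` list, which is not the same request as the unrestricted `{"query": {}}` body used for a full export.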

Then decompress the archive and do a dry run to check the command that will be executed against the DB:

tar xzvf export.tar.gz
./bulk-import.sh --dry-run

The dry run should print something like:

./bin/neo4j-admin import full \
--skip-bad-relationships \
--database some-specific-db \
--nodes=Document="docs-header.csv,docs.csv" \
--nodes="entities-header.csv,entities.csv" \
--relationships=HAS_PARENT="doc-roots-header.csv,doc-roots.csv" \
--relationships=APPEARS_IN="entity-docs-header.csv,entity-docs.csv"
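Since the import requires an empty target DB, it can help to confirm which database the printed command points at before running it for real. The sketch below is only an illustration under assumptions: `extract_export` and `target_database` are hypothetical helpers, the archive is assumed to contain `bulk-import.sh` at its root (as the steps above suggest), and only the space-separated `--database <name>` form shown in the sample output is recognized.

```shell
#!/bin/sh
# Sketch only: both helpers are hypothetical and not part of the extension.

# Unpack the export into its own directory and check that the import script
# is present at the archive root.
extract_export() {
  archive=$1
  dest=$2
  mkdir -p "$dest" || return 1
  tar xzf "$archive" -C "$dest" || return 1
  if [ ! -f "$dest/bulk-import.sh" ]; then
    echo "bulk-import.sh not found in $archive" >&2
    return 1
  fi
}

# Read a dry-run command on stdin and print the word following --database.
# Prints nothing when the option is absent or written as --database=<name>.
target_database() {
  awk '{ for (i = 1; i < NF; i++) if ($i == "--database") { print $(i + 1); exit } }'
}

# Usage (not run here):
#   extract_export export.tar.gz export
#   (cd export && ./bulk-import.sh --dry-run) | target_database
```

Extracting into a dedicated directory keeps the CSVs and the script together, so the relative file names in the generated command resolve from a single place.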