Curious to know why the bulk loader is designed for new graphs only?
gomesian opened this issue · 3 comments
Wondering why we can't update an 'existing' graph with a bulk load? (e.g. I get a Redis "key already exists" error.)
I assume redisgraph-py is not as efficient as this script, since I assume it doesn't use the GRAPH.BULK endpoint (i.e. sending binary data that gets unpacked server-side).
I love the sparse-matrix design of RedisGraph. I am prototyping something for my postgraduate ML capstone project.
I need to be able to BULK UPDATE an existing graph every few minutes with new streaming data - up to 50k nodes/edges at a time. I'm not sure how redisgraph-py chunks/buffers larger data inserts, but I will test that next.
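Roughly what I plan to test - a minimal, untested sketch assuming the redisgraph-py client, where `Item`, `id`, and `score` are placeholder names for my own schema:

```python
# Untested sketch of chunked inserts over GRAPH.QUERY via redisgraph-py.
# 'Item', 'id' and 'score' are placeholders for my actual schema.
import redis
from redisgraph import Graph

r = redis.Redis(host='localhost', port=6379)
graph = Graph('stream', r)

CHUNK = 1000  # rows per query; to be tuned against query-size limits

def insert_rows(rows):
    """rows: list of dicts like {'id': 1, 'score': 0.5}."""
    for i in range(0, len(rows), CHUNK):
        chunk = rows[i:i + CHUNK]
        # Build a Cypher list literal; real code should escape values.
        literal = ", ".join(
            "{id: %d, score: %f}" % (row["id"], row["score"])
            for row in chunk
        )
        graph.query(
            "UNWIND [%s] AS row "
            "CREATE (:Item {id: row.id, score: row.score})" % literal
        )
```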
Any comment on future plans to add BULK UPDATE to the RedisGraph bulk loader? And do you think redisgraph-py's transport protocol is as efficient for updating an existing graph as bulk binary chunks?
Hi @gomesian,
The bulk loader is only designed for new graphs because, if the current design were naively extended, we could only perform batch updates of entirely new entities. Performing tasks such as adding relationships between pre-existing nodes would not work, and there is a significant risk of unintentional entity duplication.
We're currently considering adding a bulk update utility that would allow for CSVs to be loaded by writing Cypher queries with references to CSV fields. Such a utility's performance would fall between the bulk loader and redisgraph-py update scripts, but would require the graph to be inaccessible for the duration of the bulk update.
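Conceptually, it would do something like the following illustrative sketch (this is not the planned implementation - the file name, label, and columns are invented, and a real utility would batch many rows per query rather than issuing one each):

```python
# Illustrative sketch only: apply a user-written Cypher template to each
# CSV row. 'people.csv', 'Person', 'name' and 'age' are invented examples.
import csv
import redis
from redisgraph import Graph

r = redis.Redis()
graph = Graph('mygraph', r)

with open('people.csv') as f:
    for row in csv.DictReader(f):
        # The user's query references CSV fields; real code would escape
        # string values and batch rows instead of one query per row.
        graph.query(
            "MERGE (p:Person {name: '%s'}) SET p.age = %d"
            % (row['name'], int(row['age']))
        )
```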
Thanks for the info - good to hear you are considering a bulk update utility.
Some sort of DB write-locking feature makes sense then, in case returned edge/node IDs change before the update phase actually finishes (thus mitigating errors, or a broken promise of guaranteed point-in-time updates).
Just for some background:
I want to update node properties and add new edges (a mix of existing and new nodes), ideally in parallel, with nodes created if they don't already exist. Somewhat of a living graph, updated frequently. Maybe RedisGraph is not so suited to this, being better suited to a static/one-time graph.
I previously tried AWS Neptune for this personal project, which has incredible overheads. But I did like the fact that you can set an entity key as an int or string. It appears RedisGraph supports only auto-generated integer IDs (correct me if I am wrong). I like that with Neptune's bulk loader API I can create a node with my own key, then simply batch-push new nodes/edges with keys of my choosing. If a node exists, Neptune simply updates/replaces its property object; if it's new, it is created. I recall there is even an option/method to append/update a subset of properties (instead of replacing all of them). So for updates, I don't need to search for the key first; I simply post the update blindly with the desired/known key. For mass updates of edges (with known key-to-key pairs), it is obviously up to me to ensure the nodes exist, and to handle errors if they don't.
I will close this, but will wait a bit first in case you have comments.
Never mind... I found MERGE in the documentation. This is probably what I can use (it sort of combines MATCH and CREATE).
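An untested sketch of the MERGE-based upsert I have in mind, assuming MERGE clauses chain as in standard Cypher (`Item`, `REL`, and `weight` are placeholders for my own schema):

```python
# Untested sketch: MERGE matches a pattern if it exists, otherwise
# creates it, so nodes appear on demand and edges are not duplicated.
import redis
from redisgraph import Graph

r = redis.Redis()
graph = Graph('stream', r)

def upsert_edge(src_id, dst_id, weight):
    graph.query(
        "MERGE (a:Item {id: %d}) "
        "MERGE (b:Item {id: %d}) "
        "MERGE (a)-[e:REL]->(b) "
        "SET e.weight = %f" % (src_id, dst_id, weight)
    )
```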