neo4j-contrib/neomodel

Feature Request: Multiple related node match in a single query

rmtobin opened this issue · 8 comments

Feature description

It would be amazing if I was able to fetch a Node/NodeSet plus given related Nodes in a single query, ala Django's select_related/prefetch_related. Being able to do this would alleviate painful N+1 query problems that are very easy to run into using Neomodel's query system.

For example, say I want to populate a list of books with their authors and genres. I have Book nodes, off of which I have relationships to Author nodes, and Genre nodes. As it is, without manually writing a Cypher query, I have to first query all the Book nodes, then, for each Book node, fetch each related Author and Genre node separately. This guarantees now that I have to run 2N+1 queries, where N is the number of Book nodes (which in my case could be many hundreds, and I'm actually fetching more relationships than two).

Ideally, it'd be nice if Neomodel did something similar to Django's select_related for querysets, so instead I could do something like Book.nodes.select_related('authors', 'genres').all() and have Neomodel intelligently generate a single query to fetch everything all at once. It would be extra nice if it also had a NodeSet cache (like select_related) where it caches those results, so then subsequent queries like book_node.genres.all() within the same execution frame, where book_node was in the NodeSet returned in my earlier query, would be able to be fulfilled by the cache without another query to the database. This would make using Neomodel in the context of say a GraphQL server much less painful, among other benefits.

Considered alternatives

Generally I've had to resort to hand-writing Cypher and managing my own rudimentary query cache, which defeats a lot of the point of using an ORM/OGM.

How this feature can improve the project?

It would allow users to create more complex, more performant queries, and make it more tenable to use Neomodel for highly-related/high-cardinality, large volume data. I do realize this is a difficult and complex feature, but it would bring it closer to on par to many of the major SQL-based ORMs, which makes it that much easier to justify adopting Neo4j.

@rmtobin Hello and thanks for this proposal.

Let me get this straight:

At the moment, what you would do is:

  1. Retrieve the books with something like: all_books = Book.nodes.all()
  2. Given a book, access some property like: book_authors = all_books[0].authors.all()

Or maybe (continuing from above):

book_data = [ (a_book, a_book.authors.all(), a_book.genres.all()) for a_book in all_books]

Which would "buffer" all related information about a book in separate entries in book_data.

If we agree that this is so, so far (?), then, what you are proposing is automatically triggering the some_book.<some_relationship>.all() for each entity that is fetched so that it is already there when you traverse the relationship to collect the nodes on its other side.

Which is doable (for example, with a feature that specifies "pre-trigger all relationships down to depth k for each entity that is received from the database", plus some optimisation to avoid multiple triggers of relationships that have already traversed)...

BUT, it would not save you the "trouble" of having to visit each entity to collect the data from an entity's relationships (the book_data list comprehension from above).

Can you please confirm if this is what you are saying?

Kind of but not quite. The problem is not that I don't want to write code to fetch relationships - as you showed, it is just a list comprehension to do so, but rather that each loop generates a separate database query to fetch the relationships, which then results in a lot of queries if there's a lot of iterations in the loop (this is the classic N+1 query problem). So instead I'd like to be able to write an expression that 1. Fetches a set of Nodes 2. Also fetches Nodes I specify that are related to those nodes 3. Does so in a single query/call to the database.

Also to clarify, I definitely wouldn't want to automatically fetch all of a node's relationships, but rather be able to specify which ones to fetch.

For another example, taking the expression from my previous comment, Book.nodes.select_related('authors', 'genres').all(), I'd expect the OGM to translate that roughly into something like:

MATCH (b:Book)
OPTIONAL MATCH (a:Author)-[:WROTE]-(b)
OPTIONAL MATCH (g:Genre)-[:HAS_GENRE]-(b)
return b, collect(a) as authors, collect(g) as genres

And now I have Books, their authors, and genres all from one query to the database. This is of course simplified, but hopefully communicates the general idea.

There's some details around where to put the related nodes once they've been fetched that would have to be figured out - I personally like the way Django does it, where say Book.nodes.select_related('authors', 'genres').all() returns a NodeSet of Book nodes (just like if you didn't have the select_related), and the related nodes go into a cache, so then when you do book_node.authors.all() later on it gets fulfilled by the cache and doesn't cause a database call. Doing it that way lets you write code just like your example with the list comprehension, but doesn't cause an N+1 query problem.

@rmtobin

I see.

So we are talking about a "utility" function that automatically discovers the subgraph implied by a set of nodes and writes a single query to return the nodes and relationships within that subgraph only.

And, as it can happen in a data model, two entities might be related in more than one ways, in that case, also specify a relationship to resolve any ambiguities.

To what depth would you expect select_related to work? Here for example, Author and Genre are immediately related to Book. If you specified a node that is "twice removed" from Book, would you still expect the function to work out the "path" by which these two entities are related and construct the query?

tonioo commented

@rmtobin Could it be related to this: #686?

@rmtobin Could it be related to this: #686?

Also have a look here, where we added an example in the documentation, and a disclaimer as to what it does, and does not do so far

@tonioo @mariusconjeaud That is effectively what I want, thanks for linking that.

An object cache would be nice (as is mentioned in that PR), but it's workable without it. There is one problem I do foresee that makes it less useful to me - it looks like currently all and match relationship queries don't return NodeSets, they execute and return lists, so you can't use fetch_relationships with it. For example if you have:

user = User.nodes.get(some_prop='foo')
read_books = user.read.all() <---- This returns a list of Nodes when zero/one or more cardinality, not a NodeSet

read_books = user.read.all().fetch_relationships('author')  <----- Thus, this wouldn't work, but I would expect it to

Any idea when you'll cut a new release with fetch_relationships in it? Will definitely try it out when you do.

Hello @rmtobin : the release is almost ready ! Took it slow because the scope of changes in the core were pretty vast (we are bumping to a 5.x Neo4j driver among other things).

It should be this week hopefully (my goal is tomorrow, maybe Thursday)

@rmtobin : 5.1.0 was released just now !