stefankroes/ancestry

Adding a "/" at the end can reduce the amount of "OR" queries

KaviiSuri opened this issue · 4 comments

It is pretty common in my experience to get all descendants of a record in the tree.
Currently, a descendant of a record can have "ancestry" column of the following forms:

  • "{id}"
  • "{id}/{another_id}/.."
    Because of this behaviour, the query to get descendants of id 3 is
SELECT "{table_name}".* FROM "{table_name}" WHERE ("{table_name}"."ancestry" LIKE '3/%' OR "{table_name}"."ancestry" = '3')

The "OR" in the query seems unnecessary and can be easily removed if we just add "/" at the end of ancestry always, effectively changing the forms to

  • "{id}/"
  • "{id}/{another_id}/.."
    This way, the query is just
SELECT "{table_name}".* FROM "{table_name}" WHERE "{table_name}"."ancestry" LIKE '3/%'

Is there a specific reason ancestry doesn't do this and has decided to not have a "/" at the end of single ancestors?

Outstanding resources

The research materials/presentations for tree structures in a number of programming languages all had the same implementation. And that implementation without the trailing slash looks obviously wrong.

Implementation

I have been pulling the materialized path implementation into a separate class so we could support multiple tree implementations: traditional materialized path, the trailing slash, postgres arrays, ltree, path with the id in it.

Check out: https://github.com/kbrock/ancestry/tree/materialized_path2
I just rebased so it should be in a usable state.

Results

When I ran the numbers for trailing slash it was not any faster.

Issues I see:

  • we need to stop using unicode encoding for the column.
  • We still have way too many OR clauses
  • We do not have good data sets to show if we actually have improved anything.

Future steps

While I totally agree that the current implementation that everyone uses is probably wrong, I would like to actually know which actions/queries use indexes and which table scan.

So I keep getting derailed by wanting a test suite to actually test this stuff out.

@kbrock Were the results using the pg array approach better? In fact, I was looking into this not because of speed, but because having a trailing and maybe a leading slash would make implementing custom scopes a much cleaner and easier task.

@dimvic Yes. I too want a way to determine the root_id, parent_id, ancestor_ids, and child_ancestry values in pure sql without resorting to using a CASE. The current implementation sure leans towards wanting to use ruby to parse.

I want to use a serializer to parse/create the ancestry value. If we do that, then the value of ancestry would be an array - basically the same as ancestor_ids. It would remove a lot of complexity. But I don't know how much this breaks the code that uses ancestry. I know in our code, there are some assumptions that ancestry is a string with a / delimiter.

Very tempted to breaking this functionality in the next major version of ancestry. But I'm hesitant too

Let me know if you end up trying out the strategy of materialized_path2 and if it works alright for you.