berlindb/core

Discussion - JOIN clauses and their place

Closed this issue · 15 comments

One of the more confusing things about BerlinDB is the inability to build queries that use JOINs in any meaningful way. This is mostly due to how the caching works around shape_items.

From how I understand it, BerlinDB's Query class fundamentally does this:

  1. Parse the query
  2. Fetch the IDs from the database
  3. Loop through each of these items, and query for the records. This is also known as "shaping" in BerlinDB.
  4. Cache each record as it is looped, so in the future it can be fetched faster.
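In code, that flow looks roughly like this. This is a simplified, self-contained sketch: the plain array and the `fetch_*` stubs stand in for WordPress's object cache and BerlinDB's real internals, which are not shown here.

```php
<?php
// Simplified sketch of the query/shape/cache flow described above.
// A plain array stands in for wp_cache_get()/wp_cache_set().
$cache = array();

function fetch_ids_from_db() {
	// Stub: pretend the parsed query matched these row IDs.
	return array( 1, 2, 3 );
}

function fetch_row_from_db( $id ) {
	// Stub: pretend this is a single-row SELECT.
	return (object) array( 'id' => $id, 'title' => "Row {$id}" );
}

// Steps 1 & 2: parse the query and fetch the matching IDs.
$ids = fetch_ids_from_db();

// Steps 3 & 4: "shape" each item — check the cache, query per row on a miss, cache it.
$items = array();
foreach ( $ids as $id ) {
	if ( ! isset( $cache[ $id ] ) ) {
		$cache[ $id ] = fetch_row_from_db( $id );
	}
	$items[] = $cache[ $id ];
}
```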

The reasoning is that this makes it possible to cache individual records, and provides a lot of flexibility in retrieving as many records as possible with less effort.

The problem with this methodology is that it makes it impossible to really manipulate the query, which I think is a design decision, and probably a good thing.

But what happens when I want to make a query that JOINs two tables? There really isn't a way to do that in BerlinDB.

It would be really nice if there were some way to utilize the cache functionality baked into BerlinDB in the context of a JOINED row, as well.

I wonder if this could be somehow achieved by creating BerlinDB row classes, and schemas that match what the JOINed resultset would look like.

I think this may give BerlinDB the context necessary to cache these results even though this isn't technically a database table. This would definitely require some changes to how Berlin's Query class works, but I think it might make it possible to query data very much like how Query handles requests now.

It would almost be like we're treating these queries as a view, without actually making a view.

I definitely would love to see joins. I deliberately didn't use BerlinDB in one project because the project required joins everywhere and BerlinDB couldn't do it unless I overwrote a ton of methods. (Use case: a book database, where a single book record touches multiple tables -- at minimum you're always querying books and authors because you always want a book with its author(s).)

Related-ish, but might warrant its own issue:

I'd love to see BerlinDB do "eager loading" like Laravel does. https://laravel.com/docs/7.x/eloquent-relationships#eager-loading

Querying a book and its author(s) looks like this:

$books = App\Book::with('author')->get();

The actual SQL looks like this:

select * from books

select * from authors where id in (1, 2, 3, 4, 5, ...)

But in the end you get a neatly mapped array of objects like this:

[
    {
        "book_id": 1,
        "title": "Book Title",
        "authors": [
            {
                "author_id": 1,
                "name": "Author Name"
            }
        ]
    },
    {
        "book_id": 2,
        "title": "Title of Second Book",
        "authors": [
            {
                "author_id": 2,
                "name": "Author Name"
            }
        ]
    }
]

(You can see this doesn't actually do any joins, hence probably warranting its own issue.)

JJJ commented

Y’all are completely on track so far.

There are 3 main reasons JOINs aren’t currently in Berlin:

  1. A JOIN will change the md5 hash of the MySQL query.
    If you want to query for both books and authors, two separate queries will provide a higher probability of either/both types of objects hitting the cache.
  2. Meta Queries are JOINs, but aren’t always good ones.
    The WP_Meta_Query class (and hence Queries\Meta in Berlin) is the only example of a JOIN query in WordPress (besides Terms, which are hardcoded), but in most Meta-type circumstances a UNION would actually be exponentially more performant. Ideally, Berlin would be intelligent enough to make this decision just-in-time.
  3. There is no way to know what tables are JOINable.
    In order to JOIN between two tables, they need to be on the same database server. For example, in highly scalable horizontal WordPress environments (something like WordPress.org/.com) the users and usermeta database tables always exist in a central database server, and cannot be JOINed against. That means any database query that JOINs Posts with Users will fail.
    Ultimately, this is why even WordPress multisite does not JOIN multiple Sites together to aggregate posts or comments; their tables might not be JOINable. Truthfully, object/meta pairings are really the only safe JOINs in WordPress (post/comment/term/user/blog/site) though post/term/comment usually exist on the same server.
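Point 1 is easy to demonstrate: since the cache key is derived from a hash of the SQL string, a combined JOIN query can never share cache entries with either single-table query. The following is a self-contained illustration, not BerlinDB's actual key-building code.

```php
<?php
// Each distinct SQL string hashes to a distinct cache key.
$books_key   = md5( 'SELECT * FROM books' );
$authors_key = md5( 'SELECT * FROM authors' );

// A JOIN produces a third, unrelated key, so previously cached
// books or authors results can never be reused for it.
$joined_key = md5( 'SELECT * FROM books INNER JOIN authors ON books.author_id = authors.id' );

var_dump( $joined_key !== $books_key && $joined_key !== $authors_key ); // bool(true)
```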

——

The solution (as I’ve imagined it until now) is for Berlin to have a simple way to define a Relationship in PHP between a Column in one table and a Column in another table. These relationships could be one-way or two-way, and could be auto-priming, JOINable, UNIONable, and so on.

A hypothetical Relationship object ultimately would resolve all 3 of the above problems.

  1. They would decide whether two queries are better than one via one-way/two-way
  2. They would run underneath the Meta query implementation, relating row object IDs to meta row object IDs and allowing for “automatic priming and shaping” of rows in the related tables
  3. They would define which tables are reliably JOINable in environments where tables might be on separate database servers

I forget where, but someplace in Berlin a class does reserve a $relationships array and an $aliases array, though they aren’t ever used. I’d hoped to get to this but ran out of time, so I shimmed in the current Meta implementation as it exists today so that it would work well enough to use in EDD3 and SC.

Years ago, I’d started a plugin to help with object-to-object relationships (to replace the aging posts-to-posts plugin) with a many-to-many-to-many table, but I didn’t get to fully finish it.

https://github.com/stuttter/wp-relationships

Some inspiration exists in there for how a Relationship class could exist in Berlin.

JJJ commented

Relatedly, the reason why row IDs are queried for and then primed individually relates to caching, sharding, and application reliability.

It’s possible to configure WordPress cache to use separate servers for any individual cache group/bucket. For example, queries for many posts could be on a separate device from queries for single posts.

Next, some database plugins introduce sharding to large database tables. It’s a mechanism to break tables apart and relate them together once they hit X number of rows. For example, wp_posts_1 and wp_posts_2 each having 10 million rows in them. Sharding is a low-level API that should happen inside of $wpdb without Berlin needing to worry about it. One of these tables could be locked, causing a single post to be unretrievable instead of all of them.

Next, individual posts may have already been cached by a previous and different posts query, so there is no reason to query the database for some of them again. Because each query is hashed for uniqueness, but each post ID is cached by its ID, we can loop through each one, checking the cache, and priming the cache on a missing ID. If Berlin just shaped all of the found rows every time, that would result in chatter to either the cache or the database server that wouldn’t ultimately be necessary.
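To make that priming behaviour concrete, here is a self-contained sketch (plain arrays stand in for the object cache and the database) showing that only IDs missing from the cache generate database traffic:

```php
<?php
// ID 2 was already cached by an earlier, different query.
$cache   = array( 2 => (object) array( 'id' => 2 ) );
$db_hits = 0;

function get_row( $id, &$cache, &$db_hits ) {
	if ( ! isset( $cache[ $id ] ) ) {
		$db_hits++; // Cache miss: one small single-row query, primed for next time.
		$cache[ $id ] = (object) array( 'id' => $id );
	}
	return $cache[ $id ];
}

// Shape three found IDs; only the two misses touch the "database".
$items = array();
foreach ( array( 1, 2, 3 ) as $id ) {
	$items[] = get_row( $id, $cache, $db_hits );
}

var_dump( $db_hits ); // int(2)
```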

Lastly, doing this type of loop per-row minimizes the impact of any failure from the above complications down to each individual failed row. If the database server or the cache server is overloaded during the exact moment a transaction is being attempted, that single operation needs to do a little bit more work rather than all of them failing at once.

I do understand that all of the above goes against a fundamental MySQL concept: the ability to write a single, elegant database query to get everything you’re looking for all at once. That concept works really well for simple applications with relatively small amounts of data, but ultimately ends up being inflexible for applications that do interactive object caching or use multiple servers (like most PHP/WordPress ones).

Very cool, thanks for sharing that @JJJ. I was able to figure out some of that from reading, but the context really helps.

So, here is a specific use case I have for JOIN. We'll stick with books and authors for the sake of simplicity:

SELECT * FROM books
RIGHT JOIN authors ON books.author_id = authors.id
WHERE authors.name = 'Alex' AND books.publish_date > '2020-06-27'

Basically, in the query above, the goal is to find all books written by Alex after June 27th, 2020.

In this circumstance, the query needs to filter across tables. As far as I know, there is no way to get this same result set with two queries instead of one.

If this were to be done in BerlinDB, what would the query look like?

I would guess that using relationships in the schema would make the query as straightforward as this:

$query = new Book_Query( array(
    'date_query' => array( 'before' => '2020-06-27' ),
    'name'       => 'Alex',
) );

But it's really not that simple. What happens if books also has a name column? Berlin would not know which column to reference. Besides, how does Book_Query know when to include author fields, and when not to? Would I have to explicitly tell it to get those fields?

Instead, I wonder if it would be best to make a query specifically for this join relationship:

$query = new Book_Author_Query( array(
    'date_query'  => array( 'before' => '2020-06-27' ),
    'author_name' => 'Alex',
) );

I envision this would return a set of Book_Author_Rows. As suggested above, this would handle all of its own unique caching, and provide a space for specialized methods to handle rows in this context.

JJJ commented

If I understand correctly, I would probably do something like:

// No books yet
$books = array();

// Look for authors named Alex
$authors = new Author_Query( array(
    'name' => 'Alex'
) );

// Assuming WordPress is the environment, pluck the IDs from the query results
$author_ids = wp_list_pluck( $authors->items, 'id' );

// If Authors...
if ( ! empty( $author_ids ) ) {

	// Look for books by Authors before June 27, 2020
	$books = new Book_Query( array(
		'author_id__in' => $author_ids,
		'date_query'    => array( 'before' => '2020-06-27' )
	) );
}

// Here's your books, Andy Dufresne
var_dump( $books ); die;

Now you can interact with both arrays freely. You have an array of Author objects and maybe the array of Book objects by that author (and you don't need to worry about Column naming collisions between tables).

Oh duh. I forgot about __in. That's great!

One small improvement would probably be to make Author_Query pass 'fields' => 'ids' to keep the query light. Of course, in my example you still need that data, so...

I suppose this is fundamentally eager loading, as @ashleyfae described above.

For me, this begs the question: when would a JOIN clause be strictly necessary, then?

I wonder if it would be better to utilize this two-query method, and lean on the cache at that point, instead.

Something like:

	/**
	 * Merges two queries into a single result set.
	 * This method provides a way to run JOIN-like queries while still utilizing BerlinDB's cache.
	 *
	 * @since 1.2.3
	 *
	 * @param array  $primary    Primary query arguments.
	 * @param Query  $join_query The query whose results are joined against the primary query.
	 * @param string $on         The primary-table column that relates to the join query's IDs.
	 * @return array Merged result objects.
	 */
	public function join_query( $primary, Query $join_query, $on ) {

		// If the join query has no results, bail.
		if ( empty( $join_query->items ) ) {
			return [];
		}

		$ids                     = wp_list_pluck( $join_query->items, 'id' );
		$primary[ $on . '__in' ] = $ids;

		$primary_query   = $this->query( $primary );
		$results         = [];
		$join_columns    = wp_list_pluck( ( new $join_query->table_schema )->columns, 'name' );
		$primary_columns = wp_list_pluck( ( new $primary_query->table_schema )->columns, 'name' );

		// Assemble the objects
		foreach ( $primary_query->items as $key => $primary_item ) {
			$join_query_item = current( wp_list_filter( $join_query->items, [ 'id' => $primary_item->$on ] ) );
			$result          = new \stdClass();

			foreach ( (array) $primary_item as $row_key => $row_value ) {
				if ( in_array( $row_key, $primary_columns, true ) ) {
					$result->$row_key = $row_value;
				}
			}

			foreach ( (array) $join_query_item as $row_key => $row_value ) {
				if ( in_array( $row_key, $join_columns, true ) ) {
					$result->$row_key = $row_value;
				}
			}

			$results[] = $result;
		}

		return $results;
	}

Theoretically, this could be expanded into something more robust, that could handle the caching. I think the extra context from the queries should be enough to create a hash to cache these individual records, as well. This would be pretty important because overall this method is quite expensive. Lots of array lookups.

This may actually make it plausible to avoid needing to create all of the extra classes and schemas I discussed above. But would that be a good thing?

JJJ commented

when would a JOIN clause be strictly necessary, then?

That is an interesting question. 😄

When every table is controlled by Berlin (including Meta tables), most queries are going to be better off not being JOINs. There are LEFT, RIGHT, INNER, and OUTER JOINs that would need covering via an API. All of those types of JOINs can currently be replicated with some combination of __in or __not_in once you know which set of data needs to influence the other.
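As one example, the classic anti-join (rows in one table with no match in another) can be replicated with the two-query approach. The SQL below is a sketch using the hypothetical books/authors tables from earlier in this thread:

```sql
-- Anti-join written as a LEFT JOIN:
SELECT books.*
FROM books
LEFT JOIN authors ON books.author_id = authors.id
WHERE authors.id IS NULL;

-- Equivalent two-query form (the second query maps onto
-- a Berlin-style 'author_id__not_in' argument):
SELECT id FROM authors;
SELECT * FROM books WHERE author_id NOT IN ( /* ids from the first query */ );
```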

JOIN diagram stolen from a web search


But would that be a good thing?

Won't know until we try? It's always possible I'm simply forgetting some obvious thing. Usually @ashleyfae is really good at going "hey John what about XYZ" and then I go "crap you're right." 🤣

The other thing is... MySQL doesn't stop you from writing whatever kind of crazy query you want, OUTER JOINing a dozen tables and sorting and grouping by specific stuff. I don't think Berlin needs to be an abstraction for every possible combination of MySQL clauses (that's more what @DrewAPicture's Claws class was starting to do). I think Berlin is better suited as the PHP registry for the tables, with all of the simple operations one would normally take on them, and enough extensibility for additional parsers and classes to be bolted onto it or extended from it.

I wonder if it would be better to utilize this two-query method, and lean on the cache at that point, instead.

I do think that, eventually, something very similar to that is what Berlin should end up with: a way to look at the Column Relationships and decide what kinds of queries are possible, what kind of cache priming is necessary, and what "shape" or format results should be sent back in (objects, JSON, GraphQL, etc.).

Was literally about to post this exact issue, since I needed to JOIN tables for my new plugin using BerlinDB. Would love to see a relationships class added to make this easier for sure! Definitely a good way to go to enable this.

The example above given by @JJJ is good enough for now, but it's a bit messy if we are passing a ton of IDs into that query class. I'm not sure how optimised this would be on large data sets either (say, passing 20,000 IDs into that clause and using wp_list_pluck, etc.), though it's probably fine for most cases. I always try to avoid touching PHP when the database can do it faster.

I'm primarily concerned about large data sets as well. It works fine for smaller ones, but I don't feel the two-query method scales very well once you're looking at thousands or hundreds of thousands of rows (and beyond).

JJJ commented

Thanks everyone for chiming in. This is great!

Using @davidsherlock's example of 20,000 IDs, an intersection that big probably shouldn't use any JOIN and would want to use a subquery instead. That being said, subqueries are pretty difficult to codify, because they are potentially infinitely and recursively nested. 🔁
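For the curious, the subquery version of that intersection would look something like this (a sketch using the hypothetical books/authors tables from earlier in this thread):

```sql
-- Instead of materializing 20,000 IDs in PHP and inlining them:
SELECT * FROM books WHERE author_id IN ( 1, 2, 3 /* ... 20,000 literal IDs */ );

-- ...a subquery keeps the whole intersection inside MySQL:
SELECT * FROM books
WHERE author_id IN ( SELECT id FROM authors WHERE name = 'Alex' );
```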

Obviously there aren't hard-and-fast rules about any of this stuff, and the needs of every application are definitely different. Maybe Berlin should explore supporting all kinds of JOINs? That way, it could start to make recommendations on how to optimize certain queries?

Quick WordPress Audit

Here are some notes after a quick search through the WordPress Core codebase. We should definitely look more at Laravel for inspiration.

WordPress & Meta

Here is an interesting snippet out of WP_Meta_Query. It only ever allows LEFT or INNER type JOINs.

// We prefer to avoid joins if possible. Look for an existing join compatible with this clause.
$alias = $this->find_compatible_table_alias( $clause, $parent_query );
if ( false === $alias ) {
	$i     = count( $this->table_aliases );
	$alias = $i ? 'mt' . $i : $this->meta_table;

	// JOIN clauses for NOT EXISTS have their own syntax.
	if ( 'NOT EXISTS' === $meta_compare ) {
		$join .= " LEFT JOIN $this->meta_table";
		$join .= $i ? " AS $alias" : '';

		if ( 'LIKE' === $meta_compare_key ) {
			$join .= $wpdb->prepare( " ON ($this->primary_table.$this->primary_id_column = $alias.$this->meta_id_column AND $alias.meta_key LIKE %s )", '%' . $wpdb->esc_like( $clause['key'] ) . '%' );
		} else {
			$join .= $wpdb->prepare( " ON ($this->primary_table.$this->primary_id_column = $alias.$this->meta_id_column AND $alias.meta_key = %s )", $clause['key'] );
		}

		// All other JOIN clauses.
	} else {
		$join .= " INNER JOIN $this->meta_table";
		$join .= $i ? " AS $alias" : '';
		$join .= " ON ( $this->primary_table.$this->primary_id_column = $alias.$this->meta_id_column )";
	}

	$this->table_aliases[] = $alias;
	$sql_chunks['join'][]  = $join;
}

Outer Join

The only LEFT OUTER JOIN in WordPress happens using a subquery, when querying users by their post count (note the todo to try and remove that JOIN later):

} elseif ( 'post_count' === $orderby ) {
	// @todo Avoid the JOIN.
	$where             = get_posts_by_author_sql( 'post' );
	$this->query_from .= " LEFT OUTER JOIN (
		SELECT post_author, COUNT(*) as post_count
		FROM $wpdb->posts
		$where
		GROUP BY post_author
	) p ON ({$wpdb->users}.ID = p.post_author)
	";
	$_orderby          = 'post_count';

Inner Joins

There are quite a few INNER JOINs in WordPress, which isn't surprising, and none of them are particularly interesting.

Subqueries

These are pretty great, and super powerful, but also a challenge to conceptualize translating to PHP due to their recursive nature.

Others

WordPress doesn't use any other JOINs, but that doesn't mean Berlin shouldn't support them, or that other applications won't have a need to deploy them in their database strategies.

I suspect that, in some fashion, these different join types would be somehow integrated using the pattern briefly discussed in #55, where they would exist as different query processors.

Maybe the query processor is autoloaded if a field in the schema declares that it uses a particular processor. I think that specificity may make it possible for Query to be smart enough to know which join is necessary based on which fields are included in a query.

Using Books and Authors again:

$query = new Book_Query( array(
    'date_query' => array( 'before' => '2020-06-27' ),
    'name'       => 'Alex',
) );

name would be specified as a relational item pointing at authors.name. In this class, it would also specify the query processor that needs to run in order to connect these items.

So you would end up with at least 3 items to make this connection:

  1. The book schema
  2. The author schema
  3. The query processor that understands how to join the two

It's also possible that a 4th item, a factory class, would exist in the Schema to serve as a service locator, but this may be just fine as an array.
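As a purely hypothetical sketch of those pieces fitting together (none of these keys or class names are real BerlinDB API; this only shows the shape of the idea):

```php
<?php
// Hypothetical: the book schema declares that its 'name' column is
// relational, pointing at authors.name via a named query processor.
$book_schema = array(
	'name' => array(
		'type'         => 'varchar',
		'relationship' => array(
			'column'    => 'authors.name',
			'processor' => 'author_join', // hypothetical processor key
		),
	),
);

// The "4th item": a plain array acting as a service locator that maps
// processor keys to the classes that know how to join the two schemas.
$processors = array(
	'author_join' => 'Processors\\Inner_Join', // hypothetical class name
);

$key = $book_schema['name']['relationship']['processor'];
var_dump( $processors[ $key ] );
```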

I'm going to go ahead and close this. It was a great discussion, but it seems that it's mostly concluded. I would encourage anyone who wishes to continue this discussion to open a new thread in Discussions.