humanmade/aws-rekognition

filter_query_attachment_keywords query mods cause significant performance degradation

nathanielks opened this issue · 5 comments

On sufficiently large postmeta tables, these query modifications are causing queries to take upwards of 60-90 seconds to complete. The postmeta table I'm running queries against has 7M+ rows and is 1.56G in size, 924M of data and 634M of index.

To set a baseline, I removed the modifications made to the query and ran it, resulting in a 2 second query time. This is our baseline.

  • Removing the GROUP BY reduces the query to 19 seconds.
  • Keeping the GROUP BY but removing the sq_hm_aws_rekognition_keywords LEFT JOIN, reduces the query to 25 seconds.
  • Removing GROUP BY and instead replacing it with SELECT DISTINCT gets the query down to 20 seconds and should return the same result set as using GROUP BY (they effectively do the same).

With that being said, would it be possible to instead store rekognition data in its own table that can be properly indexed?

The code in question:

/**
* Filter the SQL clauses of an attachment query to include keywords.
*
* @param array $clauses An array including WHERE, GROUP BY, JOIN, ORDER BY,
* DISTINCT, fields (SELECT), and LIMITS clauses.
* @return array The modified clauses.
*/
function filter_query_attachment_keywords( array $clauses ) : array {
global $wpdb;
if ( ! preg_match( "/\({$wpdb->posts}.post_content (NOT LIKE|LIKE) (\'[^']+\')\)/", $clauses['where'] ) ) {
return $clauses;
}
// Add a LEFT JOIN of the postmeta table so we don't trample existing JOINs.
$clauses['join'] .= " LEFT JOIN {$wpdb->postmeta} AS sq_hm_aws_rekognition_keywords ON ( {$wpdb->posts}.ID = sq_hm_aws_rekognition_keywords.post_id AND sq_hm_aws_rekognition_keywords.meta_key = 'hm_aws_rekognition_keywords' )";
$clauses['groupby'] = "{$wpdb->posts}.ID";
$clauses['where'] = preg_replace(
"/\({$wpdb->posts}.post_content (NOT LIKE|LIKE) (\'[^']+\')\)/",
'$0 OR ( sq_hm_aws_rekognition_keywords.meta_value $1 $2 )',
$clauses['where']
);
return $clauses;
}

Sometimes the SELECT DISTINCT query runs in 2 seconds as well, but it's not consistent. That's with SET SESSION query_cache_type=0; and using SELECT SQL_NO_CACHE, so I don't know if the query cache is truly disabled or if it's returning cached results.

Thanks @nathanielks, can you confirm this is on a site that doesn't use elasticsearch? Even 20 seconds is an absurd amount of time for a query to be running!

A custom table is a possible solution for sure. Maybe we could just put those keywords into the content or other wp_posts field used in the default search...

For our purposes we could switch off this query modification and rely on Elasticsearch to speed up the backend in the shorter term.

@roborourke correct, these are queries made directly to the database! No elasticsearch was being used.

Fixed for Altis, going to close this one out for the time being in favour of a more specific issue to toggle this behaviour via a filter.