Nosto/nosto-magento2

Performance Issue with Large Catalogs and Deep Paging

Closed this issue · 1 comments

Magento 2 version(s) used: 2.4.3
Extension version(s) affected: 5.3.2

Description
Two Issues that are related: (See below for SQL query samples)
The code in \Nosto\Tagging\Model\Service\Update\QueueService::addCollectionToUpsertQueue uses \Nosto\Tagging\Util\PagingIterator. (Persumably with \Nosto\Tagging\Model\ResourceModel\Magento\Product\Collection).

  1. The first issue is that Magento's core code runs an expensive query on every page as it re-calculates the \Magento\Framework\Data\Collection\AbstractDb::getSize after every call to clear in \Nosto\Tagging\Util\PagingIterator::page's call to \Magento\Framework\Data\Collection::clear.
  • The suggested fix here depends on how you handle point two, it might be possible to ignore this issue if you switch to, load id followed by id based paging as detailed below. Else, you can probably supress this logic via a couple of methods as appropriate.
  1. The second isssue is that PagingIterator uses LIMIT OFFSET paging; this is very, very, expensive as you get deeper in the page count. MySql needs to sort/filter through the first OFFSET products in order to find the next LIMIT products. This cost is per query, so as you get deeper it get more and more expensive.
    There is no real method to mitigate this short of switching to another form of paging.
    You can:
  • Load all the entity_ids followed by using the loaded ids to create batches (This will likely fix issue one) (This could have memory issues, but large sites should have enough ram to hold X product_ids in memory).
    • See: \Magento\Catalog\Model\ResourceModel\Product\Collection::getAllIdsCache/getAllIds
  • Use the last entity_id/row_id loaded to set the offset, and drop the OFFSET clause. (via a >entity_id clause, along with an order on entity_id) (This will likely not fix the count issue above). (This would require some testing)

How to reproduce
Run bin/magento indexer:reindex nosto_index_product_queue on a relativly large collection. (2550 Should show the queries used, but would not be slow as the paging and filtering is not deep).
Use some method to show the queries made; either query-log or adding in debugging logging.

Magento 2 mode

  • Developer
  • [x ] Production

Full page cache

  • [x ] Enabled
  • Disabled

Nosto indexer mode

  • Update On Save
  • [x ] Update By Schedule

Possible Solution
See above

Additional context

SELECT COUNT(DISTINCT e.entity_id)                                                                                                                                                                                                               FROM `catalog_product_entity` AS `e` INNER JOIN `catalog_category_product_index_store1` AS `cat_index` ON cat_index.product_id=e.entity_id AND cat_index.store_id=1 AND cat_index.visibility IN(3, 2, 4) AND cat_index.category_id=2 INNER JOIN `catalog_product_entity_int` AS `at_status_default` ON (`at_status_default`.`row_id` = `e`.`row_id`) AND (`at_status_default`.`attribute_id` = '96') AND `at_status_default`.`store_id` = 0 LEFT JOIN `catalog_product_entity_int` AS `at_status` ON (`at_status`.`row_id` = `e`.`row_id`) AND (`at_status`.`attribute_id` = '96') AND (`at_status`.`store_id` = 1) INNER JOIN `catalog_product_entity_int` AS `at_visibility_default` ON (`at_visibility_default`.`row_id` = `e`.`row_id`) AND (`at_visibility_default`.`attribute_id` = '102') AND `at_visibility_default`.`store_id` = 0 LEFT JOIN `catalog_product_entity_int` AS `at_visibility` ON (`at_visibility`.`row_id` = `e`.`row_id`) AND (`at_visibility`.`attribute_id` = '102') AND (`at_visibility`.`store_id` = 1) WHERE ((IF(at_status.value_id > 0, at_status.value, at_status_default.value) = 1) AND (IF(at_visibility.value_id > 0, at_visibility.value, at_visibility_default.value) != 1)) AND (e.created_in <= '1637600400') AND (e.updated_in > '1637600400')
SELECT `e`.*, `cat_index`.`position` AS `cat_index_position`, IF(at_status.value_id > 0, at_status.value, at_status_default.value) AS `status`, IF(at_visibility.value_id > 0, at_visibility.value, at_visibility_default.value) AS `visibility` FROM `catalog_product_entity` AS `e` INNER JOIN `catalog_category_product_index_store1` AS `cat_index` ON cat_index.product_id=e.entity_id AND cat_index.store_id=1 AND cat_index.visibility IN(3, 2, 4) AND cat_index.category_id=2 INNER JOIN `catalog_product_entity_int` AS `at_status_default` ON (`at_status_default`.`row_id` = `e`.`row_id`) AND (`at_status_default`.`attribute_id` = '96') AND `at_status_default`.`store_id` = 0 LEFT JOIN `catalog_product_entity_int` AS `at_status` ON (`at_status`.`row_id` = `e`.`row_id`) AND (`at_status`.`attribute_id` = '96') AND (`at_status`.`store_id` = 1) INNER JOIN `catalog_product_entity_int` AS `at_visibility_default` ON (`at_visibility_default`.`row_id` = `e`.`row_id`) AND (`at_visibility_default`.`attribute_id` = '102') AND `at_visibility_default`.`store_id` = 0 LEFT JOIN `catalog_product_entity_int` AS `at_visibility` ON (`at_visibility`.`row_id` = `e`.`row_id`) AND (`at_visibility`.`attribute_id` = '102') AND (`at_visibility`.`store_id` = 1) WHERE ((IF(at_status.value_id > 0, at_status.value, at_status_default.value) = 1) AND (IF(at_visibility.value_id > 0, at_visibility.value, at_visibility_default.value) != 1)) AND (e.created_in <= '1637600400') AND (e.updated_in > '1637600400') ORDER BY `e`.`created_at` DESC LIMIT 500 OFFSET 14500

Should be fixed with out 7.0 release