walmat/nebula-old

Implement Shared Monitors

pr1sm opened this issue · 5 comments

pr1sm commented

Consolidate instances of the Monitor class based on the site of each task runner. The current monitor implementation is inefficient because each task runner uses a separate instance of the monitor to look for products. This means tasks running against the same site each have to make their own requests to the site for data, leading to faster bans and more time spent making requests.

Instead, monitoring should be consolidated based on the site. The Monitor class should be converted to a TaskRunner-like process. This Monitor will be able to asynchronously add/remove the product data that should be monitored, which means the matching functions need to be updated to work with multiple product groups in both their input and output.

The monitors will then be managed similarly to how the task runners are managed, using a MonitorManager (see #442).
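
It could look something like this (just a sketch; MonitorManager, registerTask, addProductData, etc. are placeholder names, not a final API, and it assumes the shared Monitor exposes start/stop and add/remove methods):

```js
// Hypothetical sketch: one shared Monitor per site, reference-counted by the
// tasks that use it. All names here are placeholders.
class MonitorManager {
  constructor() {
    this._monitors = new Map(); // site url -> { monitor, taskIds }
  }

  // Called when a task runner starts: reuse the site's monitor if one exists.
  registerTask(task) {
    const key = task.site.url;
    let entry = this._monitors.get(key);
    if (!entry) {
      entry = { monitor: new Monitor(task.site), taskIds: new Set() };
      this._monitors.set(key, entry);
      entry.monitor.start();
    }
    entry.taskIds.add(task.id);
    entry.monitor.addProductData(task.id, task.product);
    return entry.monitor;
  }

  // Called when a task runner stops: tear the monitor down once nothing needs it.
  deregisterTask(task) {
    const entry = this._monitors.get(task.site.url);
    if (!entry) return;
    entry.monitor.removeProductData(task.id);
    entry.taskIds.delete(task.id);
    if (entry.taskIds.size === 0) {
      entry.monitor.stop();
      this._monitors.delete(task.site.url);
    }
  }
}
```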

Related to #436 whoops 😅

pr1sm commented

The direction I'm currently heading with this is to split up the work into fetching products, matching products, fetching the detailed product list, and then notifying a manager of matched products. The current monitor relies on the Parser class to perform these functions all at once, but only for one product input and one parsing type at a time. To create a monitor capable of handling multiple product inputs and multiple parsing types, a shared monitor needs access to each of the individual parts of parsing. That will allow us to build a shared dataset from all parsers on a single site, then apply matching on that same dataset for the different product input types.
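
Very roughly, the per-cycle flow I'm picturing for a shared monitor on one site is something like this (method names are placeholders; match stands in for whatever the split-out matching ends up being):

```js
// Hypothetical per-cycle flow for a shared monitor on a single site.
// `parsers` are the split-up parser instances for the site, `match` is a
// placeholder for the parse-type-specific matching described below.
async function runMonitorCycle(parsers, productInputs, match) {
  // 1. Fetch once per parser and pool the results into one shared dataset.
  const results = await Promise.all(parsers.map(p => p.fetch()));
  const products = results.flat();

  // 2. Run matching over that same dataset once per product input.
  const matches = new Map(); // product input -> matched product (or url)
  for (const input of productInputs) {
    matches.set(input, match(products, input));
  }
  return matches;
}
```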

Parsing Changes

I've already started working on this change; breaking up the run method of the parsers is the first change that needs to get done. I'm focusing on the basic parsers first and handling the special parsing case afterwards. Since a shared monitor will be specific to one site, it's possible that a subclass of the shared monitor could be created for special site parsing with a different implementation. Because of that, I'm skipping the changes to special parsing until we have a clearer picture of the basic parsing and how it interacts with the new shared monitor.

  1. Break the run method up into its two main pieces: fetch and match (see the first sketch after this list). This will allow us to fetch the data from all parsers separately, then use a single match implementation per parsing type. Some parsers also include a separate step in the run method to fetch detailed data, but that is already broken out into a static function in the base Parser class, so we don't have to worry about it.
  2. Adjust the utility functions to allow a single call to generate multiple matches. The current utility functions take one pass through the incoming product data and get one match. The new functions should take one pass through the incoming data and get all available matches.
  3. In particular, this needs to be done for keyword matching. The current function will need to accept multiple keyword inputs and return the single product match for each keyword input (see the second sketch after this list).
  4. Implement the _parseAll method in the SharedMonitor class. The steps are detailed pretty clearly in the comments on the issue_441 branch. Breaking the Parser run method up into separate steps should allow this to be implemented.
  5. The _parseAll method should be broken up into the 3 _handle* methods I've stubbed out (see the third sketch after this list):
    1. _handleParse - this is where the parser fetch method should be used to generate the product dataset used for matching.
    2. _handleFilter - this is where the parser match methods should be used to generate product data (or product urls) based on the parse type. The product inputs should first be separated by parse type, then the parse-type-specific match methods should be used to get the relevant info:
      1. keywords will use the utility function to receive matches. If a match includes the full product info, it should be included; otherwise a url to the full info should take the place of the match.
      2. variants will short-circuit since they've already been "matched".
      3. urls will return the url to the full info.
    3. _handleProcess - this is where the full product info is retrieved for all matched products that have incomplete data. Some consolidation will need to be done here since a product input by keywords could be the same as a product input by url. Once the url fetching is done, we should have a full product object for each product input. Size parsing can then be run on this set to return the valid variants for each product input.
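
For (1), the shape I have in mind is roughly this (sketch only; the method bodies are the existing run logic split apart, not shown here):

```js
// Hypothetical split of Parser.run() into fetch + match. The current run()
// does both at once; splitting them lets a shared monitor pool fetched data
// across parsers before matching.
class Parser {
  constructor(site, proxy) {
    this.site = site;
    this.proxy = proxy;
  }

  // fetch: request the raw product list for the site and normalize it.
  async fetch() {
    // ...existing request logic from run()
  }

  // match: pure function over an already-fetched dataset, so it can be run
  // against a shared dataset for any number of product inputs.
  match(products, productInput) {
    // ...existing matching logic from run()
  }

  // run() stays as a thin wrapper for the existing single-task code path.
  async run(productInput) {
    const products = await this.fetch();
    return this.match(products, productInput);
  }
}
```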
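
For (2)/(3), a rough version of a multi-input keyword matcher could look like this (it assumes each keyword input has pos/neg arrays and each product has a title, and it ignores whatever ranking the current utility does when several products match):

```js
// Hypothetical multi-input keyword matcher: one pass over the product list,
// returning the matched product (or null) for each keyword input.
function matchKeywords(products, keywordInputs) {
  const matches = new Map(keywordInputs.map(input => [input, null]));

  for (const product of products) {
    const title = product.title.toUpperCase();
    for (const input of keywordInputs) {
      if (matches.get(input)) continue; // this input already has a match
      const pos = input.pos.every(kw => title.includes(kw.toUpperCase()));
      const neg = input.neg.some(kw => title.includes(kw.toUpperCase()));
      if (pos && !neg) {
        matches.set(input, product);
      }
    }
  }
  return matches; // keyword input -> matched product (or null)
}
```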
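
And for (4)/(5), the _parseAll/_handle* structure ties those pieces together, roughly like this (very hand-wavy; the real stubs are on the issue_441 branch, and matchKeywords here is the sketch from above):

```js
// Hypothetical shape of SharedMonitor._parseAll broken into the stubbed
// _handle* steps. Parse type names and helpers are assumptions, not final.
class SharedMonitor {
  constructor(site, parsers) {
    this.site = site;
    this._parsers = parsers;
  }

  async _parseAll(productInputs) {
    const products = await this._handleParse();            // shared dataset
    const filtered = this._handleFilter(products, productInputs);
    return this._handleProcess(filtered);                  // full product objects
  }

  // _handleParse: run every parser's fetch and pool the results.
  async _handleParse() {
    const results = await Promise.all(this._parsers.map(p => p.fetch()));
    return results.flat();
  }

  // _handleFilter: separate inputs by parse type, then match against the dataset.
  _handleFilter(products, productInputs) {
    return productInputs.map(input => {
      switch (input.parseType) {
        case 'keywords':
          // full product if the match includes it, otherwise a url placeholder
          return { input, match: matchKeywords(products, [input]).get(input) };
        case 'variant':
          return { input, match: input }; // already "matched"
        case 'url':
          return { input, match: { url: input.url } };
        default:
          return { input, match: null };
      }
    });
  }

  // _handleProcess: fetch full product info for matches that only have a url,
  // deduping identical urls that came from different input types, then run
  // size parsing per input.
  async _handleProcess(filtered) {
    // ...url fetching + size parsing
    return filtered;
  }
}
```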

State Machine Changes

I've done some proof-of-concept work to allow each of the _handle* methods to run concurrently instead of being run in a single state machine, but the parsing implementation needs to be overhauled before this can really be implemented.

The main idea in this stage is to perform the _handleParse and _handleFilter methods in one "thread" while another "thread" performs the _handleProcess method. This will allow the fetch/matching of the full data set to happen separately from the product url requests.
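
Something along these lines (queue shape and method names are placeholders, stop/abort handling is left out, and _handleParseAndFilter just stands in for running _handleParse + _handleFilter):

```js
// Hypothetical two-"thread" layout: one loop fetches/matches and queues urls
// that still need full product info; the other drains that queue sequentially.
const delay = ms => new Promise(resolve => setTimeout(resolve, ms));

async function runSharedMonitor(monitor) {
  const pending = []; // matches that still need their full product info

  const parseLoop = (async () => {
    for (;;) {
      const matches = await monitor._handleParseAndFilter();
      for (const m of matches) {
        if (!m.fullInfo) pending.push(m);
      }
      await delay(monitor.pollDelay);
    }
  })();

  const processLoop = (async () => {
    for (;;) {
      const next = pending.shift();
      if (next) {
        await monitor._handleProcess(next);
      }
      await delay(monitor.requestDelay); // throttle to avoid proxy softbans
    }
  })();

  await Promise.all([parseLoop, processLoop]);
}
```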

The product url requests will then need to be performed sequentially (or throttled) to prevent proxy softbans from occurring. The throttled nature of the requests also allows the fetch/match "thread" to change the url being requested if a newer match is received.
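
For the replacement part, keying the pending work by product input instead of using a plain array would let the fetch/match "thread" overwrite a stale url before the process "thread" gets to it — something like:

```js
// Hypothetical: keep pending full-info fetches keyed by product input id so a
// newer match simply replaces the url that will be requested next.
class PendingQueue {
  constructor() {
    this._pending = new Map(); // input id -> url to fetch
  }

  // Called by the fetch/match side; overwrites any older url for this input.
  enqueue(inputId, url) {
    this._pending.set(inputId, url);
  }

  // Called by the (throttled) process side, one entry at a time.
  dequeue() {
    const first = this._pending.entries().next();
    if (first.done) return null;
    const [inputId, url] = first.value;
    this._pending.delete(inputId);
    return { inputId, url };
  }
}
```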

Sorry man. I haven't been able to do any work on this yet, as I just now got the checkout rewrite to a place where I feel comfortable switching over.

Where's a good place to start? I know you mentioned breaking the run/fetch methods up in the parsers, but idk if you've touched on that at all.

Let me know. Hope you're having a good time in Cali 🤝

pr1sm commented

No problem! I actually started working on this a little, so I’ll push my changes up. If you want to work on the structure of the state machine changes, that would be a good place to start since it would also apply to the checkout module. The basic gist would be running two async functions and coordinating requests so we don’t get a softban. The first async function would perform fetching and matching, then it would update some type of queue that would be used by the second async function to fetch the full product info of matched items.

Sounds good! I'll hop onto a new branch and get that going.