running unique jobs?
maia opened this issue · 2 comments
This is more of a question than an issue, but after reading the announcement I'm very excited about the option of using gush:
- Is it possible to enforce that jobs are run only once (within a certain amount of time), similar to the "unique jobs" feature in Sidekiq Enterprise?
Here's an example of my use case: my app parses a stream of incoming tweets, and for each URL in a tweet it will follow the URL's redirects (it might be a shortened URL), then parse the HTML, search for images, and download those images. If multiple tweets within a batch contain the same URL, I do not want to re-query that URL (within a certain amount of time; if I re-encounter it a day later, then I will re-query it).
Currently (with Sidekiq 4.1), race conditions between parallel workers cause the same URL to be enqueued multiple times. I could certainly record in SQL/Redis that a URL has already been enqueued and check for that when enqueueing, but this adds complexity to the code. Also, this is just one simplified example.
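For concreteness, here is a rough sketch of the kind of Redis guard I mean, assuming the redis-rb gem; `UrlEnqueuer`, `ProcessUrlJob`, and the 24-hour TTL are just placeholders:

```ruby
require "redis"
require "sidekiq"

REDIS = Redis.new

class ProcessUrlJob
  include Sidekiq::Worker

  def perform(url)
    # follow redirects, parse the HTML, extract and download images, ...
  end
end

class UrlEnqueuer
  DEDUP_TTL = 24 * 60 * 60 # re-query a URL at most once per day

  # SET with NX + EX is atomic, so two parallel workers cannot both
  # pass this check for the same URL within the TTL window.
  def self.enqueue(url)
    first_seen = REDIS.set("seen_url:#{url}", "1", nx: true, ex: DEDUP_TTL)
    ProcessUrlJob.perform_async(url) if first_seen
  end
end
```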
What I'm looking for is a way to enqueue the processing of an array of tweets, process them in parallel (including optional follow-up jobs such as parsing the HTML, extracting keywords, and downloading images), and once all tweets and all of these possible sub-routines have completed, follow up with another job. Therefore:
- Can I design a workflow so that a job runs not after all batches of the previous job have completed, but only after all batches of the previous job plus all follow-up jobs of those batches have completed?
(Having each job enqueue follow-up jobs also gives me the advantage that while some workers may still be processing tweets, others might be parsing an HTML document, and others might already be processing images. This is helpful because it avoids first running everything that is CPU-bound and then everything that is I/O-bound. I'd prefer to spread this out even more, but am unsure how I could do so.)
Is this a use case where gush will help? Thanks a lot.
Update: it looks as if my second question is related to #26.
-
What you are describing isn't a use case Gush was created for. You should preprocess the batch of tweets first and extract the unique URLs from them. Only after that, process those URLs in parallel.
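Roughly like this (an untested sketch using Gush's dynamic `run`/`after:` pattern; the class names and the `extract_urls` helper are placeholders):

```ruby
require "gush"

class TweetBatchWorkflow < Gush::Workflow
  def configure(urls)
    # One job per unique URL, all running in parallel.
    url_jobs = urls.map do |url|
      run ProcessUrlJob, params: { url: url }
    end

    # Runs only after every ProcessUrlJob has finished.
    run SummarizeBatchJob, after: url_jobs
  end
end

class ProcessUrlJob < Gush::Job
  def perform
    url = params[:url]
    # follow redirects for url, parse the HTML, download images, ...
  end
end

class SummarizeBatchJob < Gush::Job
  def perform
    # follow-up work for the whole batch
  end
end

# Usage, with extract_urls and tweets as placeholders; uniq removes
# duplicate URLs before the workflow is even built:
#   flow = TweetBatchWorkflow.create(extract_urls(tweets).uniq)
#   flow.start!
```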
-
Yes, the second one is related to #26. Additionally, you can look at dynamically adding jobs to the workflow: #24 (comment)
Do share an example if you manage to work it out with Gush, I'm curious!
Closing for inactivity