shiblon/entroq

are there any ways to de-duplicate?


Some of my human-facing task producers tend to spam impatiently, and I would love to catch those duplicate tasks.
I could get tasks and compare the hash of each value, but I'm worried about a task getting claimed in the meantime. Do you have any solutions?

This can indeed be a problem in human-facing task creation. It kind of depends on what you are doing with the tasks, but setting up multiple queues for the task to traverse can really help.

  • Human queue - accepts tasks, including duplicate ones, from whatever UI you use to submit things.
  • Work queue - a worker pulls from the human queue, calculates a hash, makes an empty entry in a database (keyed by the hash, with the task ID), then does the work only if the hash hasn't been seen before. When finished, it fills out the rest of the database entry with the results and deletes the task (see the sketch below).

Worst case, if the database entry is written and the task is then deleted and resubmitted, the worker ignores the new one because its entry is already full and complete (this depends on how you calculate your hash).
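Here is a minimal sketch of that worker loop. The `seen_tasks` table, the `claim`/`del` callbacks, and `doWork` are all hypothetical stand-ins (EntroQ's actual Go client API differs), and the `ON CONFLICT` upsert assumes a Postgres-style driver; the point is the insert-if-absent dedup step.

```go
package dedup

import (
	"context"
	"crypto/sha256"
	"database/sql"
	"encoding/hex"
)

// Task is a stand-in for a claimed queue task.
type Task struct {
	ID    string
	Value []byte
}

// workOnce claims one task from the human queue, records its hash in a
// hypothetical seen_tasks(hash PRIMARY KEY, task_id, result) table, does
// the work only on first sight, and deletes the task either way.
func workOnce(ctx context.Context, db *sql.DB,
	claim func(context.Context) (*Task, error),
	del func(context.Context, *Task) error,
	doWork func([]byte) ([]byte, error)) error {

	task, err := claim(ctx) // blocks until a task arrives in the human queue
	if err != nil {
		return err
	}

	sum := sha256.Sum256(task.Value)
	hash := hex.EncodeToString(sum[:])

	// Try to create an empty placeholder entry for this hash.
	// A duplicate hash inserts nothing.
	res, err := db.ExecContext(ctx,
		`INSERT INTO seen_tasks (hash, task_id) VALUES ($1, $2)
		 ON CONFLICT (hash) DO NOTHING`, hash, task.ID)
	if err != nil {
		return err
	}
	fresh, err := res.RowsAffected()
	if err != nil {
		return err
	}

	if fresh == 1 {
		// First time this payload has been seen: do the work, then
		// fill out the rest of the entry with the results.
		result, err := doWork(task.Value)
		if err != nil {
			return err
		}
		if _, err := db.ExecContext(ctx,
			`UPDATE seen_tasks SET result = $1 WHERE hash = $2`,
			result, hash); err != nil {
			return err
		}
	}
	// Otherwise the entry already exists (full and complete in the
	// worst case described above), so the duplicate is ignored.

	// Either way, the task leaves the queue.
	return del(ctx, task)
}
```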

That is how we did it for a couple of DARPA projects.

Another approach is to calculate the task UUID before insertion, based on information in the task, and then insert with that specific ID. The system can then skip inserting a task whose ID already exists (there's a setting for this), but this requires that every task ever inserted into the system live forever, for example by moving to a "completed" queue instead of merely being deleted.
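A minimal sketch of that ID derivation, using the real uuid.NewSHA1 helper from github.com/google/uuid (the namespace value and payloads here are made up for illustration):

```go
package main

import (
	"fmt"

	"github.com/google/uuid"
)

// taskNamespace is an arbitrary fixed namespace UUID chosen once for this
// system; any valid UUID works, as long as it never changes.
var taskNamespace = uuid.MustParse("5c44f3b0-2f3a-4a1e-9b8e-7d2f0c6a1d42")

// taskID derives the task's UUID from its content, so a duplicate
// submission produces the same ID and can be rejected at insert time.
func taskID(value []byte) uuid.UUID {
	return uuid.NewSHA1(taskNamespace, value)
}

func main() {
	a := taskID([]byte(`{"job":"resize","img":"cat.png"}`))
	b := taskID([]byte(`{"job":"resize","img":"cat.png"}`))
	fmt.Println(a == b) // true: identical payloads collide on ID
}
```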

So the database approach is usually easiest for people to reason about.

Having a separate queue to consume -> clean -> publish to another queue is a great solution.
My use case is slightly more niche, though:
I want to de-duplicate a->a->a->b->a->a into a->b->a, where a and b are idempotent tasks in isolation.

Some solutions I can think of:

  1. I could claim the last item in the queue, compare hashes, and skip adding the new task if they match.
  2. I could also:
    • claim 2 items from the top of the queue at once
    • if both are the same, skip by:
      • deleting the older item in the queue
      • returning the other unmodified
    • repeat

I'm not sure claiming more than one item at once is a good idea, and I'm also not sure about claiming the last item.

Lots of interesting thoughts in here. A couple of things that might help ground the discussion:

  1. It's not possible to claim two things at once from a single queue; that would break the abstraction and introduce unwanted race conditions.
  2. Any time you attempt to claim a specific kind of task from among the others in a queue, whether by being the "last task" or by any other quality inherent in it, you're going to run into races.

It sounds like you rely pretty heavily on ordering, which makes EntroQ (or TaskMaster, or solutions underlying Conductor) a poor fit: ordering is explicitly omitted.

If deduplicating ordered streams is part of the use case, you may want to consider Kafka, NATS.io, or some other pubsub mechanism, since those are built around ordered delivery.

It's true that you will get duplicates in all of these cases (due to consistency theory), but if you want to collapse an ordered stream, then you probably need an ordered stream to begin with, and EntroQ is purposefully unordered. :)

That said, I'd be curious to hear more about your specific use case. Quite often a quick rethink of the surrounding system can yield interesting results. I'd be happy to have a private conversation about it sometime if you would rather not have it out in the open on GitHub, or we can get into it here, whatever is comfortable.

To summarize the principles that might govern your choices:

  1. If you need ordering in a single queue (i.e., order is part of the critical information in your system), you don't want EntroQ: the only ordering you can impose safely with EntroQ or similar systems is ordering between queues, never within them. In other words, treat queues as nodes in a state machine graph and you're okay (see the sketch after this list).
  2. If you feel the need to do queue "grooming", e.g., removing duplicates from a queue, you technically can, but definitely shouldn't. That breaks the entire competing-consumer abstraction. The entire idea is that a consumer can claim a task without worrying about what any other consumer is doing, ever. When you engage in queue grooming, you are saying "I care about what other consumers are going to do with this queue", breaking the abstraction, and introducing the need for additional synchronization primitives.
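To make the "queues as state-machine nodes" idea concrete, here is a hypothetical sketch. The Client interface below is illustrative only, not EntroQ's actual Go API; the essential property is that Move is atomic (delete plus insert in one transaction), so a task is never in two states at once and never lost between them.

```go
package statemachine

import "context"

// Task is a stand-in for a claimed queue task.
type Task struct {
	ID    string
	Queue string
	Value []byte
}

// Client is a hypothetical competing-consumer task client.
type Client interface {
	// Claim blocks until a task is available, then leases it exclusively.
	Claim(ctx context.Context, queue string) (*Task, error)
	// Move atomically deletes the task from its queue and reinserts it
	// into toQueue, advancing it one node in the state machine graph.
	Move(ctx context.Context, task *Task, toQueue string) error
}

// advance runs one state transition, e.g. "validate" -> "process".
func advance(ctx context.Context, c Client, from, to string, work func([]byte) error) error {
	task, err := c.Claim(ctx, from)
	if err != nil {
		return err
	}
	if err := work(task.Value); err != nil {
		// Do not move the task: the claim expires and another
		// worker retries, with no coordination needed between them.
		return err
	}
	return c.Move(ctx, task, to)
}
```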

Finally, you can mix and match. Sometimes it makes a lot of sense to have a Kafka-like solution for task intake, where ordering might be important and deduplication might be desirable in an ordered stream. You can, for example, have a worker subscribed to a Kafka topic, pulling things down and deduplicating them as it goes, then putting the deduplicated tasks into EntroQ for workers to pick up and push along the state machine graph (between queues). That works great, and actually answers a lot of use cases for which people use only Kafka but should probably use two mechanisms.
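A sketch of that intake bridge, collapsing consecutive duplicates the way the a->a->a->b->a->a example above calls for. The kafka-go reader calls are real (github.com/segmentio/kafka-go); the broker address, topic and queue names, and insertTask are assumptions standing in for the EntroQ insert.

```go
package main

import (
	"bytes"
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// insertTask is a hypothetical stand-in for inserting a task into an
// EntroQ queue; the real client call goes here.
func insertTask(ctx context.Context, queue string, value []byte) error {
	return nil
}

func main() {
	ctx := context.Background()

	// Consume the ordered intake stream from Kafka.
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"}, // assumed broker address
		Topic:   "task-intake",              // assumed topic name
		GroupID: "dedup-bridge",
	})
	defer r.Close()

	var last []byte // previous payload, for collapsing a->a->a into a
	for {
		msg, err := r.ReadMessage(ctx) // commits offsets via the group
		if err != nil {
			log.Fatal(err)
		}
		if bytes.Equal(msg.Value, last) {
			continue // consecutive duplicate: drop it
		}
		last = append(last[:0], msg.Value...)

		// Hand the deduplicated task to the unordered task manager.
		if err := insertTask(ctx, "work", msg.Value); err != nil {
			log.Fatal(err)
		}
	}
}
```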

The mistake most folks seem to make in large system design is to assume that one task storage solution is enough. It can be in some cases, but in every real-world use case I have encountered, I have typically needed all of:

  • A pubsub (Kafka, NATS.io)
  • A database (SQL, noSQL, whatever)
  • A task manager (EntroQ)

That might be true for your case, too. Let's find out!

I have been delaying adding Kafka to the architecture, since I'm still in the early stages of building the system.
As you mentioned, I believe there are other opportunities in my system that could be altered for a simpler/better solution.
I'd love to have a conversation over a call to explain my broader case, relay some ideas, and learn from you :)
Do you mind if I shoot you an email and set up a call?

Thank you so much! I have sent you an email :) Let's continue there.