a2aproject/a2a-python

[Feat]: Add Redis-backed QueueManager for Production Deployments

Closed · 3 comments

Is your feature request related to a problem? Please describe.

The current A2A Python SDK only provides InMemoryQueueManager, which makes it impossible to run agentic applications in production environments. In distributed setups like Kubernetes with multiple pods, the in-memory queue cannot share state between instances, leading to:

  • Lost messages between pods
  • Inconsistent task state across the cluster
  • No way to scale horizontally
  • No persistence of events across pod restarts

Describe the solution you'd like

Implement a Redis-backed QueueManager that uses Redis Streams for reliable, distributed event queuing. This would enable:

  • Production-ready deployments in Kubernetes and other distributed environments
  • Horizontal scaling across multiple pods
  • Persistent event storage and recovery
  • Consistent state management across the cluster
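To make this concrete, here is a minimal sketch of what a Redis Streams-backed queue could look like, using the `redis` asyncio client. The class and method names below are illustrative only and do not mirror the SDK's actual QueueManager API:

```python
# Hypothetical sketch: method names approximate a queue-manager shape and are
# not taken from a2a-python itself.
import json

from redis.asyncio import Redis


class RedisStreamsQueueManager:
    """Buffers per-task events in a Redis Stream so any pod can produce/consume."""

    def __init__(self, redis: Redis, prefix: str = "a2a:queue:") -> None:
        self._redis = redis
        self._prefix = prefix

    def _stream(self, task_id: str) -> str:
        return f"{self._prefix}{task_id}"

    async def enqueue(self, task_id: str, event: dict) -> None:
        # XADD appends the event; the stream persists across pod restarts.
        await self._redis.xadd(self._stream(task_id), {"event": json.dumps(event)})

    async def dequeue(self, task_id: str, last_id: str = "0-0") -> list[tuple[str, dict]]:
        # XREAD returns events newer than last_id; block briefly if none are ready.
        result = await self._redis.xread({self._stream(task_id): last_id}, block=1000, count=10)
        events = []
        for _stream, entries in result:
            for entry_id, fields in entries:
                events.append((entry_id, json.loads(fields["event"])))
        return events


# Usage (assumes decode_responses=True so stream fields come back as str):
# qm = RedisStreamsQueueManager(Redis.from_url("redis://localhost:6379", decode_responses=True))
```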

Describe alternatives you've considered

  • Database-backed queues (more complex, higher latency)
  • Message queue systems like RabbitMQ (additional infrastructure complexity)
  • Shared memory solutions (not viable in containerized environments)

Additional context

Redis is already widely used in agentic AI platforms like LangGraph and offers a good balance of performance, reliability, and operational simplicity for distributed event streaming. Many serverless and microservices architectures already use Redis, making this a natural fit for production A2A deployments.

Reference implementations needed: Are there any existing Redis queue implementations in the A2A ecosystem that could serve as a reference?

This is a duplicate of: #269

I agree with the underlying need here - the memory-based queue definitely doesn't work in distributed systems 👍. However, I don't think vendor-specific implementations should be included in the core SDK.

Instead, we should provide abstracted interfaces that allow anyone to integrate their preferred queue manager with A2A. This approach keeps the core SDK focused and prevents it from becoming bloated with vendor-specific code. It also provides a clear path for users to implement support for any queue system while avoiding the precedent where we're expected to maintain implementations for every possible vendor ("if Redis is supported, why not RabbitMQ/SQS/Kafka/etc.").

Would you be open to exploring an interface-based approach instead? This way we can address the legitimate need for distributed queuing while keeping the architecture clean and extensible.

@lukehinds Thank you for the feedback and for pointing out the similarity to #269 - I agree they're related and both aim to address distributed queuing for production, though the implementations differ (streams vs pub/sub).

Spec constraint: The A2A protocol requires HTTP(S) transport. That implies the queue's role is buffering / ordering / persistence for the SDK; actual delivery from the queue to a peer should still use the SDK’s HTTP endpoints so the spec remains honored. The queue should not replace the HTTP transport in the default flow.

Two separate but related efforts:

  1. Distributed queuing (HTTP-compliant): Provide a pluggable QueueAdapter/DistributedQueue in core and move durable implementations (Redis Streams, Pub/Sub, etc.) to adapter packages. This addresses immediate production needs (durability, multi-node scaling) without changing the protocol; a sketch of such an interface follows this list.
  2. Event-driven transport: Fully event-driven transports (e.g. Kafka as explored in #434) are a different axis—useful for async/event-first architectures but they deviate from the HTTP MUST and should be optional or require a spec extension/opt-in.
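As a rough sketch of the interface-based split (the names and method set here are my assumptions, not an existing SDK API), the core could depend only on an abstract base class like this, with all vendor code living in adapter packages:

```python
# Hypothetical interface sketch; names and methods are illustrative, not the SDK's API.
from abc import ABC, abstractmethod
from typing import Any


class DistributedQueue(ABC):
    """Vendor-neutral event queue: core types against this, adapters implement it."""

    @abstractmethod
    async def enqueue(self, task_id: str, event: dict[str, Any]) -> None:
        """Append an event to the task's queue."""

    @abstractmethod
    async def dequeue(self, task_id: str) -> dict[str, Any] | None:
        """Return the next event for the task, or None if the queue is empty."""

    @abstractmethod
    async def close(self, task_id: str) -> None:
        """Signal that no more events will be produced for this task."""


# Adapter packages (Redis Streams, Pub/Sub, etc.) would subclass DistributedQueue;
# the core SDK never imports any vendor client.
```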

What I can do next:

  • Refactor this PR to add the DistributedQueue interface + unit tests to the core repo (no vendor code in core).
  • Move the Redis Streams implementation into a separate http-extensions directory in-repo. This keeps the core vendor-agnostic and small while offering production-ready implementations.

I’m focused on the HTTP-compliant queue first because it unblocks current production use. If maintainers prefer that split (core interface + out-of-core adapters), I’ll start the work and open the PR(s).

Thanks for the suggestion; however, the A2A SDK is designed not to be tied to specific vendors or proprietary systems. See this response from #269 (comment)

Sorry, I was OOO for the long weekend. To clarify a few things:

  • The DatabaseTaskStore is not tied to a specific vendor. It uses the SQLAlchemy library, which allows many database backends to be used directly (see the sketch after this list).
  • I agree we should not have a vendor specific implementation in the SDK. The pattern we are following is to define the right interfaces/logic in the SDK and show a proof of concept with a specific vendor solution in the samples repo. This way you can verify your abstractions are valid for at least one vendor, and hopefully others can implement different versions for different vendors until the SDK abstractions are correct.
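To illustrate that point, the snippet below shows the same engine-construction code pointed at two different backends via SQLAlchemy; the exact DatabaseTaskStore constructor is not reproduced here, the point is only that the abstraction layer is the library, not a vendor:

```python
# Illustrative only: the abstraction is SQLAlchemy, so the same store logic can
# target different databases by swapping the connection URL.
from sqlalchemy.ext.asyncio import create_async_engine

# SQLite for local development...
dev_engine = create_async_engine("sqlite+aiosqlite:///./tasks.db")

# ...and PostgreSQL in production, with no change to the store's logic.
prod_engine = create_async_engine("postgresql+asyncpg://user:pass@db:5432/a2a")

# store = DatabaseTaskStore(engine=...)  # hypothetical usage; see the SDK docs
```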

To that end, I think it makes sense to commit a ProducerConsumerQueueManager (or similar) that provides the right logic and abstractions for using a distributed producer/consumer paradigm with the QueueManager interface. Then you can implement the Redis-based producer/consumer instance (in the samples repo) to prove the end-to-end behavior.
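A rough sketch of that split, under the assumption that the SDK-side class owns only the producer/consumer wiring and the vendor transport is injected (none of these names exist in the SDK today):

```python
# Hypothetical sketch: the SDK class holds the producer/consumer logic only;
# vendor-specific transport is injected, so a Redis version can live in samples.
from collections.abc import Awaitable, Callable
from typing import Any

Event = dict[str, Any]


class ProducerConsumerQueueManager:
    """Generic producer/consumer wiring; knows nothing about Redis or any broker."""

    def __init__(
        self,
        produce: Callable[[str, Event], Awaitable[None]],
        consume: Callable[[str], Awaitable[Event | None]],
    ) -> None:
        self._produce = produce
        self._consume = consume

    async def publish(self, task_id: str, event: Event) -> None:
        await self._produce(task_id, event)

    async def next_event(self, task_id: str) -> Event | None:
        return await self._consume(task_id)


# A samples-repo implementation would supply Redis-backed produce/consume
# coroutines (e.g. built on XADD/XREAD), proving the abstraction end to end.
```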

If you would like, it would be great to create a separate a2a-redis Python library that includes your extra modules.