maxpert/marmot

CDC as a feature with multiple sinks

JayJamieson opened this issue · 19 comments

I've had a good look at what you've done with this project and I like it a lot. I've been thinking of ways to add CDC to SQLite3 and this abstraction is really nice and simple.

I've managed to "hack" in a CDC event pusher to an HTTP endpoint as an example, but I don't know a lot about the replication protocol, and I found that with multiple nodes the CDC events are published once per node (N times for N nodes).

Wondering if you are open to implementing CDC functionality similar to Debezium, namely dispatching events to different sinks.

I wonder if, instead of building multiple sinks, I could just produce a log file that you can tail, and then use more mature tools like Benthos to decide where each event should go. That way Marmot remains modular, and you can use more sophisticated tools dedicated to this task to build really complex pipelines.
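As a rough sketch of what that could look like, here is a hypothetical Benthos pipeline that tails a change-log file and forwards committed events to an HTTP sink. The file path, the `state` field, and the endpoint are all assumptions for illustration, not anything Marmot actually emits:

```yaml
# Hypothetical Benthos pipeline: read Marmot change-log lines and
# forward each accepted event to a downstream HTTP sink.
input:
  file:
    paths: [ /var/lib/marmot/changes.log ]
    codec: lines

pipeline:
  processors:
    - bloblang: |
        # keep only events accepted by the cluster
        # (the "state" field name is an assumption)
        root = if this.state == "committed" { this } else { deleted() }

output:
  http_client:
    url: http://localhost:8080/cdc
    verb: POST
```

The nice property of this split is that Marmot only has to write lines; everything downstream (filtering, fan-out, retries) lives in the Benthos config.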

Could work; Benthos seems simple enough to configure alongside. I guess when writing to the log file we would only want events accepted by the cluster, right?

In any case, thanks for hearing my suggestion. I'll keep an eye out for it, and if you want help I can pitch in.

One very good suggestion that came up in another discussion forum, and which I think makes sense, is integration with NATS. People usually use NATS or Kafka to get a stream of changes and perform additional actions on it. Implementing a log is really simple; implementing NATS may require some additional work.

I started digging into NATS, and the more I dug in the more I realized that it fits like a glove for what I am doing in Marmot. One of the complex moving pieces I had in Marmot was Raft itself, plus keeping a persistent change-log stream across the nodes. I couldn't hold back my excitement to delete that code and build a version that works fully on top of NATS. I've checked it into a branch, as this will be a major version bump breaking compatibility with 0.3.x.

While I am still testing it, you can check out the nats branch here. Thanks for the suggestion; this really changes the architecture of Marmot itself.

Tested and verified. I am successfully able to test and distribute load over multiple streams, and it's working flawlessly across multiple regions as well. I will clean up the docs and introduce additional material to get this ready for a 0.4.x GA version. It seems like NATS is going to be a permanent part of Marmot now.

@maxpert wow, this is amazing. Really appreciate the effort; I'll keep an eye out for the release.

@JayJamieson v0.4.4 is already out and supports NATS. Check out the new demos on the main page as well.

About sinks using the CDC events:

I assume they are in a NATS queue or similar, so why not just allow others to use NATS to build their sinks, and hence materialized views?

pranadb does exactly that. It's written in Go and uses Kafka, but is open to a NATS driver.

Mira uses NATS to do its data distribution, just like you do!

You can then query via standard SQL.

Check it out:
https://github.com/cashapp/pranadb/blob/main/docs/demo.md

You can do that today already. I am in the process of documenting the JetStreams and subjects that you will need to hook onto. Maybe I will do a demo for that as well. But if you check out logstream/replicator.go you will already see the JetStreams, topics, and serialization format.

@maxpert sounds nice!

Benthos already does CDC on CockroachDB, by the way, so we could use that as a reference/template for the Marmot Benthos plugin. Exciting stuff.

redpanda-data/connect#971

So I ended up writing a sample Deno script that actually demonstrates the idea of having a subscription listener. It's literally one file with less than 200 LoC (loads of configuration). But you gave me a very nice idea! Benthos has NATS as an input, so maybe I can do a demo video of changes from PocketBase getting replicated to Elasticsearch or Cassandra!
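A sketch of that demo as a Benthos config might look like the following. The subject name and index are placeholders, not Marmot's documented values, and this assumes Benthos's `nats_jetstream` input and `elasticsearch` output:

```yaml
# Hypothetical Benthos pipeline: consume Marmot change events from
# JetStream and index them into Elasticsearch.
input:
  nats_jetstream:
    urls: [ nats://localhost:4222 ]
    subject: marmot-changes

output:
  elasticsearch:
    urls: [ http://localhost:9200 ]
    index: marmot_cdc
```

Swapping the output block is all it would take to target Cassandra or any other system Benthos supports.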

That was the idea: use Benthos as a stream transformer into other systems.

Benthos also has a GUI called Benthos Studio where you configure it; it's closed source. But CueLang, which is integrated into Benthos, is the driver of the GUI. This is possible because Benthos and CueLang both compile to WASM.

CueLang is written in Go.

I would be happy to help on this effort if you want. Maybe create a branch? I use CueLang a fair bit.

By the way, check out pranadb as an alternative to Cassandra. It provides materialized views of any DB using Kafka. I am extending it to work with NATS.

https://github.com/cashapp/pranadb

Also check out Zinc as an alternative to Elasticsearch.

https://github.com/zinclabs/zinc

Both are pure Go and use fewer resources. They can also be embedded, rather than having to run as microservices.

these 2 "services" complementary to pocketbase.

Absolutely, the possibilities are limitless; I love the enthusiasm here. As I've said many times, contributions are welcome. I selected NATS for two reasons:

  • JetStream and the built-in Raft replication per stream, allowing me to essentially offload the transaction log stream.
  • Amazing tooling and performance. I benchmarked it, and having used Kafka, I can tell you it's way simpler to operate and way faster.

Now, it can integrate with Benthos if somebody really wants it to, but I won't take Benthos as the first-class tool, because then I am dependent on the user making the right choices about change-log persistence, and how am I to do snapshots and replay them with something that is a generic relay and has no concept of sequence numbers? So I will stick to NATS for change logs. I will provide more options for snapshot save/restore (MinIO, S3, BB etc.), and this tool can become part of the ecosystem.

@maxpert @JayJamieson

Here is CDC of Postgres using Benthos.

https://github.com/disintegrator/benthos-pg-listen

It works really well and has nice examples.


I am also working on Zinc integration with Bleve and Marmot.
https://github.com/zinclabs/zinc

Because then you can index anything: images, files, DB rows.
In fact, benthos-pg-listen-style CDC is exactly what you want to integrate if you want to update a search system like Zinc.

  • Get an event from DB CDC → update the Zinc index.
  • Spread it with Marmot.
  • Now users everywhere can search for any type of data.

I also agree that using NATS for change logs makes it much cleaner.

Maybe a hook from Marmot into Benthos is another approach for people who want to augment the events in various ways? Marmot would then become a first-class Benthos plugin.

  • All your nodes running Marmot would then get the event and do local processing.

Closing, as NATS provides enough of what I was originally after.

Would love to hear your use case, as I collect them. Feel free to drop me a message on Discord.