Support point-in-time backups
akamensky opened this issue · 13 comments
This is a one-off tool (meaning it does not need to keep running after the backup is done), so relying on a background daemon process is odd. There is no need to run Kafka Connect as a daemon at all.
You are right regarding the restore procedure. Restoring is a one-off activity.
The backup is a continuously running activity. There is no "I finished doing a backup" in Kafka, as Kafka data is a stream and there is no end to it. Sure, you can assume that if you did not get any new data for x seconds you are "done", but you cannot generalize that.
Have a look at #46 and #54 for more.
#56 🎉
The backup is a continuously running activity.
This assumes a continuous stream of data 24x7x365, which does not apply to all cases. In our case the stream runs for only X hours per day; the backup happens only after that and is actually intended as a daily backup/snapshot of the data.
I think there should be a way to (internally) detect that there have not been any new messages for X amount of time (a possibly configurable interval), after which the backup process would gracefully exit, thus terminating the process.
Another (possibly simpler) alternative would be to only back up messages up to the timestamp at which the backup was started. I am not sure how this would play together with backing up offsets. Maybe first back up the offsets; then we know the timestamp at which we backed them up and can back up messages up to that timestamp.
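A minimal sketch of that timestamp-cutoff idea in Python (all names here are hypothetical illustrations, not part of kafka-backup): freeze a cutoff the moment the backup starts, and include only records whose timestamp is at or before it.

```python
import time

def make_snapshot_filter(cutoff_ms=None):
    """Return a predicate that accepts only records produced at or
    before the backup start time (the point-in-time cutoff)."""
    if cutoff_ms is None:
        cutoff_ms = int(time.time() * 1000)  # backup start, epoch millis

    def include(record_timestamp_ms):
        # Records newer than the cutoff belong to the *next* backup run.
        return record_timestamp_ms <= cutoff_ms

    return include

# Example: a cutoff at t=1000ms keeps older records, drops newer ones.
accept = make_snapshot_filter(cutoff_ms=1000)
print(accept(999), accept(1000), accept(1001))  # True True False
```

The open question from above remains: whether the cutoff should be taken before or after backing up the consumer-group offsets.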
I see your point. Yeah, probably it would be nice to have a way to do point-in-time backups 🤔
Though, this is not trivial as there is no easy way to decide whether a stream "finished".
What you can do in your case:
- Let Kafka Backup run in the background
- Kafka Backup writes the data continuously to the file system
- `kill -9` Kafka Backup as soon as it is "finished", i.e. it has finished writing your data. This should be shortly after you finish producing data.
- Move the data written by Kafka Backup to your new destination.
I understand that this is quite a common use case and I will provide more documentation for it with #2. For v0.1, documentation is the last big issue, so hopefully this will happen soonish ;)
I see the following approach:
- #54 introduces a new standalone CLI tool. The CLI tool should support this.
- We add a new flag `--snapshot` to the CLI tool (or add a new tool called `backup-snapshot.sh`)

How to detect when a backup is "finished" (only applicable if the `--snapshot` flag is set):
- We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup
- When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data
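The idle-timeout heuristic (no new data for e.g. 20s) could be sketched like this; a hypothetical illustration with the partitions and clock injected, so it is independent of any Kafka client:

```python
class IdleDetector:
    """Track when each partition last delivered a record and report
    once every partition has been idle longer than a threshold."""

    def __init__(self, partitions, start, idle_seconds=20.0):
        self.idle_seconds = idle_seconds
        # Until a partition delivers something, measure idleness from start.
        self.last_seen = {p: start for p in partitions}

    def record_seen(self, partition, now):
        self.last_seen[partition] = now

    def all_idle(self, now):
        # True once *every* partition has been quiet long enough.
        return all(now - t >= self.idle_seconds
                   for t in self.last_seen.values())

d = IdleDetector(["topic-0", "topic-1"], start=0.0, idle_seconds=20.0)
d.record_seen("topic-0", 15.0)
print(d.all_idle(30.0))  # False: topic-0 has been idle only 15s
print(d.all_idle(35.0))  # True: both partitions idle >= 20s
```

Note the weakness discussed below: this relies on finding a quiet window, whereas the timestamp cutoff gives a fixed reference point.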
What do you think?
Let Kafka Backup run in the background
The issue is exactly with this step. We cannot keep it running in the background. We only have a specific window in which we can do the snapshot. It is not up to us to decide when we can do the backup; it is an external regulatory requirement.
We remember the time when the backup is started. All records that have a newer timestamp are ignored during the backup
Yes, that is exactly what I meant, and I think this would remove the requirement of having it run in the background (and of trying to catch the moment when all producers are done).
When a partition does not produce any new data for some time (e.g. 20s) we assume that there is no new data
I think this option is mutually exclusive with the other one. And I think the first one is better, as it gives a specific reference point and does not rely on finding a window in which there are no messages.
Actually I wanted to write that this is nearly impossible with Kafka, but while writing I got an idea for a solution:
The `kafka-consumer-groups` tool returns the current position of the consumer in the partition, but more interestingly it also returns the current end offset of each particular partition. This means there is a way to get the latest offset for a partition at a certain point in time. I currently have no idea how this is achieved (need to check the code).
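For illustration, the same lookup can be done from a client: a hypothetical Python sketch mirroring the two calls the kafka-python client offers for this (`partitions_for_topic` and `end_offsets`), with a stub consumer standing in so the snippet is self-contained:

```python
from collections import namedtuple

# Stand-in for kafka.TopicPartition so the sketch needs no broker.
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])

def snapshot_end_offsets(consumer, topic):
    """Freeze each partition's log-end offset *now*; these targets
    define the point-in-time boundary of the backup."""
    parts = [TopicPartition(topic, p)
             for p in consumer.partitions_for_topic(topic)]
    return consumer.end_offsets(parts)

class StubConsumer:
    """Fakes the two kafka-python KafkaConsumer methods used above."""
    def partitions_for_topic(self, topic):
        return {0, 1}
    def end_offsets(self, partitions):
        return {tp: 100 + tp.partition for tp in partitions}

targets = snapshot_end_offsets(StubConsumer(), "payments")
print(targets)  # one frozen end offset per partition
```

With a real `kafka.KafkaConsumer` in place of the stub, `end_offsets` queries the broker for the current log-end offsets.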
So now there is a clear path to a (more-or-less) point-in-time backup:
- Get the end-of-partition offset for every partition to be backed up (Somewhere here: https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L81 )
- Consume every partition
- As soon as a consumed record has an offset >= the saved one for that partition, remember this and ignore all further records of that partition in the backup. (See https://github.com/itadventurer/kafka-backup/blob/master/src/main/java/de/azapps/kafkabackup/sink/BackupSinkTask.java#L63 )
- As soon as all partitions are up to date, print a message to STDOUT
- Use the wrapper script to detect this message and kill kafka connect gracefully. Similar to how it is solved during restore: https://github.com/itadventurer/kafka-backup/blob/master/bin/restore-standalone.sh#L232-L252
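Putting the steps above together, a hypothetical sketch of the termination logic in plain Python (records as `(partition, offset, value)` tuples; not the actual Connect-based implementation):

```python
def backup_until_targets(records, targets):
    """Filter a record stream for a point-in-time backup.

    records: iterable of (partition, offset, value) tuples
    targets: partition -> end offset frozen when the backup started
    Keeps records below their partition's target and stops as soon as
    every partition has caught up to its snapshot boundary.
    """
    # A partition with target 0 had no data to back up; it starts "done".
    done = {p: tgt == 0 for p, tgt in targets.items()}
    kept = []
    for partition, offset, value in records:
        if offset < targets[partition]:
            kept.append((partition, offset, value))   # pre-snapshot record
        if offset >= targets[partition] - 1:
            done[partition] = True                    # boundary reached
        if all(done.values()):
            break  # here the real tool would print the STDOUT marker
    return kept

targets = {0: 2, 1: 1}  # frozen end offsets per partition
records = [(0, 0, "a"), (1, 0, "b"), (0, 1, "c"), (0, 2, "new"), (1, 1, "new")]
print(backup_until_targets(records, targets))  # the "new" records are excluded
```

The wrapper script would then watch for the "finished" marker on STDOUT and stop Kafka Connect, as in the restore script linked above.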
You see that this is really not that trivial.
My current focus is to improve the test suite and stabilize Kafka Backup for a first release (see https://github.com/itadventurer/kafka-backup/milestone/1). I cannot give you an ETA for that feature. I would be more than happy to review a PR for it (and I am also searching for additional maintainers ;) ). I am happy to help if there are any questions.
You see that this is really not that trivial.
I am more on the operations side of things (setting up and monitoring Kafka clusters etc.), so I trust you on this part. My point is that, from my side of the work, this is something I (and pretty surely many others) do need.
I would be more than happy to review a PR for that (and I am also searching for additional maintainers ;) )
I am not good enough with Java/Scala to be of much help here. If it were Python, C/C++, or at the very least Go, I could help :P
Hello!
First, I'm happy to have found your solution, because I have to back up Kafka topic data.
Second, unfortunately I can't write anything in Java/Scala, so I've prepared a Python 'wrapper' around your 'backup-standalone.sh' as a full backup solution:
https://gist.github.com/FloMko/7adf2e00cd80fe7cc88bb587cde999ce
It would be nice to see any updates about built-in support for point-in-time backups.
Hey,
Thank you for your work! As a temporary workaround I could imagine adding this as an additional script to this repo and replacing it later with a built-in solution. Feel free to add it as a pull request :) (to `./bin/kafka-backup-point-in-time.py` or something else ;) )
I am about to publish a completely separate implementation written in Go that doesn't rely on the Connect API. Just FYI. We are already using it in our production environment.
@akamensky could you share your solution? As long as you have tested it, it'll be fine.
Thank you @WesselVS for your PR #99! I have just merged it to master. Will do a release with this enhancement and some other fixes soonish.
@akamensky Cool! Great to see some more work regarding Kafka Backups ;)