Publishing needs to be fully synchronous
Closed this issue ยท 10 comments
Okay, so looking through some of the problems here, publishing really needs to be a fully synchronous operation that does not return until all indexes are synced. See
Line 91 in ab468b1
The reasons this needs to be fixed are numerous:
- The fact that it's not is causing all kinds of problems with the tests. If the expectation from the folks writing the tests is that it's a synchronous operation, and they know how it's implemented internally, then user expectations are likely to be the same. This has the potential to cause difficult-to-trace issues in any other project using
go-ssb
components in the future. - Publishing a second message to the publish log by the same author requires that the previous message be indexed (see #292). Which means unless publishing is synchronous, we'd be relying on the caller to ensure that all indexes are synced before calling publish, and everyone using the publishing mechanism needs to know this and do this every time they go to publish. This is (a) difficult, or (b) impossible to ensure, especially in a multithreaded environment.
- There really needs to be a mutex on the log so that no new publishing can happen while waiting on the indexes to catch up. This needs to be done in one place. Within the publish log seems to me to be the most logical place to implement that.
This would fix #292 as well as #289, and probably several of the other flaky tests in #237. This is core enough functionality and broken enough to cause a lot of issues to due race conditions pretty much everywhere, especially in the test suite where it often publishes a bunch of messages in quick succession.
Thanks for raising!
publishing really needs to be a fully synchronous operation that does not return until all indexes are synced ... There really needs to be a mutex on the log so that no new publishing can happen while waiting on the indexes to catch up.
It would certainly makes things easier to reason about in the code. But then a working implementation of index syncing would probably also be workable. We're struggling to put that together atm but it's not unreasonable to believe it can be figured out.
One concern I have is that indexing takes a looonngg time, especially when replicating several GB from scratch? This could significantly slow things down? I'm not sure though, because the indexing of one message at a time is potentially negligble?
I think the original intent was to synchronise concurrent processes via mutex locks/unlocks which would allow message appending to continue unhindered by indexing? But looking up messages via index would then have to wait. I think that is the standard approach from my observation of clients like Patchwork.
This still seems like a legit approach, it's just tricky to understand how to make it work (have messaged cryptix for advice, thoughts Coming Soon โข๏ธ).
Are you suggesting that this approach is not feasible @KyleMaas?
we'd be relying on the caller to ensure that all indexes are synced before calling publish, and everyone using the publishing mechanism needs to know this and do this every time they go to publish.
For the client caller logic, we could always re-work stuff to make the "wait for index sync" more explicit. Either adding flags to publishing or making the MUXRPC return early while stuff is still running. I'm not sure what is best. I think might be better to discuss separately to fixing this sync or async question tho.
publishing really needs to be a fully synchronous operation that does not return until all indexes are synced ... There really needs to be a mutex on the log so that no new publishing can happen while waiting on the indexes to catch up.
It would certainly makes things easier to reason about in the code. But then a working implementation of index syncing would probably also be workable. We're struggling to put that together atm but it's not unreasonable to believe it can be figured out.
One concern I have is that indexing takes a looonngg time, especially when replicating several GB from scratch? This could significantly slow things down? I'm not sure though, because the indexing of one message at a time is potentially negligble?
I think the original intent was to synchronise concurrent processes via mutex locks/unlocks which would allow message appending to continue unhindered by indexing? But looking up messages via index would then have to wait. I think that is the standard approach from my observation of clients like Patchwork.
This still seems like a legit approach, it's just tricky to understand how to make it work (have messaged cryptix for advice, thoughts Coming Soon tm).
Are you suggesting that this approach is not feasible @KyleMaas?
If that's the desired user experience, consider the following: Let's say a user is recovering their feed and it has downloaded but indexing hasn't caught up. They publish a message. If you want to allow indexing to happen concurrently, then one of two things have to happen:
- The publishing process needs to bypass the index to find the latest message the user posted.
- The publishing process has to wait until the indexes catch up.
we'd be relying on the caller to ensure that all indexes are synced before calling publish, and everyone using the publishing mechanism needs to know this and do this every time they go to publish.
For the client caller logic, we could always re-work stuff to make the "wait for index sync" more explicit. Either adding flags to publishing or making the MUXRPC return early while stuff is still running. I'm not sure what is best. I think might be better to discuss separately to fixing this sync or async question tho.
One way you might be able to do this is to have the publishing system check if indexes are caught up. If they are, use them. If not, query the log directly to find the latest message. That way, it could not accidentally fork the feed. But that means we still need some way to figure out if the indexes are caught up, which leads back to #251. I would think, though, that that could handle the use cases you mentioned quite well at the expense of it being a little slower to publish new messages until the indexes are caught up. But at least that way you could guarantee clean operation without race conditions.
That make sense?
One of the other problems I see with handling caller to do index synchronization outside of the publish mechanism is that it breaks the principle of encapsulation. The caller really shouldn't be relying on the implementation details of what happens inside of the publishing system. It should be able to be treated as a black box, with implementation details opaque to the caller.
As for doing publishes asynchronously, that makes testing really difficult, and what happens if there is a crash or some kind of integrity violation between the time the caller asked to publish and when publishing was meant to happen? From a program perspective, it's something that could happen, but from a user perspective what happens if they publish something, expecting it to actually happen, and then indexing takes several hours past that? Would it be reasonable to assume that they're expect to wait until indexing finishes before their message shows up? How would they know when it's safe to close the app/program? In my mind, it adds a lot of needless complexity for something that a user cannot reasonably expect to work that way, and would very much increase the burden on the UI layer to handle all the error modes. If you were building an app around this, you'd almost need to implement an Outbox kind of thing so if the user wanted to close the app they wouldn't lose everything they've done so far if indexing wasn't complete, which then just adds another synchronous storage layer on top of the core.
The publishing process has to wait until the indexes catch up.
Would it be reasonable to assume that they're expect to wait until indexing finishes before their message shows up?
This is the default Patchwork/Manyverse behaviour? Posting is blocked while indexing is happening tho, so you just can't post.
I think it's a great perspective to think about this from a client implementation. I'm very much up for synchronous operations and I'd like to be able to judge the compromise better. I feel a bit uncomfortable attempting to switch without understanding how it supposed to work and the design decisions around that. Sure I can read the code but I do feel like I'm missing a lot of the context around it. I'm hoping to get a mindmeld from cryptix and also boreq if time allows. Will report back / let you know what is organised so you can join.
The publishing process has to wait until the indexes catch up.
Would it be reasonable to assume that they're expect to wait until indexing finishes before their message shows up?This is the default Patchwork/Manyverse behaviour? Posting is blocked while indexing is happening tho, so you just can't post.
I believe it is. But is that what the user would expect, and is that the best it could be? If we could guarantee at the publish layer that it won't fork the feed under normal expectations, I'd see it as a huge win for user friendliness.
I think it's a great perspective to think about this from a client implementation. I'm very much up for synchronous operations and I'd like to be able to judge the compromise better. I feel a bit uncomfortable attempting to switch without understanding how it supposed to work and the design decisions around that. Sure I can read the code but I do feel like I'm missing a lot of the context around it. I'm hoping to get a mindmeld from cryptix and also boreq if time allows. Will report back / let you know what is organised so you can join.
The more I think about it, the more I think the "use the indexes if they're caught up, otherwise query the log directly" method would probably be the ideal way to do it. It would allow for the asynchronous index processing use case you mentioned while not making testing overly complicated and not breaking typical user expectations from the way posting/sending messages works on other platforms. I mean, if we've already got internal developers with that existing expectation for how it should work, and we'd have to add synchronization code to pretty much all of them, it would simplify things a ton to abide by those expectations rather than changing everything surrounding it. The only obstacle is that we'd need a way to query the index state to determine which method to use, and that comes back to #251. If that was done, doing the hybrid method like this would be trivial to implement.
"use the indexes if they're caught up, otherwise query the log directly"
Yeh, I think you're right on this. Gonna ask other ssb devs to see what the thoughts are.
The more I look at this, the more I think this is actually an upstream Margaret problem. I just did two pull requests related to this - the query one should be ready to go now and is useful regardless, but the actual fix is incomplete because after reading through Margaret's code I don't think there is a way to query the main log without running into the same kind of race condition within Margaret. So I want to try and isolate this issue a little further before attempting to fix it here because I think the problems are actually far more wide-ranging and at a deeper level than we've been looking at so far. If I can prove that it's coming from where I think it is, a simple fix in Margaret could potentially fix a whole raft of problems in one fell swoop.
@KyleMaas interesting! Cheering you on as always ๐ #302 looks legit and I see where you're going with #303. Ping me on any issues Margaret side and I'll try to think along. I've been unpacking the dev.scuttlebutt.nz docs on it for now. Will give another few days for cryptix to have a pass on #301 but if that doesn't come through, I could make a move on the merging in the weekend and ofc feedback can always be taken into account later.
@decentral1se Sounds like a plan. I'm interested to see how many more tests we can fix once we have a properly working WaitUntilIndexesAreSynced()
- there are a lot of tests with TODO items mentioning that it needs to happen, but prior to this there was really no good way to accomplish that. Be interesting to see how far this takes us toward test stability.
So, after doing a lot of reading through the code of Margaret to try and figure out how indexes and queries work, I have to revise my statement of "doing the hybrid method like this would be trivial to implement". I'm just not finding any way to query Margaret without either (1) querying an index, or (2) stepping through every message past the point where the indexes are to reach the end of the log. So, with that said, I think we need to go with the "Posting is blocked while indexing is happening tho, so you just can't post." route for now and maybe revise that at a later date. One way or another, the call will be synchronous and should fix a bunch of the test flakiness problems.
At least now if someone builds a GUI program using go-ssb
as a base, there will be a call they can use to make sure the entire program doesn't block when someone tries to publish a new message.