gocardless/statesman

Bulk Transition Feature Request

skatenerd opened this issue · 5 comments

I am wondering if it makes sense to have some sort of "bulk transition" or "bulk trigger" functionality, where I can supply a big list of models (or IDs), and an event, and get the behavior that would come with mymodels.each { |model| model.transition(from: :foo, to: :bar) }.

I think this is a use case many people probably have, and one which has pretty slow performance. I am hoping there's a way to transition many records, using fewer queries. I understand that this might not be possible, with concurrency concerns, etc. I am also totally open to having a significantly restricted API for the "bulk" version of a given task.

Anyway, please let me know if this seems realistic, or if there's a recommended workaround that I might use to do something like this in my own app!

I guess this is going to be tough since Rails 6 is the first version to offer bulk-inserts.

I've been thinking about what an "optimistic" approach to concurrency would look like here. Basically, look up all of the sort-keys of the existing most-recent transition records. Then begin a single transaction where the new transitions are inserted, and the relevant most_recent flags are updated. This way, if someone moves the DB into an unexpected state under my nose, my whole transaction will fail due to uniqueness constraint on the sort-keys.
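The optimistic approach could be sketched roughly like this. Everything here is hypothetical: it assumes Statesman's convention of a unique (parent_id, sort_key) index with sort_key steps of 10, and `latest_sort_keys` stands in for whatever query fetched the current most-recent transitions:

```ruby
# Build the new transition rows for a bulk insert, based on the sort_keys
# we observed. Hypothetical sketch, not Statesman's actual API.
def build_bulk_transitions(model_ids, latest_sort_keys, to_state)
  now = Time.now.utc
  model_ids.map do |id|
    {
      parent_id: id,
      to_state: to_state.to_s,
      # Next sort_key after the most recent transition we observed. If a
      # concurrent process inserts a transition first, the unique
      # (parent_id, sort_key) index makes our whole transaction fail,
      # which is exactly the optimistic-concurrency behaviour we want.
      sort_key: (latest_sort_keys[id] || 0) + 10,
      most_recent: true,
      created_at: now,
      updated_at: now,
    }
  end
end
```

A single transaction would then flip the old rows' most_recent flags and bulk-insert these rows; any row built from stale data would violate the index and roll the whole thing back.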

Alternatively, I could actively lock/select for update those most-recent transition records, which would keep anyone from executing any harmful transitions under my nose.
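The pessimistic alternative, again purely as an illustrative SQL-building sketch (table and column names assumed from Statesman's conventions):

```ruby
# Lock the most-recent transition rows up front, so no concurrent
# transition can slip in while the bulk job runs.
def lock_latest_transitions_sql(table, parent_ids)
  <<~SQL
    SELECT id, parent_id, sort_key
    FROM #{table}
    WHERE parent_id IN (#{parent_ids.map(&:to_i).join(", ")})
      AND most_recent = TRUE
    ORDER BY parent_id
    FOR UPDATE
  SQL
end
```

The ORDER BY gives concurrent bulk jobs a consistent locking order, which reduces the deadlock risk; with ActiveRecord this is roughly `.where(...).order(:parent_id).lock`.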

Hey @skatenerd, thanks for your suggestion, this is something that I've thought about in the past, I know @lawrencejones has also, but neither of us have found the time to dig further.

Also, as you rightly pointed out, there are a few things to consider, but it is definitely achievable and would be hugely valuable to many users (GoCardless included haha). Another potential complication is that recent changes mean MySQL users create transitions in a slightly bespoke way, so this would likely have to be implemented in two different ways. Callbacks will also need some careful consideration.

In short, it's definitely realistic, the primary blocker is unfortunately finding the time. That said I will aim to dedicate some time to give this more thought soon (even if that means a hacky POC) though I don't want to make any promises 😬 .

One note with regards to callbacks: I was thinking maybe we could just expose a lower-level API here. It's trickier, as it introduces new questions (like what happens when individual transitions in the bulk job are illegal), so it might make sense to build in more restrictions on what it would do.

I don't have free time to help with this these days but I'm asking my employer if they might let me do this on company time.

Hey @skatenerd, thanks for opening this issue.

I agree entirely that this is complex. I've considered this before, and some of the roadblocks would be:

Postgres conflicts, i.e. unique index violations

We use Postgres constraints as a form of concurrency control right now, so that no transition could race another and try taking the same resource into two different states. For us to get the benefits of bulk transitioning, we'd need to handle those conflicts via the Postgres command that attempts the transition, so via an ON CONFLICT DO NOTHING or perhaps some other alternative action.

Any alternative would mean our bulk transition API would fail even for a single error, which might not be what we want.
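As a hedged sketch of the ON CONFLICT route: build one bulk INSERT that silently skips rows losing a race, and use RETURNING to tell the caller which rows actually won, so the bulk API can report partial success rather than failing outright. Names and the quoting helper here are assumptions, not Statesman's real schema:

```ruby
# Build a single bulk INSERT that skips conflicting rows instead of
# aborting the whole statement. Postgres-specific (ON CONFLICT, RETURNING).
def bulk_transition_insert_sql(table, rows)
  columns = rows.first.keys
  values = rows.map do |row|
    "(" + columns.map { |c| format_sql_value(row[c]) }.join(", ") + ")"
  end.join(", ")

  <<~SQL
    INSERT INTO #{table} (#{columns.join(", ")})
    VALUES #{values}
    ON CONFLICT DO NOTHING
    RETURNING parent_id
  SQL
end

# Naive quoting for illustration only; real code should use the
# adapter's quoting, never string interpolation.
def format_sql_value(value)
  case value
  when Integer then value.to_s
  when true, false then value.to_s.upcase
  else "'#{value}'"
  end
end
```

For what it's worth, Rails 6's `insert_all` emits a similar ON CONFLICT DO NOTHING statement on Postgres, which is one reason bulk inserts only became practical there.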

Transitions

State transitions are the reason I decided not to look at bulk transitioning, at least for our use case. As transitions are pure Ruby, they can do all sorts of things that we can't push into the database layer in a bulked fashion.

The majority of the time spent transitioning our objects is spent in transitions, so unless we can somehow batch the transition steps, the speedup we get from bulk transitioning is limited.

Of course, this might be different for you. There may also be an interesting middleground where we support bulk transitioning for state machines which define transition hooks in a form that can be pushed into a SQL dialect, and processed in bulk inside the database. That's an interesting proposition for sure.

Lock lifetimes

If we're bulk transitioning, we'll hold locks for at least as long as the bulk update takes to process. This might hurt for OLTP systems that require locks to be short-lived, say a system serving an HTTP API where people expect API requests to translate into statesman transitions.

Longer transitions could mean we hurt API performance. Fwiw, we've compromised in this area already by wrapping our high throughput transition code in small transactions, held for about 3-5s. It gives us a performance boost that we're happy to exchange for the odd request taking ~3s longer, if it happens to collide with the work. But any bulk API would have to make this compromise clear.
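One way to keep that compromise bounded, sketched illustratively (`transition_batch!` is a stand-in for whatever bulk primitive ends up existing), is to chunk the job into small per-transaction batches so no lock outlives a single batch:

```ruby
BATCH_SIZE = 500

# Process a bulk transition in fixed-size chunks, each intended to run in
# its own short transaction, so locks are held per batch rather than for
# the whole run.
def bulk_transition_in_batches(model_ids, to_state, batch_size: BATCH_SIZE)
  model_ids.each_slice(batch_size) do |batch|
    # In real code, each yield would be wrapped in a transaction, e.g.:
    # ActiveRecord::Base.transaction { transition_batch!(batch, to_state) }
    yield batch
  end
end
```

The trade-off is that a failure partway through leaves earlier batches committed, so any bulk API built this way would need to make partial completion part of its contract.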


These were my thoughts on the issue. As with all performance chasing, it would be good to get an idea of where you're spending your time right now.

Is it in the transition! work, where we're actually getting, inserting and updating the transition rows? Is that down to network latency, or is it the database genuinely doing work? We might be able to improve life more significantly by converting the Ruby code that performs transitions into a stored procedure, if network latency is the bulk of the time spent.

Do you have any of this data on hand?

As an addendum, the constraints we've created via Statesman transitions are IMMEDIATE, rather than DEFERRED. I've always wondered if we can squeeze something out of deferring the constraint checks: it makes no sense, for example, that the transition transaction doesn't defer the unique constraint on most_recent until it's updated all the relevant rows. This causes us to execute more queries than we might need, to ensure we set most_recent=false on the original transition before setting most_recent=true on the newer one.

I imagine any work on bulk transitioning would benefit from an awareness of this.
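The deferral idea, sketched as the statements a transaction might issue. This assumes the uniqueness is expressed as a DEFERRABLE constraint; note Postgres only allows DEFERRABLE on constraints, not on plain (or partial) indexes, so Statesman's schema would need restructuring before this works:

```ruby
# Hypothetical: with a DEFERRABLE constraint, the insert/update order inside
# the transaction no longer matters, because the unique check on most_recent
# fires once at COMMIT instead of per statement.
def deferred_transition_statements(table, parent_id, new_sort_key, to_state)
  [
    "SET CONSTRAINTS ALL DEFERRED",
    "INSERT INTO #{table} (parent_id, to_state, sort_key, most_recent) " \
      "VALUES (#{parent_id}, '#{to_state}', #{new_sort_key}, TRUE)",
    "UPDATE #{table} SET most_recent = FALSE " \
      "WHERE parent_id = #{parent_id} AND most_recent = TRUE " \
      "AND sort_key < #{new_sort_key}",
  ]
end
```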

Thanks for all of the thoughtful responses.

The majority of the time spent transitioning our objects is spent in transitions

What does this mean exactly? Are you referring to the before/after hooks?

I don't have data on latency right now, but if I take 1000 models and arbitrarily transition them (mymodels.each { |m| m.state_machine.transition_to!(:foo) }), with no hooks or anything, it can take upwards of 60 seconds. There is significant variability, at least in our DB, in the update user_transitions set most_recent=false where ... calls - they range from less than 1ms all the way up to ~80ms, depending on how many transitions already exist for that user. This may just be a problem with tuning my specific database, but it is orders of magnitude faster (~500ms) for me to say something like MyTransition.where(user: User.in_state(:happy).limit(1000), most_recent: true).update_all(most_recent: false)

As for long lock lifetimes, this is just speculation, but an optimistic approach may be the way to go. IOW, having an approach that works happily if no concurrent transitions happen on the selected models, but which fails or lets the user try again in the case of contention. I really don't know what semantics would make the most users happy here, though.

Perhaps a practical thing to do would be to build some bulk-transitioning functionality into my own app, and if it works well for my team's needs, then I'll present it here and see what people think of the semantics/API/etc. This is going to be really tough to do in a database-agnostic way.