DSD-DBS/capella-polarion

Decrease the number of requests and handle timed-out operations


The number of requests fired is disproportionate to the actual changes made to a Polarion project. For example, one pipeline run produced over 4177 requests, which led to only 3 changes.

In order to improve the situation, we can do the following things:

  • Split up the model_elements sync command into CRUD of work items and CRUD of work item links
  • Add and use a checksum attribute on work items to calculate relevant requests (@micha91's idea)
  • Instead of firing out GET requests repeatedly to get a work item map, do it once and maintain changes (CRUD) in this mapping; then write this mapping to a file and use it as an artifact in consecutive commands.

Multiple pipelines running in parallel can affect each other through timeouts. To handle this error, the REST API client can help with the following strategies:

  • on receiving a timeout exception in httpx, back off and wait for a moment before retrying the request
  • push failed requests to a queue and retry them later in batched requests
  • control the speed and frequency of requests with rate limiting

Summoning @Wuestengecko @vik378 and @micha91 for your additional input.

My thoughts about this (which I also outlined in today's daily):

Add and use a checksum attribute on work items to calculate relevant requests

IMO the checksum approach would be a good idea for the diagrams, but maybe not so much for general-purpose work items. For diagrams it's 100% obvious what needs to be checksummed (the image file). Other WIs might present more of a "moving target", which could either easily lead to false-positive checksum mismatches or a complicated and therefore error-prone hashing algorithm – especially in the face of config updates (which, admittedly, should happen rather infrequently).

Since non-diagrams should (to my understanding) be a lot smaller than diagrams (as in: a couple hundred bytes to a few KB vs. a couple hundred KB to a few MB), it should be viable to simply download their current content and compare them as dictionaries. If the server supports it, we could even extend this to compare key-by-key and only send the fields that actually changed as a PATCH request, which would reduce request sizes even further.
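
To illustrate the key-by-key idea, a minimal sketch (the dict shape and field names are made up; the real work item data is of course richer):

```python
def build_patch(remote: dict, local: dict) -> dict:
    """Return only the fields whose local value differs from the remote one."""
    return {key: value for key, value in local.items() if remote.get(key) != value}


remote_wi = {"title": "Logical Component", "description": "<p>old</p>", "status": "open"}
local_wi = {"title": "Logical Component", "description": "<p>new</p>", "status": "open"}

patch = build_patch(remote_wi, local_wi)
if patch:  # {"description": "<p>new</p>"} -> only this field goes into the PATCH body
    print(patch)
```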

Instead of firing out GET requests repeatedly to get a work item map, do it once and maintain changes (CRUD) in this mapping; then write this mapping to a file and use it as an artifact in consecutive commands

Not a fan. This is a lot of complexity that we'd load onto ourselves to create and maintain such a mapping, and IMO it's also a relatively fragile solution (what happens if something changes on Polarion that we aren't aware of?). If we can, I'd rather avoid this option.

Multiple pipelines running in parallel can affect each other through timeouts

My vote goes to (randomized) back-off-retry. Not only on request timeouts, but also for temporary server-side errors, like 500 and 502-504.

If we're batching requests already anyway (are we? I might not be up to date on that), pushing failed requests to a queue for later won't change the request content, so it's really just back-off-retry with extra steps.

Rate limiting (whether server or client side) is IMO only a good idea if it's a concern that the server stays responsive for interactive users during the automated sync. We can pretty much ignore this if pipelines only run during the night anyway, but it might be something to look into for on-demand pipelines that can happen during the day as well. Still, I'd say we only look into this if interactive use actually starts to be impacted by the pipelines, and not any earlier. If there's no problem to begin with, it would only be an unnecessary slowdown.

IMO the checksum approach would be a good idea for the diagrams, but maybe not so much for general-purpose work items. For diagrams it's 100% obvious what needs to be checksummed (the image file). Other WIs might present more of a "moving target", which could either easily lead to false-positive checksum mismatches or a complicated and therefore error-prone hashing algorithm – especially in the face of config updates (which, admittedly, should happen rather infrequently).

I agree with your opinion. A false-positive checksum mismatch isn't the worst case, because then we'd just do the PATCH request and the REST API would notice that the description didn't change. IMO it's far worse if we get a false-positive match, such that no PATCH request is done even though some other custom fields on diagrams changed (looking at the new reverse grouped links custom fields). I will get back to this in my closing paragraph.

Since non-diagrams should (to my understanding) be a lot smaller than diagrams (as in: a couple hundred bytes to a few KB vs. a couple hundred KB to a few MB), it should be viable to simply download their current content and compare them as dictionaries.

Instead of actually comparing dictionaries (a problem we already solved in the past, by the way), I'd rather hash the response content and compare the hashes (KISS). It's faster and less complex.
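
Roughly something like this (the exact serialization is an open detail; the important part is that it is deterministic, e.g. sorted keys):

```python
import hashlib
import json


def checksum(work_item_data: dict) -> str:
    """Hash a deterministic serialization of the work item content."""
    serialized = json.dumps(work_item_data, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(serialized.encode("utf-8")).hexdigest()


remote = {"title": "Logical Component", "description": "<p>old</p>"}
local = {"title": "Logical Component", "description": "<p>new</p>"}

# Only when the checksums differ do we need to build and send a PATCH request.
needs_patch = checksum(remote) != checksum(local)
```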

If the server supports it, we could even extend this to compare key-by-key and only send the fields that actually changed as a PATCH request, which would reduce request sizes even further.

The REST API allows that. We have used that since day one.

Instead of firing out GET requests repeatedly to get a work item map, do it once and maintain changes (CRUD) in this mapping; then write this mapping to a file and use it as an artifact in consecutive commands

Not a fan. This is a lot of complexity that we'd load onto ourselves to create and maintain such a mapping, and IMO it's also a relatively fragile solution (what happens if something changes on Polarion that we aren't aware of?). If we can, I'd rather avoid this option.

I'm also not a fan of this, but afaik it's the only way to reduce the number of GET requests, even though those are the ones we can actually batch (POST too, but not PATCH) and enrich with additional content via the include parameter. So they are probably not the cause of the problem here (!?).

Regarding fragility: I think a good process would be to have dedicated sync projects where it is clear to all involved parties that the Capella-Polarion Bridge rules the content. Any changes by humans would be overwritten or not maintained on non-fixated work items or live docs. On projects where humans are working, references from the sync projects are used. Since we are planning a full round-trip bridge, humans should expect errors when they interfere with this closed process. I think it would be neat if it didn't matter where humans interfere: changes to Polarion work items would be synced to Capella model objects and vice versa. And that would completely wipe out the idea of the mapping.

Multiple pipelines running in parallel can affect each other through timeouts

My vote goes to (randomized) back-off-retry. Not only on request timeouts, but also for temporary server-side errors, like 500 and 502-504.

(Randomized) back-off-retry, so you mean retrying the request at a randomized frequency until it succeeds? Obviously with a limit. I'm not sure I fully get your idea here.

If we're batching requests already anyway (are we? I might not be up to date on that), pushing failed requests to a queue for later won't change the request content, so it's really just back-off-retry with extra steps.

There are endpoints on the REST API which allow for batching. Updating work items (PATCH) does not support batched requests, so we do these for each work item individually. So in solving the problem I'd start with filtering the updates via the checksum comparison and then see if the problem persists.

Rate limiting (whether server or client side) is IMO only a good idea if it's a concern that the server stays responsive for interactive users during the automated sync. We can pretty much ignore this if pipelines only run during the night anyway, but it might be something to look into for on-demand pipelines that can happen during the day as well. Still, I'd say we only look into this if interactive use actually starts to be impacted by the pipelines, and not any earlier. If there's no problem to begin with, it would only be an unnecessary slowdown.

Yes, you are right... but it is also what some users want: Manual triggering of pipelines.

Upshot

We store a checksum custom field only on WIs of type diagram. This checksum covers just the description (i.e. the embedded SVG). To stay sane and not mix too many things together, we split the synchronization commands into:

  1. sync of diagrams (CUD)
  2. sync of model-elements
  3. sync of work item links (CUD) (this is still included in 2., we want to split that into its own, separate command)
  4. sync of grouped links attributes (CUD)

(Yes, these commands should be executed in this order, but they don't have hard dependencies on each other. If there are no WIs of type diagram, you can't expect diagram_reference links to model-elements from 3. Similarly, 4. requires links, which are handled by 3.)

Only updating can't be batched, which is why the comparison via checksum is done there. The checksum/hashing depends on the work item type when we update WIs. WorkItemLinks are never updated: if we have new links, we create them; if we have dead links, we delete them.
Does everyone agree on this strategy? @Wuestengecko @micha91 @vik378

You weren't there this morning, so let me briefly make sure we're all on the same page about this.

@micha91 explained that the current implementation makes some 4000-ish requests against a project containing about 2.5k work items, only to update 3 of them in the end.

You explained:

Instead of actually comparing dictionaries, I'd rather hash the response content and compare the hashes.

If the server supports it, we could [...] only send the fields that actually changed as a PATCH request.

The REST API allows that. We have used that since day one.

Assuming a bug-free implementation (big assumption, I know ;) ), this would mean that, of those 4000 requests, three of them were actually PATCH requests.

But if that's correct, this next part doesn't make sense:

There are endpoints on the REST API which allow for batching. Updating work items (PATCH) does not support batched requests, so we do these for each work item individually. So in solving the problem I'd start with filtering the updates via the checksum comparison and then see if the problem persists.

So what is going on now? What actually are those requests that are being made? I'm seriously confused right now. Should I just go and read the code? :D

(To make it clear, I definitely do agree that we need to do something to get down from those 4k requests. I just don't know what currently.)


(Randomized) back-off-retry, so you mean retrying the request at a randomized frequency until it succeeds?

Randomized as in, we wait a random number of seconds (say, between 5 and 15) instead of having a fixed back-off timer of e.g. 10 seconds. This should minimize the chance of two parallel pipelines continuing at the same time and just stepping on each other's feet over and over.

I'm fine with just having a single retry at first, if that makes the implementation simpler :)
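
As a sketch of what I mean (single retry, a random 5-15 second wait, also covering the temporary 5xx errors; the wrapper itself is made up, not existing code):

```python
import random
import time

import httpx

RETRIABLE_STATUS = {500, 502, 503, 504}


def request_with_retry(client: httpx.Client, method: str, url: str, **kwargs) -> httpx.Response:
    """Send a request; on a timeout or temporary server error, wait a random
    5-15 seconds and retry exactly once."""
    try:
        response = client.request(method, url, **kwargs)
        if response.status_code not in RETRIABLE_STATUS:
            return response
    except httpx.TimeoutException:
        pass

    # Randomized back-off so parallel pipelines don't retry in lockstep.
    time.sleep(random.uniform(5, 15))
    return client.request(method, url, **kwargs)
```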


Rate limiting (whether server or client side) is IMO only a good idea if it's a concern that the server stays responsive for interactive users during the automated sync.

Yes, you are right... but it is also what some users want: Manual triggering of pipelines.

That's not a contradiction: if these manually triggered pipelines don't bring down the server, there's no problem we need to solve. We can talk about this all day; in the end we'll have to see how well it runs in prod.


We store a checksum custom field only on WIs of type diagram. This checksum covers just the description (i.e. the embedded SVG).

Sounds good. Exactly what I had in mind.

What about the other (non-diagram) work items though? Do they get a checksum now as well (like you described that you wanted to do), or do they not?

To stay sane and not mix too many things together, we split the synchronization commands into: [...]

(Yes, these commands should be executed in this order, but they don't have hard dependencies on each other.)

This is unnecessarily confusing. How about making a single CLI entry point with a few enable/disable flags, which calls the requested operations in the correct order? Although I suppose a CI script comes reasonably close to that, if it's only intended to run in CI anyway.
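
Something along these lines, maybe (command and flag names are invented, and the helpers are just placeholders for the existing sync steps):

```python
import click


def sync_diagrams() -> None: ...          # placeholders standing in for the
def sync_model_elements() -> None: ...    # individual synchronization commands
def sync_work_item_links() -> None: ...
def sync_grouped_links() -> None: ...


@click.command()
@click.option("--diagrams/--no-diagrams", default=True, help="Sync diagram work items.")
@click.option("--elements/--no-elements", default=True, help="Sync model element work items.")
@click.option("--links/--no-links", default=True, help="Sync work item links.")
@click.option("--grouped-links/--no-grouped-links", default=True, help="Sync grouped link fields.")
def sync(diagrams: bool, elements: bool, links: bool, grouped_links: bool) -> None:
    """Run the enabled synchronization steps in their required order."""
    if diagrams:
        sync_diagrams()
    if elements:
        sync_model_elements()
    if links:
        sync_work_item_links()
    if grouped_links:
        sync_grouped_links()


if __name__ == "__main__":
    sync()
```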

Where do the requests come from

I want to first ensure that we are all on the same page here:
If we have n WorkItems in Polarion and m elements/diagrams in Capella, we currently need n/100 GET requests to fetch these WorkItems. We also currently perform n PATCH requests whenever we do a sync. In addition, we perform between (m-n)/100 and m-n POST requests to create new elements (depending on the body size). IMO we should focus on reducing the number of PATCH requests here, as the GET requests don't really make the difference. For this reason I would not go for the local WorkItem caching approach, as it does not save many requests, can be error-prone and would be a high work effort.
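
To put rough numbers on that (n matches the ~2.5k work items mentioned earlier in this thread; m is only assumed for illustration):

```python
n = 2500  # work items already in Polarion (roughly the number mentioned in this thread)
m = 2520  # Capella elements/diagrams, assumed here purely for illustration

get_requests = n // 100                  # 25 paginated GETs for the work item map
patch_requests = n                       # 2500, one per existing work item
post_requests = max((m - n) // 100, 1)   # 1 (fully batched) up to m - n individual POSTs

print(get_requests + patch_requests + post_requests)  # 2526 -> the PATCHes dominate
```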

When it comes to linking WorkItems, we have an additional big problem with the number of requests. If we want to get the current links of the WorkItems, we have to perform m requests, because the API is meant to be used per WorkItem here. To be honest, I don't really have a solution for this right now; I see two possible approaches and both have downsides:

  • We could use the WorkItems API and include the LinkedWorkItems here. I tried that already and wasn't able to figure out how to set the fields parameter to get them in the WorkItem's relations, but if I set fields to @all it works. Downside: there is a limit on related objects in responses.
  • We could instead work with the fields @ewuerger is currently working on. These also represent the links of a WorkItem, and we could just generate them on our side, compare them with the data from the WorkItem, and update the links based on that.
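
For the second option, the actual link maintenance would then boil down to a simple set difference (the tuples below are purely illustrative):

```python
# Links represented as (role, target work item ID) tuples.
current_links = {("relates_to", "PROJ-123"), ("diagram_reference", "PROJ-200")}
expected_links = {("relates_to", "PROJ-123"), ("relates_to", "PROJ-321")}

links_to_create = expected_links - current_links  # {("relates_to", "PROJ-321")}
links_to_delete = current_links - expected_links  # {("diagram_reference", "PROJ-200")}
```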

What a more efficient process could look like

I would prefer to have checksums for all WorkItems. If we create a checksum for a WorkItem, we could integrate this directly into the WorkItem class. As all attributes we may introduce in the future will go into the additional_attributes dict, we can easily create this checksum without having to care about future additional fields; they will be included in the checksum automatically (see the sketch after the list below). This way we could go for the following approach:

  1. Get the mapping from Capella UUIDs to Polarion WorkItem IDs.
  2. We identify which WorkItems have to be newly created and post them
  3. Again get the mapping from Capella UUIDs to Polarion WorkItem IDs. In the same get_all_work_items request, we also request the checksum of each WorkItem from Polarion. This way, we don't have to get all attributes of all WorkItems here, which would result in higher traffic. The downside is that we would have to add newly introduced attributes to the list of requested fields whenever we introduce a new custom field. We could also use the @all keyword for the fields here, but this is strongly discouraged in the documentation.
  4. We create the WorkItems from our Capella elements/diagrams
  5. We identify all links and add them to the WorkItems' custom fields (the ones @ewuerger is currently working on)
  6. We are now able to calculate the checksums locally and start comparing them with the ones we got from the API
  7. For those WorkItems that need an update, we get the WorkItems individually with all fields, including their relations (which also means LinkedWorkItems). We identify the differences, delete or create links if needed, create a patch for the WorkItem, and send a PATCH request for it
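
As mentioned above, a minimal sketch of how the checksum could be integrated into the WorkItem class (the fields are simplified; the real class obviously has more to it):

```python
import dataclasses
import hashlib
import json
import typing as t


@dataclasses.dataclass
class WorkItem:
    """Simplified stand-in for the real WorkItem class."""

    title: str
    description: str
    additional_attributes: dict[str, t.Any] = dataclasses.field(default_factory=dict)

    @property
    def checksum(self) -> str:
        """Hash over the content, automatically covering any future custom
        fields that end up in additional_attributes."""
        content = {
            "title": self.title,
            "description": self.description,
            **self.additional_attributes,
        }
        return hashlib.sha256(
            json.dumps(content, sort_keys=True).encode("utf-8")
        ).hexdigest()


wi = WorkItem("Logical Component", "<p>...</p>", {"uuid_capella": "1234-abcd"})
print(wi.checksum)
```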

Splitting the job

As this would create a dependency between updating (delete/create) links and updating WorkItems, we would not be able to split this into multiple jobs anymore. But we could still split the jobs between elements and diagrams or something like that.

Retrying failed requests

Although reducing the amount of requests will already reduce the risk of failed requests during the job execution, I would like to implement something here, too. I don't know if I really like the idea of queuing failed requests to retry them later, as this would be much more complicated than just waiting a few seconds and retrying. Right now approximately 1 in 7000 requests seems to fail. If we are able to reduce the average number of requests to 100, we have a roughly 1-in-70 chance that a single request fails during the job execution. If we have a single retry after a failed request, I believe we should be all good, shouldn't we?
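
A quick sanity check of that estimate, assuming independent failures and the numbers from above:

```python
p_fail = 1 / 7000   # observed per-request failure rate
requests = 100      # target number of requests per job run

# Without retries: probability that at least one request in the job fails.
p_job_fails = 1 - (1 - p_fail) ** requests                 # ~0.014, i.e. roughly 1/70

# With a single retry: both attempts of the same request have to fail.
p_job_fails_with_retry = 1 - (1 - p_fail**2) ** requests   # ~2e-06

print(p_job_fails, p_job_fails_with_retry)
```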

With the current release, this isn't an issue any longer.