mavlink/mavlink-devguide

Mission upload/download clarifications

julianoes opened this issue · 11 comments

@hamishwillee I was staring quite a while at these diagrams here today:
https://mavlink.io/en/services/mission.html#uploading_mission

And I have some questions that you might have thoughts on:

  1. We describe the timeouts on the ground station side. What about the drone side? As far as I know from the implementation, we also have timeouts active on that side. Do we just omit these from the diagram for clarity, or does the spec actually say that the autopilot should not have to keep track of timeouts? (I think it probably should, otherwise it would not even know whether something failed or timed out.)
  2. What is supposed to happen if the last MISSION_ACK is lost on the link? Should we re-send the last MISSION_ITEM_INT until we get the ack? Or is a lost ack just the end of the world, and we have to redo everything?

Hi @julianoes

  1. We describe the timeouts on the ground station side. What about the drone side? As far as I know from the implementation, we also have timeouts active on that side. Do we just omit these from the diagram for clarity, or does the spec actually say that the autopilot should not have to keep track of timeouts? (I think it probably should, otherwise it would not even know whether something failed or timed out.)

Yes we should, and if @LorenzMeier reviews #157 we will - this is the PR I asked him to look at in the last few dev meetings. You can see the updated docs for this rendered in https://hamishwillee.gitbooks.io/ham_mavdevguide/content/v/mission_ap/en/services/mission.html#uploading_mission

  2. What is supposed to happen if the last MISSION_ACK is lost on the link? Should we re-send the last MISSION_ITEM_INT until we get the ack? Or is a lost ack just the end of the world, and we have to redo everything?

What should happen (logically) is that the GCS waits for some timeout and then gives up, resetting itself.

Drone side the system knows it is complete and resets after sending the ACK. The GCS definitely won't send another MISSION_ITEM_INT unless requested (this is already a response, not something requiring ACK).

GCS side is not documented. Looking at the QGC code, the system will eventually time out: https://github.com/mavlink/qgroundcontrol/blob/master/src/MissionManager/PlanManager.cc#L203
What confuses me though is that the transaction is not cancelled, so I think the GCS remains in the state of "in theory" waiting for the next message.

So there won't be any more messages or errors sent by either side other than the timeout, but I do wonder if QGC still tells the user it is uploading items.

@DonLakeFlyer can you clarify this point - how QGC will return to a good state following mission upload if it does not receive the final ack?

Interesting, this would mean I need to change the SDK implementation yet again.

So, I summarize this: we're out of luck if we don't get the ack and have to report a timeout.

And thanks @hamishwillee !

So, I summarize this: we're out of luck if we don't get the ack and have to report a timeout.

Yes. Note though that at this point you know you've sent the last item, and if you don't get either the ack or a re-request for the last item in a reasonable timeframe you should be able to reset the state to "idle/mission upload complete". Ie I am sure that QGC must be in idle, I just don't know how it gets there.

Note though that at this point you know you've sent the last item, and if you don't get either the ack or a re-request for the last item in a reasonable timeframe you should be able to reset the state to "idle/mission upload complete".

In that case the ground station has no idea of the state of the mission on the vehicle. If QGC doesn't get an ACK after multiple retries of pushing the item up, it fails the upload with an error. It doesn't matter if that happens on the first or the last item. Same result. You can't recover from that case.

@DonLakeFlyer - The code doesn't appear to show QGC resending the final MISSION_ITEM_INT if it doesn't get the MISSION_ACK - am I missing something? I can see that the vehicle would resend the final MISSION_ACK if QGC did send a MISSION_ITEM_INT after the drone returned to idle (ie it assumes the ACK was lost).

I guess that doesn't change your point.

@julianoes Don makes a good point - until you get that MISSION_ACK you don't know the state of the upload. So the correct thing to do would be to resend the last MISSION_ITEM_INT - the vehicle will respond with that ACK if it is in idle state. If you still don't get an ACK after ending the retry cycle you fail the upload.
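A minimal sketch of that retry cycle, for illustration only. The function name, the two callbacks, the retry count, the timeout and the "ACCEPTED" result string are all hypothetical stand-ins, not QGC or MAVSDK code and not a MAVLink library API.

```python
from typing import Callable, Optional


def send_last_item_until_acked(
    send_item: Callable[[], None],
    wait_for_ack: Callable[[float], Optional[str]],
    retries: int = 3,        # assumed value, not taken from any spec
    timeout_s: float = 1.0,  # assumed value, not taken from any spec
) -> bool:
    """Send the final item, then resend it until an ACK arrives or retries run out.

    Returns True only if an accepting ACK was received; False means the
    caller should fail the upload.
    """
    for _ in range(1 + retries):
        send_item()
        result = wait_for_ack(timeout_s)  # None means the wait timed out
        if result is not None:
            return result == "ACCEPTED"
    return False


# Usage with a fake link that loses the first two ACKs.
sent = []
replies = iter([None, None, "ACCEPTED"])
ok = send_last_item_until_acked(
    send_item=lambda: sent.append("MISSION_ITEM_INT"),
    wait_for_ack=lambda timeout_s: next(replies),
)
```

Here the item goes out three times before the ACK gets through, and `ok` ends up `True`.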

The code doesn't appear to show QGC resending the final MISSION_ITEM_INT again if it doesn't get the MISSION_ACK - am I missing something?

You're correct, I'm wrong. QGC will just fail at this point. I'd have to think a bit as to whether it's feasible or makes sense to resend.

@DonLakeFlyer

I'd have to think a bit as to whether it's feasible or makes sense to resend.

Feasible, but possibly pointless. That is because you won't know the status of the upload either way - ie whether the PX4 side completed or errored out it will move to IDLE state, and if it gets a resent MISSION_ITEM_INT then it will return the MISSION_ACK with "success".

I guess if you sent MISSION_ITEM_INT before the drone could time out due to lack of connection you would be able to assume it was back in idle because the mission completed properly. Feels flaky, but better than resending the whole mission.

This implies that if knowledge of the drone side from the GCS is important then a full reupload is required (as Julian indicated). You might be able to modify the protocol, but it can't be done otherwise.

Either way you definitely do need some timeout there to exit the state where you're just waiting forever.

The problem here is that the design is driven by the drone - and only cares about robustness of the drone side. This is the mirror of the download process, where the GCS is robust and the drone doesn't particularly track the download (it does have a protocol-level timeout to eventually put itself back to idle if messages aren't received). But here you want to know for sure, GCS side, that things worked.

Ie What I think PX4 thinks the protocol looked like is this:

(image not preserved)

The drone side is robust - it queries until it gets what it needs or gives up.
QGC should only resend on request. If it doesn't get a request midway through, it should eventually time out due to a broken connection or unknown error.

If it gets to the last message and doesn't get a MISSION_ACK or another mission item request, then it probably should fail after a timeout.

@meee1 @WickedShell @DonLakeFlyer We discussed GCS-side mission upload in the devcall yesterday.

For those not in the loop on this discussion, the mission upload protocol is as below:

The idea is that if something makes a request it waits on a response. The process is robust for the drone - it always knows whether it has a complete mission or not. The problem is that it isn't robust for the GCS.

  • what should the GCS do if it doesn't get the MISSION_ACK?
  • what if it doesn't get a further request for a mission item and is left in the mission upload state (e.g. due to the vehicle flying out of range during upload)?

Currently:

  • QGC has a timeout on every mission item, waiting for the next request. It just times out if it doesn't get the final ACK.
  • ArduPilot sends the MISSION_ACK multiple times - the GCS is expected to ignore repeats of the message and, if it doesn't get any, to assume that the upload was invalid.
  • PX4 will resend the MISSION_ACK if it gets a MISSION_ITEM when in idle (it assumes that the GCS has not got the MISSION_ACK and is re-requesting it).
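The PX4 behaviour in the last bullet can be sketched as a toy receiver. This is not PX4 code: the class, its method names and the tuple-shaped replies are hypothetical, and message names are plain strings rather than a MAVLink library API.

```python
class VehicleReceiver:
    """Toy vehicle-side receiver that drives the upload by requesting items."""

    def __init__(self):
        self.state = "IDLE"
        self.count = 0
        self.expected = 0
        self.items = []

    def on_count(self, count):
        """Handle MISSION_COUNT: start receiving and request the first item."""
        self.count, self.expected, self.items = count, 0, []
        if count == 0:
            # Nothing to receive, so acknowledge straight away.
            self.state = "IDLE"
            return ("MISSION_ACK", "ACCEPTED")
        self.state = "RECEIVING"
        return ("MISSION_REQUEST_INT", 0)

    def on_item(self, seq, item):
        """Handle MISSION_ITEM_INT and return the reply to send."""
        if self.state == "IDLE":
            # Assume the GCS missed the final ACK and repeat it.
            return ("MISSION_ACK", "ACCEPTED")
        if seq != self.expected:
            # Unexpected item: ask again for the one that is needed.
            return ("MISSION_REQUEST_INT", self.expected)
        self.items.append(item)
        self.expected += 1
        if self.expected == self.count:
            self.state = "IDLE"
            return ("MISSION_ACK", "ACCEPTED")
        return ("MISSION_REQUEST_INT", self.expected)


vehicle = VehicleReceiver()
vehicle.on_count(2)
vehicle.on_item(0, "wp0")
first_ack = vehicle.on_item(1, "wp1")
repeat_ack = vehicle.on_item(1, "wp1")  # GCS resends because it missed the ACK
```

Note that the idle branch answers "ACCEPTED" no matter why the receiver went back to idle, which is the weakness raised earlier in the thread: a repeated ACK on its own does not prove the mission was stored.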

Do you guys have any thoughts on the best GCS-side process?

My leaning is that:

  • GCS should have timeout on last item and resend MISSION_ITEM if it does not get ACK. If it still doesn't get ack it assumes mission upload failed.
  • GCS probably should have a timeout on the next mission item request or ACK being received, but it should be very long and on fail should reset to IDLE. If it gets a request for a mission item while in idle/outside of upload it should return MISSION_ACK with invalid request.
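The two bullets above could look roughly like this on the GCS side. It is only a sketch of the proposal, not existing QGC behaviour. The class, the timeout values, the retry count and the "INVALID_REQUEST" result string are hypothetical, and the caller is assumed to pass in the current time.

```python
class GcsUploader:
    """Toy GCS-side upload state machine; the caller supplies timestamps."""

    def __init__(self, items, item_timeout_s=1.0, idle_reset_s=30.0, last_item_retries=3):
        self.items = items
        self.item_timeout_s = item_timeout_s  # short wait for the final ACK (assumed)
        self.idle_reset_s = idle_reset_s      # very long mid-upload wait (assumed)
        self.retries_left = last_item_retries
        self.state = "IDLE"
        self.outcome = None
        self.last_sent = None
        self.last_activity = 0.0

    def start(self, now):
        self.state, self.outcome, self.last_sent = "UPLOADING", None, None
        self.last_activity = now
        return ("MISSION_COUNT", len(self.items))

    def on_request(self, seq, now):
        # Requests outside an upload get an "invalid request" ACK.
        # Assumption: out-of-range sequence numbers are treated the same way.
        if self.state != "UPLOADING" or not 0 <= seq < len(self.items):
            return ("MISSION_ACK", "INVALID_REQUEST")
        self.last_sent, self.last_activity = seq, now
        return ("MISSION_ITEM_INT", seq)

    def on_ack(self, result, now):
        if self.state != "UPLOADING":
            return  # ignore repeated ACKs
        self.state = "IDLE"
        self.outcome = "success" if result == "ACCEPTED" else "failed"

    def on_tick(self, now):
        """Call periodically; returns a message to resend, or None."""
        if self.state != "UPLOADING":
            return None
        waited = now - self.last_activity
        if self.last_sent == len(self.items) - 1:
            if waited >= self.item_timeout_s:
                if self.retries_left > 0:
                    self.retries_left -= 1
                    self.last_activity = now
                    return ("MISSION_ITEM_INT", self.last_sent)
                self.state, self.outcome = "IDLE", "failed"
        elif waited >= self.idle_reset_s:
            self.state, self.outcome = "IDLE", "failed"
        return None
```

Only the last item is ever resent unprompted. Every other resend still happens in response to a request from the vehicle, which keeps the vehicle in charge of the transfer.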

@hamishwillee I assume we can close this.

There was no answer, so you as the questioner get to decide.

I guess in real life it isn't such a problem or this would have been answered - i.e. most of the time you'll get the ACK, and if you didn't the link is probably so bad the problem gets detected earlier with other timeouts.

FWIW the ack might now have an opaque id so the GCS really needs the ACK or it might then have to re-upload the whole mission. We could allow the Mission to assume mission upload success if the opaque id changes within two seconds (say) of upload and an ACK is missed?
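For concreteness, the check suggested above might be sketched as follows. Everything here is hypothetical: the function name, the `(timestamp, opaque_id)` observations and the two-second default. How the GCS gets to observe the id is left open.

```python
def upload_probably_succeeded(id_before, id_observations, last_item_sent_at, window_s=2.0):
    """Guess upload success from an opaque id change when the ACK was missed.

    id_observations is an iterable of (timestamp, opaque_id) pairs seen by
    the GCS. The id is opaque, so a change only shows that *some* mission
    was stored, not that it matches what was uploaded.
    """
    deadline = last_item_sent_at + window_s
    return any(
        last_item_sent_at <= t <= deadline and opaque_id != id_before
        for t, opaque_id in id_observations
    )


changed_in_time = upload_probably_succeeded(7, [(10.5, 7), (11.0, 9)], last_item_sent_at=10.0)
changed_too_late = upload_probably_succeeded(7, [(13.0, 9)], last_item_sent_at=10.0)
```

With these inputs `changed_in_time` is `True` and `changed_too_late` is `False`.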

Right, I think a full mission upload is just invalid if the ack happens to get missed, bad luck. I wouldn't make that assumption, at least not via the protocol, it sounds a bit too complicated.