aws/aws-fpga

fpga_mgmt_load_local_image_sync() failures

ljp0101 opened this issue · 12 comments

We've been experiencing AGFI loading errors starting two days ago. This has happened 4x (3x in one EKS cluster and 1x in a second), and we do ~30 scaling events per day, so the failure rate is roughly 5-10%. Our code hasn't changed recently, which suggests to me that a change on the AWS side might be causing this.

Error:

Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If

Code:

   // load bitfile
   {
      int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 1000, 30, &fpga_mgmt_image_info);
      if (result_check != 0) {
         b_error = true;
         error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
   }

My understanding of the function call is that it retries 1000 times with a 30 ms delay, i.e. for about 30 s. I will try wrapping fpga_mgmt_load_local_image_sync() with retry functionality within my drivers to mitigate, but 30 s is a timescale that should be well beyond any transient failure.
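
As a sanity check on that reading (a sketch; the parameter meanings are my inference from the SDK source, not documented names):

   // Assumed semantics of arguments 3 and 4:
   //   1000 = number of status polls before the call gives up
   //   30   = delay between polls, in milliseconds
   constexpr unsigned kPolls   = 1000;
   constexpr unsigned kDelayMs = 30;
   static_assert(kPolls * kDelayMs == 30000, "~30 s worst-case wait");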

Hello,
We are investigating the issue that you've reported.
In order to troubleshoot this issue further, we will need you to open a support case from your AWS account.
Thanks!

No support plan, so we can't create a ticket.

These are UTC times and IPs for instances where we had problems:

Jan 31 00:19:31.771 ip-12-63-201-12.ec2.internal
Jan 31 00:19:28.497 ip-12-63-14-64.ec2.internal
Jan 29 23:07:06.589 ip-12-63-11-159.ec2.internal

Jan 31 17:21:55.295 ip-11-63-1-53.ec2.internal
Jan 30 19:31:36.423 ip-11-63-11-107.ec2.internal

Happy to share account numbers if there's a private way to do so.

The easiest path forward for us to support you on this would be for you to open an AWS support case under the AWS account where you are experiencing this issue.

That will allow us to provide support privately.

If you can also provide

  • the affected instance IDs
  • the region(s) you noted the issue in
  • a rough timeline of when you started seeing the issue and when (if) it stopped occurring

that will assist us in identifying the issue you are reporting.

us-east-1, and all incidents & times are as per my prior post. I have no good way to correlate the historic IPs with instance IDs.

I supplemented our logging with instance IDs, so I will update if and when we see this again (but I also changed the retry strategy, so I may have quashed our problem).

Hello!

We have been investigating this issue. Were you able to resolve it with your new retry strategy, or have there been further failures? Is there anything else we can help with on this issue?

The bitfile loading problem was resolved with the code below. I may be seeing the same underlying problem manifest elsewhere, though, as I've also been having occasional communication problems with the FPGA starting around the same time. I'll close this issue and reopen if I can collect actionable data.

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;
      }
      if (i == 10) {   // note: never true (the loop runs i = 0..9); see the correction below
         b_error = true;
         error_message << "Fpga::Load: fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }

Checked my logs and I have an instance ID & time (UTC).

You'll note I'm an idiot and did bad error checking (the retry check should be against 9, not 10; a corrected version follows the code below). This was a lucky mistake, though, as I know the FPGA worked properly despite the error response from the load function.

The left timestamps are from Datadog, so not completely accurate; the right timestamps, where available, are from the system clock. The time spent in the bitfile loading function is 1708469173623 - 1708469153936 = 19.7 seconds, which is odd because I'd expect the elapsed time to be 10 attempts * 100 sub-attempts (function argument) * 30 ms wait (function argument) = 30 seconds.

Feb 20 22:46:03.015 [fpga-xxx] 1708469153936 InstanceId: i-0919988b3ae3b675d
Feb 20 22:46:03.015 [fpga-xxx] 1708469154006 State: BOOT
Feb 20 22:46:03.015 Loading fpga with bitfile named 'agfi-xxx'
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 1
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 2
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 3
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 4
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 5
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 6
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 7
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 8
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 9
Feb 20 22:46:14.022 WARNING: fpga_mgmt_load_local_image_sync() failed, attempt 10
Feb 20 22:46:14.022 [fpga-xxx] 1708469173623 State: LOAD_READY
Feb 20 22:46:16.022 [fpga-xxx] 1708469175447 Command: LOAD

   // load bitfile
   for (uint i = 0; i < 10; ++i) {
      const int result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;
      }
      std::cout << "WARNING: fpga_mgmt_load_local_image_sync() failed, attempt " << i + 1 << std::endl;
      if (i == 10) {   // note: never true (the loop runs i = 0..9); corrected version below
         b_error = true;
         error_message << "Fpga::Load | fpga_mgmt_load_local_image_sync() failed; ";
         error_message << "AWS error code: " << result_check << "; ";
         error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
         return;
      }
      usleep(10000);
   }
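
For the record, a corrected version of the loop (same parameters as above), reporting the error only after the final failed attempt:

   // load bitfile, retrying up to 10 times; report the error once all attempts fail
   int result_check = -1;
   for (uint i = 0; i < 10; ++i) {
      result_check = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (result_check == 0) {
         break;
      }
      std::cout << "WARNING: fpga_mgmt_load_local_image_sync() failed, attempt " << i + 1 << std::endl;
      usleep(10000);   // 10 ms between attempts
   }
   if (result_check != 0) {
      b_error = true;
      error_message << "Fpga::Load | fpga_mgmt_load_local_image_sync() failed; ";
      error_message << "AWS error code: " << result_check << "; ";
      error_message << "AWS error message: " << fpga_mgmt_strerror_long(result_check);
      return;
   }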

Hi,

I'm glad to hear that the FPGA functioned properly in this case. I'm wondering what the returned status is when the load command does not report as loaded in your retry loop. I noticed that the function might skip its built-in retry and retry delay if the status is neither loaded nor busy (assuming the function you're calling comes from the fpga_mgmt library):

   /** Wait until the status is "loaded" or timeout */
   while (!done) {
      ret = fpga_mgmt_describe_cmd(opt->slot_id, &tmp_info, 0); /** flags==0 */
      status = (ret == 0) ? tmp_info.status : FPGA_STATUS_END;
      if (status == FPGA_STATUS_LOADED) {
         /** Sanity check the afi_id */
         ret = (strncmp(opt->afi_id, tmp_info.ids.afi_id, sizeof(tmp_info.ids.afi_id))) ?
            FPGA_ERR_FAIL : 0;
         fail_on(ret, out, "AFI ID mismatch: requested afi_id=%s, loaded afi_id=%s",
            opt->afi_id, tmp_info.ids.afi_id);
         done = true;
      } else if (status == FPGA_STATUS_BUSY) {
         fail_on(ret = (retries >= timeout_tmp) ? -ETIMEDOUT : 0, out,
            "fpga_mgmt_describe_local_image timed out, status=%s(%d), retries=%u",
            FPGA_STATUS2STR(status), status, retries);
         retries++;
         msleep(delay_msec_tmp);
      } else {
         /**
          * Catch error status cases here.
          * - the caller can then display the error status and cause upon return.
          */
         ret = tmp_info.status_q;
         goto out;
      }
   }
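If it helps, here's a minimal sketch for capturing that status after a failed load (assuming fpga_mgmt_init() has already been called; fpga_mgmt_describe_local_image() and the info fields are from the SDK's fpga_mgmt.h):

   #include <cstdio>
   #include "fpga_mgmt.h"

   // After a failed fpga_mgmt_load_local_image_sync(), describe the slot so the
   // status and queued status can be logged alongside the error code.
   void dump_slot_status(int slot_id) {
      struct fpga_mgmt_image_info info = {};
      int rc = fpga_mgmt_describe_local_image(slot_id, &info, 0 /* flags */);
      if (rc != 0) {
         std::printf("describe failed: %d (%s)\n", rc, fpga_mgmt_strerror(rc));
         return;
      }
      std::printf("status=%d status_q=%d afi_id=%s\n",
                  info.status, info.status_q, info.ids.afi_id);
   }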

I initially observed -110, which is the negated errno value for a connection timeout (ETIMEDOUT).

Fpga::Load: fpga_mgmt_load_local_image_sync() failed; AWS error code: -110; AWS error message: A time out error is usually spurious and the request can be retried. If

Thanks, I think that's the reason the observed retry time didn't match the calculation. But a timeout can be ignored and the request should be retried; see the sketch below. Please let me know if you have any other questions.
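
In code terms, reusing the variables from your snippet (a sketch only):

   #include <cerrno>   // ETIMEDOUT == 110 on Linux, so the -110 you saw is -ETIMEDOUT

   // Retry only the spurious timeout; any other nonzero code is a real error
   // (its cause is queued in fpga_mgmt_image_info.status_q, per the code above).
   int rc = -ETIMEDOUT;
   for (unsigned i = 0; i < 10 && rc == -ETIMEDOUT; ++i) {
      rc = fpga_mgmt_load_local_image_sync(0, const_cast<char*>(bitfile_name.c_str()), 100, 30, &fpga_mgmt_image_info);
      if (rc == -ETIMEDOUT) {
         usleep(10000);   // brief pause before the next attempt
      }
   }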

I did retry the fpga_mgmt function call 10 times with 10 ms between calls and it kept failing, so it doesn't look like I can handle the error in any way besides terminating the instance and starting fresh.

I'll close the issue again because terminating the instance is good enough for my use case as long as the incident rate stays at a few times per week. But it seems to me there's a more fundamental issue to be addressed, as the underlying connection timeout error doesn't make much sense to me when nothing else is happening except loading the bitfile.