Intel-BMC/openbmc

Redfish code update sometimes failed on uploading tarball

Closed this issue · 5 comments

Unexpected behavior you saw
When running Redfish code update, sometimes the upload failed.

Expected behavior
Redfish code update should "almost always" succeed.

To Reproduce
Precondition:

  • The BMC is running "intel" branch, revision ca9404e.
  • Both eth0 and eth1 are connected in LAN, each with ip 192.168.1.100, 192.168.1.101, and both work fine.
    Steps to reproduce:
    Login and run Redfish code update, and see below error.
$ curl -v -k -H "X-Auth-Token: $token" -H "Content-Type: application/octet-stream" -X POST -T intel-platforms-image-update-intel-ast2500-20200603070047.tar  https://${bmc}/redfish/v1/UpdateS
ervice
* Expire in 0 ms for 6 (transfer 0xfef880)
* Expire in 1 ms for 1 (transfer 0xfef880)
* Expire in 0 ms for 1 (transfer 0xfef880)
...
*   Trying 192.168.1.100...
* TCP_NODELAY set
* Expire in 200 ms for 4 (transfer 0xfef880)
* Connected to intel-bmc0 (192.168.1.100) port 443 (#0)
* ALPN, offering h2
* ALPN, offering http/1.1
* successfully set certificate verify locations:
*   CAfile: none
  CApath: /etc/ssl/certs
* TLSv1.3 (OUT), TLS handshake, Client hello (1):
* TLSv1.3 (IN), TLS handshake, Server hello (2):
* TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
* TLSv1.3 (IN), TLS handshake, Certificate (11):
* TLSv1.3 (IN), TLS handshake, CERT verify (15):
* TLSv1.3 (IN), TLS handshake, Finished (20):
* TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
* TLSv1.3 (OUT), TLS handshake, Finished (20):
* SSL connection using TLSv1.3 / TLS_AES_256_GCM_SHA384
* ALPN, server did not agree to a protocol
* Server certificate:
*  subject: C=US; O=OpenBMC; CN=testhost
*  start date: May 14 09:32:41 2020 GMT
*  expire date: May 12 09:32:41 2030 GMT
*  issuer: C=US; O=OpenBMC; CN=testhost
*  SSL certificate verify result: self signed certificate (18), continuing anyway.
> POST /redfish/v1/UpdateService HTTP/1.1
> Host: intel-bmc0
> User-Agent: curl/7.64.0
> Accept: */*
> X-Auth-Token: s55QYWKQ9eVM4TSEwDa3
> Content-Type: application/octet-stream
> Content-Length: 20172800
> Expect: 100-continue
>
* Expire in 1000 ms for 0 (transfer 0xfef880)
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* TLSv1.3 (IN), TLS handshake, Newsession Ticket (4):
* old SSL session ID is stale, removing
* Done waiting for 100-continue
* SSL_write() returned SYSCALL, errno = 32
* Closing connection 0
curl: (55) SSL_write() returned SYSCALL, errno = 32

Additional context

  • Note that it is not 100% reproducible, but when it occurs, try the curl command again, it will give the same error.
  • When the issue occurs, the BMC kernel produces below error
[ 4870.243320] ftgmac100 1e660000.ethernet eth1: NCSI Channel 0 timed out!
[ 4873.043266] ------------[ cut here ]------------
[ 4873.048025] WARNING: CPU: 0 PID: 0 at net/sched/sch_generic.c:447 dev_watchdog+0x268/0x288
[ 4873.056450] NETDEV WATCHDOG: eth1 (ftgmac100): transmit queue 0 timed out
[ 4873.063359] CPU: 0 PID: 0 Comm: swapper Not tainted 5.4.32-2cf4956-dirty-a6b4544 #1
[ 4873.071029] Hardware name: Generic DT based system
[ 4873.075907] Backtrace:
[ 4873.078429] [<80107ddc>] (dump_backtrace) from [<80108010>] (show_stack+0x20/0x24)
[ 4873.086122]  r7:000001bf r6:806720a0 r5:00000009 r4:8098790c
[ 4873.091823] [<80107ff0>] (show_stack) from [<80776fc0>] (dump_stack+0x20/0x28)
[ 4873.099174] [<80776fa0>] (dump_stack) from [<801166e4>] (__warn+0xcc/0x100)
[ 4873.106268] [<80116618>] (__warn) from [<80116b00>] (warn_slowpath_fmt+0x9c/0xc4)
[ 4873.113837]  r7:806720a0 r6:000001bf r5:8098790c r4:809878d0
[ 4873.119570] [<80116a68>] (warn_slowpath_fmt) from [<806720a0>] (dev_watchdog+0x268/0x288)
[ 4873.127882]  r8:9e4c7800 r7:80b13120 r6:00000000 r5:9e49a700 r4:9e4c7a30
[ 4873.134704] [<80671e38>] (dev_watchdog) from [<8015e704>] (call_timer_fn+0x40/0x128)
[ 4873.142501]  r8:0006fa58 r7:00000100 r6:80671e38 r5:80b01dfc r4:9e4c7a30
[ 4873.149330] [<8015e6c4>] (call_timer_fn) from [<8015eb14>] (run_timer_softirq+0x1a0/0x4a0)
[ 4873.157712]  r8:0006fa58 r7:00000000 r6:9e4c7a30 r5:80b01dfc r4:80b13e40
[ 4873.164517] [<8015e974>] (run_timer_softirq) from [<801022b4>] (__do_softirq+0xdc/0x318)
[ 4873.172664]  r10:00000002 r9:80b00000 r8:00000100 r7:80b00000 r6:00000001 r5:00000002
[ 4873.180689]  r4:80b76d94
[ 4873.183349] [<801021d8>] (__do_softirq) from [<8011ad18>] (irq_exit+0xd0/0xe4)
[ 4873.190618]  r10:00c5387d r9:80b00000 r8:00000000 r7:9d417600 r6:00000001 r5:80b5745c
[ 4873.198533]  r4:00000000
[ 4873.201128] [<8011ac48>] (irq_exit) from [<80151090>] (__handle_domain_irq+0x60/0xb4)
[ 4873.209065] [<80151030>] (__handle_domain_irq) from [<801021cc>] (avic_handle_irq+0x68/0x70)
[ 4873.217626]  r9:80b00000 r8:80a39a40 r7:80b01f1c r6:ffffffff r5:80b01ee8 r4:9d402c60
[ 4873.225480] [<80102164>] (avic_handle_irq) from [<80101a6c>] (__irq_svc+0x6c/0x90)
[ 4873.233142] Exception stack(0x80b01ee8 to 0x80b01f30)
[ 4873.238319] 1ee0:                   00000000 00000000 00000000 00000000 80b00000 80b03260
[ 4873.246626] 1f00: 80b5a4a4 808fde60 80a39a40 410fb767 00c5387d 80b01f44 80b01f48 80b01f38
[ 4873.254899] 1f20: 80103c50 80103c54 60000013 ffffffff
[ 4873.259970]  r5:60000013 r4:80103c54
[ 4873.263675] [<80103c1c>] (arch_cpu_idle) from [<80793860>] (default_idle_call+0x30/0x3c)
[ 4873.271805] [<80793830>] (default_idle_call) from [<801418f4>] (do_idle+0x90/0xf4)
[ 4873.279526] [<80141864>] (do_idle) from [<80141c20>] (cpu_startup_entry+0x1c/0x20)
[ 4873.287216]  r7:00000000 r6:80b03200 r5:00000001 r4:80b0d194
[ 4873.292923] [<80141c04>] (cpu_startup_entry) from [<8078e58c>] (rest_init+0x7c/0x9c)
[ 4873.300799] [<8078e510>] (rest_init) from [<80a00d30>] (arch_call_rest_init+0x18/0x1c)
[ 4873.308826]  r5:00000001 r4:80b766f4
[ 4873.312437] [<80a00d18>] (arch_call_rest_init) from [<80a011d4>] (start_kernel+0x428/0x4f4)
[ 4873.320897] [<80a00dac>] (start_kernel) from [<00000000>] (0x0)
[ 4873.326922] ---[ end trace 09684eb6ddf2838a ]---
[ 4873.396835] ftgmac100 1e660000.ethernet eth1: NCSI: No channel found to configure!

Not sure if it's related, as the weired part is that I am using eth0 in this case, while the above error seems to be related to eth1.

@leiyu-bytedance Just want to know how easy to reproduce this issue? 1 fail in 10 times?
Any abnormal journal logs on the BMC when this issue occurs?

@leiyu-bytedance Just want to know how easy to reproduce this issue? 1 fail in 10 times?
Any abnormal journal logs on the BMC when this issue occurs?

I just tried a few code updates and hit the issue twice, so maybe like 2/5.

There is one thing that is confusing:

  • It was using eth0's IP (192.168.1.100) to do the code update,
  • When the issue occurs, kernel reports the above issue on eth1, and the network is not stable. (e.g. the code update fails, and it's very slow to open a page on WebUI from the eth0's IP)
  • If I change to use eth1's IP (192.168.1.101), the code update becomes OK, as well as the WebUI.

@leiyu-bytedance Just want to know how easy to reproduce this issue? 1 fail in 10 times?
Any abnormal journal logs on the BMC when this issue occurs?

I just tried a few code updates and hit the issue twice, so maybe like 2/5.

There is one thing that is confusing:

  • It was using eth0's IP (192.168.1.100) to do the code update,
  • When the issue occurs, kernel reports the above issue on eth1, and the network is not stable. (e.g. the code update fails, and it's very slow to open a page on WebUI from the eth0's IP)
  • If I change to use eth1's IP (192.168.1.101), the code update becomes OK, as well as the WebUI.

It seems that this is a networking issue, instead of the fw update issue(flash erase/write). Suggest to unplug the eth1 network cable, only connected to eth0(it is the BMC dedicated port), then re-test it to narrow down this issue.

It seems that this is a networking issue, instead of the fw update issue(flash erase/write). Suggest to unplug the eth1 network cable, only connected to eth0(it is the BMC dedicated port), then re-test it to narrow down this issue.

Yup, it's more like a networking issue.
It may only be triggered by the above case (both eth0 and eth1 are connected), and doing redfish code update may make it easier to reproduce because it transfers large data.

It's not reproduced for a while, but I do not which revision fixes the issue.
Close this anyway.