skalenetwork/IMA

Investigate and fix wrong S-Chain discovery status problem

sergiy-skalelabs opened this issue · 1 comments

We detected a logically inconsistent S-Chain discovery result. First, the log reported that all 16 of 16 S-Chain nodes had been completely discovered:

2023-09-18 15:26:06.072: S-Chain network discovery: Have S-Chain description response about 16 of 16 node(s).
2023-09-18 15:26:06.072: S-Chain network discovery: This S-Chain discovery will finish with 16 of 16 node(s) discovered.

But later the log showed that at least one S-Chain node had been discovered only partially, or not at all:

    2023-09-19 13:11:12.116: CRITICAL ERROR: BLS 1/16 public key discovery failed for node #10, node data is: {"httpRpcPort":10131,"httpRpcPort6":0,"httpsRpcPort":10136,"httpsRpcPort6":0,"ip":"34.217.246.35","ip6":"","nodeID":35,"schainIndex":11,"wsRpcPort":10130,"wsRpcPort6":0,"wssRpcPort":10135,"wssRpcPort6":0,"pwaState":{"oracle":{"isInProgress":false,"ts":0},"m2s":{"isInProgress":false,"ts":0},"s2m":{"isInProgress":false,"ts":0},"s2s":{"mapS2S":{"0":{"isInProgress":false,"ts":0}}}}}
    2023-09-19 13:11:12.116: RAW/BLS/#10: CRITICAL ERROR: BLS node #10 verify error: error description is: BLS 1/16 public key discovery failed for node #10, stack is: 
Error: BLS 1/16 public key discovery failed for node #10
    --> discoverPublicKeyByIndex (/ima/agent/bls.mjs:166:15)
    --> Module.doVerifyReadyHash (/ima/agent/bls.mjs:2503:29)
    --> Module.handleLoopStateArrived (/ima/agent/pwa.mjs:229:26)
    --> ObserverServer.self.mapApiHandlers.skale_imaNotifyLoopWork (/ima/agent/loopWorker.mjs:210:21)
    --> InWorkerServerPipe._onPipeMessage (/ima/npms/skale-cool-socket/socketServer.mjs:90:73)
    --> InWorkerServerPipe.dispatchEvent (/ima/npms/skale-cool-socket/eventDispatcher.mjs:105:22)
    --> InWorkerServerPipe.implReceive (/ima/npms/skale-cool-socket/socket.mjs:287:14)
    --> InWorkerServerPipe.receive (/ima/npms/skale-cool-socket/socket.mjs:324:14)
    --> InWorkerSocketServerAcceptor.receiveForClientPort (/ima/npms/skale-cool-socket/socket.mjs:581:14)
    --> Object.onMessage (/ima/npms/skale-cool-socket/socket.mjs:444:29)
    2023-09-19 13:11:12.116: RAW/BLS/#10: CRITICAL ERROR: BLS node #10 verify output is:

These two log messages contradict each other and describe a situation that must never happen.
So S-Chain discovery results may be saved, or treated, as successful when they are not. This means the S-Chain discovery code must validate the S-Chain node description JSONs returned by skale_imaInfo calls to skaled more strictly, and must also ensure that waiting for S-Chain discovery to complete does not end until discovery has really finished.
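
A minimal sketch of the stricter validation described above, not the actual IMA agent code. It assumes that a fully discovered node carries an `imaInfo` object with non-empty `BLSPublicKey0` .. `BLSPublicKey3` strings; these field names, and the helper names `isNodeFullyDiscovered`, `countDiscoveredNodes` and `isDiscoveryComplete`, are assumptions made for illustration. The node data in the failing log above has no such object, so a check like this would not have counted that node as discovered.

```javascript
// ASSUMPTION: field names expected inside node.imaInfo (not confirmed by this issue).
const REQUIRED_KEY_FIELDS = [
    "BLSPublicKey0", "BLSPublicKey1", "BLSPublicKey2", "BLSPublicKey3"
];

// Hypothetical helper: true only if the node description carries
// every required BLS public key part as a non-empty string.
function isNodeFullyDiscovered(node) {
    if (!node || typeof node !== "object")
        return false;
    const imaInfo = node.imaInfo;
    if (!imaInfo || typeof imaInfo !== "object")
        return false;
    return REQUIRED_KEY_FIELDS.every(
        (name) => typeof imaInfo[name] === "string" && imaInfo[name].length > 0
    );
}

// Hypothetical helper: count nodes that pass the strict check,
// rather than nodes that merely returned some response.
function countDiscoveredNodes(nodes) {
    return nodes.filter(isNodeFullyDiscovered).length;
}

// Hypothetical helper: discovery is complete only when every node passes.
function isDiscoveryComplete(nodes) {
    return nodes.length > 0 && countDiscoveredNodes(nodes) === nodes.length;
}

// Node shaped like the one in the failing log: network data only, no imaInfo.
const partialNode = { ip: "34.217.246.35", nodeID: 35, schainIndex: 11 };

// Same node with placeholder key parts (values are made up for illustration).
const completeNode = {
    ...partialNode,
    imaInfo: {
        BLSPublicKey0: "1", BLSPublicKey1: "2",
        BLSPublicKey2: "3", BLSPublicKey3: "4"
    }
};
```

With a check of this kind, the "N of M node(s) discovered" count and any code waiting for discovery to finish would both depend on `isDiscoveryComplete` instead of on the number of responses received.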

Can't reproduce