Post Mortem: How We Recovered Our Bricked Blockchain and Lessons for the Future

darkfriend77 commented a year ago

Incident Date: 2023-07-25 15:02:25

Resolved: 2023-07-26 11:30:25

Lead: Cédric Decoster

Summary

Our blockchain recently experienced a bricking incident caused by a runtime upgrade whose storage migration took too long to execute. In this post-mortem, we provide an in-depth analysis of the incident, how we resolved it, and the lessons learned to prevent such issues in the future.

What Happened?

In simple terms, our blockchain got stuck in a loop. The chain kept trying to complete a storage migration but failed each time, because the migration needed more time than the collators are allowed for producing a block. This left our blockchain "bricked" and unusable until we took corrective action.
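The loop can be illustrated with a small simulation. This is only a sketch and not Substrate code: the function name, the 500 ms budget, and the migration costs are made-up values chosen for illustration, not measurements from our chain.

```python
def produce_blocks(migration_ms: int, budget_ms: int, slots: int) -> list[str]:
    """Simulate a collator that must run a pending migration inside a block.

    The migration only counts as applied once a block containing it is
    produced within budget_ms. A discarded proposal leaves it pending, so
    the next slot starts the same work again.
    """
    outcomes = []
    migration_pending = True
    for slot in range(slots):
        cost_ms = migration_ms if migration_pending else 0
        if cost_ms > budget_ms:
            outcomes.append(f"slot {slot}: discarded (took too long)")
            continue  # nothing was committed; the migration is still pending
        migration_pending = False
        outcomes.append(f"slot {slot}: block produced")
    return outcomes


# Hypothetical numbers: a 2000 ms migration against a 500 ms budget never lands.
stuck = produce_blocks(migration_ms=2000, budget_ms=500, slots=3)
# A faster machine that runs the same migration in 400 ms gets through once.
recovered = produce_blocks(migration_ms=400, budget_ms=500, slots=3)
print(stuck)
print(recovered)
```

The point of the sketch is that retrying does not help: every attempt pays the full migration cost again, so the only ways out are a faster machine or different code.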

Timeline of Events

Tested Runtime Upgrade: First, on a solo node, using a cloned storage from production.
Tested on Rococo: Used try-runtime to test the upgrade.
Deployed on Rococo: Deployed the runtime upgrade with storage migration on Rococo.
Further Testing: Conducted additional tests.
Production Testing: Used try-runtime to test the upgrade with production storage.
Production Deployment: Rolled out the runtime upgrade and storage migration on the production chain.

ValidationFunctionApplied for runtime upgrade 0.1.20 at block:
BajunNetwork (Parachain): 2'600'440
Kusama (Relay): 18'943'684

Chain Bricked: The chain entered an unusable state; our collators could no longer produce blocks in time.

2023-07-25 14:29:24 [Parachain] Starting collation. relay_parent=0xd90fb54de6b3dbe9f9da31cc0ed0de4b6776448bff0da3917432fc798e439cb9 at=0x3c07058fcf3ecee9e822df4d8def197646e4d4c12795f7a25a26b182ee93055f
2023-07-25 14:29:24 [Parachain] 🙌 Starting consensus session on top of parent 0x3c07058fcf3ecee9e822df4d8def197646e4d4c12795f7a25a26b182ee93055f
2023-07-25 14:29:24 [Parachain] Updated GlobalConfig
2023-07-25 14:29:24 [Parachain] Migrated current season status
2023-07-25 14:29:24 [Parachain] Updated 991 accounts and 12859 avatars
2023-07-25 14:29:25 [Parachain] Updated 854 avatars in trade
2023-07-25 14:29:25 [Parachain] ⌛️ Discarding proposal for slot 140857947; block production took too long
2023-07-25 14:29:26 [Parachain] Updated 12859 old avatars
2023-07-25 14:29:26 [Parachain] Updated 4003 player account info entries
2023-07-25 14:29:26 [Parachain] Migrated seasons
2023-07-25 14:29:26 [Parachain] Upgraded storage to version StorageVersion(5)

Deep Analysis: Realized that the storage migration was too slow.
Unbricked the Chain: Utilized a powerful collator setup.
Resumed Block Production: The chain returned to normal operation.
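The collator log above already shows the problem if the timestamps are compared. The following sketch parses a few of those lines and measures when the proposal was discarded relative to the migration. It assumes that "Updated GlobalConfig" is the first migration step and "Upgraded storage to version" the last; the log only has one-second resolution, so the results are rough.

```python
from datetime import datetime

# Log lines copied from the collator output above (one-second resolution).
LOG = """\
2023-07-25 14:29:24 [Parachain] Updated GlobalConfig
2023-07-25 14:29:25 [Parachain] Updated 854 avatars in trade
2023-07-25 14:29:25 [Parachain] ⌛️ Discarding proposal for slot 140857947; block production took too long
2023-07-25 14:29:26 [Parachain] Upgraded storage to version StorageVersion(5)
"""


def parse(line):
    """Split a log line into (timestamp, message)."""
    stamp, message = line[:19], line[20:]
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S"), message


entries = [parse(line) for line in LOG.splitlines()]
first_step = entries[0][0]
discarded = next(t for t, m in entries if "Discarding proposal" in m)
finished = next(t for t, m in entries if "Upgraded storage" in m)

print("proposal discarded after", (discarded - first_step).total_seconds(), "s")
print("migration finished after", (finished - first_step).total_seconds(), "s")
```

The proposal is discarded while the migration is still running, which matches the diagnosis that the migration was too slow for the collators.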

Solutions and Workarounds

Possible Solutions:

Powerful Collator: Running the collator on a more powerful i9 CPU.
- Status: Successful
- Evaluation: Proved that the issue was computational and within a range we could resolve with more processing power, so that the block would pass the collators and then, hopefully, also the validators on the relay chain.
Governance Vote for codeSubstitute: This would revert the network to an older, stable version and would require initiating a Root Track.
- Status: Not Implemented
- Evaluation: Could take up to 14 days for governance approval. This is a long-term fix.

Resources & External Help:

We referenced a blog post by T3rn which provided valuable insights.
We also reached out to Parity for support, receiving exceptional assistance from Santiago & Daan.

Actions Taken

Collator Upgrade: Employed a collator with superior single-threaded performance to complete the storage migration.

Our normal collators:
2023-07-25 14:28:45 [Parachain] :gift: Prepared block for proposing at 2600441 (911 ms) [hash: 0x306e4b6184ef89d486f4cdb1af158c6b83b9466e6edd8f2c84666f67b99e74eb; parent_hash: 0x3c07…055f; extrinsics (2): [0xf8f7…c1e5, 0x92bf…81a3]]

Our local i9 machine collator:
ajuna-collator-1-1 | 2023-07-26 10:00:10 [Parachain] :gift: Prepared block for proposing at 2600441 (194 ms) [hash: 0xb0c410f751588f826ce9311d4821dea838a97b450d5cdd9dbc69dda513edc957; parent_hash: 0x3c07…055f; extrinsics (2): [0x239d…730b, 0

Review of Test Protocols: Overhauled our testing procedures.
Time Estimation: Developed tools to estimate storage migration times for future use.
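As a rough idea of what such an estimate can look like, the sketch below scales a migration time measured on a fast test machine to the slowest collator and compares it against a block production budget. The slowdown ratio comes from the two "Prepared block" timings logged above for block 2600441 (911 ms and 194 ms). The function name, the safety factor, and the 500 ms budget are assumptions made for illustration; this is not the tool we built.

```python
def fits_in_budget(measured_ms: float, slowdown: float, budget_ms: float,
                   safety_factor: float = 2.0) -> bool:
    """Return True if a migration measured on a test machine should fit
    within the block production budget on the slowest collator."""
    estimated_ms = measured_ms * slowdown * safety_factor
    return estimated_ms <= budget_ms


# Ratio of the two logged timings: normal collator vs. local i9 machine.
slowdown = 911 / 194
print(f"normal collators are about {slowdown:.1f}x slower")

# Hypothetical budget of 500 ms per block.
print(fits_in_budget(measured_ms=194, slowdown=slowdown, budget_ms=500))
print(fits_in_budget(measured_ms=50, slowdown=slowdown, budget_ms=500))
```

A safety factor above 1 is deliberate: production storage grows between the test run and the deployment, and collators differ in hardware.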

Root Cause Analysis

The underlying cause was the lengthy storage migration time needed during the runtime upgrade, which exceeded the capabilities of the existing collators.

Key Errors

Error Message: "Discarding proposal for slot; block production took too long."

Could This Have Been Prevented?

Absolutely. With more cautious time estimates and exhaustive testing under different conditions, the bricking could have been avoided.

Lessons Learned and Next Steps

Testing: Strengthen testing strategies for runtime upgrades and migrations.
Monitoring: Employ real-time monitoring tools to quickly identify anomalies.
Collaboration: Build stronger relationships with external organizations and experts.

Recommendations

Performance Testing: Adopt rigorous testing for all runtime upgrades, particularly focusing on storage migration time.
Revert Plan: Maintain a well-documented revert plan ready to be deployed.
Hardware Benchmark: Publish minimum hardware requirements for collators.
Monitoring and Alerts: Develop robust monitoring to quickly identify failed block productions.
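A minimal sketch of such an alert, assuming the collator logs are available as plain text lines: it looks for the error message quoted in this report and raises an alert once proposals for several distinct slots have been discarded. The function names and the default threshold are hypothetical.

```python
import re

# Pattern taken from the error message quoted in this report.
DISCARD = re.compile(r"Discarding proposal for slot (\d+); block production took too long")


def discarded_slots(lines):
    """Return the set of slot numbers whose proposals were discarded."""
    slots = set()
    for line in lines:
        match = DISCARD.search(line)
        if match:
            slots.add(int(match.group(1)))
    return slots


def should_alert(lines, threshold=3):
    """Alert once proposals for at least `threshold` distinct slots were discarded."""
    return len(discarded_slots(lines)) >= threshold


sample = [
    "2023-07-25 14:29:25 [Parachain] ⌛️ Discarding proposal for slot 140857947; block production took too long",
    "2023-07-25 14:29:26 [Parachain] Migrated seasons",
]
print(discarded_slots(sample))            # {140857947}
print(should_alert(sample, threshold=1))  # True
```

A single discarded proposal can happen under normal load, so a threshold above 1 avoids noisy alerts; the right value depends on the chain and is left open here.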

Conclusion

While the incident was unfortunate, it presented us with invaluable lessons and opportunities for significant process improvements. We are committed to ensuring the resilience and reliability of our blockchain moving forward.

Special thanks to Santiago & Daan from Parity and Christian from Integritee for their incredible support during this crisis.

If you have any questions or concerns, feel free to reach out to us. Thank you for your continued support and trust.
