On October 31, 2024, at 4:49 PM PDT, the ZetaChain mainnet experienced a network halt that lasted for approximately 6 hours and 30 minutes. The outage was caused by a consensus failure following a partial deployment of version 20.0.6 of the node software, which contained a new supply verification check that proved to be consensus-breaking despite the initial assessment that it was safe to deploy. The network was successfully restored by rolling back the node software to version 20.0.5 and coordinating with validators who had installed v20.0.6 to resync their nodes.
Complete halt of mainnet block production
Temporary suspension of all cross-chain transactions
Network downtime of 6 hours and 30 minutes
The immediate technical cause was a consensus failure triggered by a new supply verification check introduced in v20.0.6. This feature, intended to add an extra layer of verification for ZETA deposits into ZetaChain, unexpectedly modified the gas calculations for those deposits, leading to consensus disagreement among validators.
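To make the failure mode concrete, here is a minimal, self-contained Go sketch of how an extra state read introduced in one version can change the gas charged for the same transaction, so validators on different versions commit different results for the same block and consensus stalls. This is not ZetaChain's actual code: the handler shape, the function names, and the flat and per-byte read costs are all illustrative.

```go
package main

import "fmt"

// gasMeter mimics, very roughly, how a Cosmos SDK KVStore charges gas for
// reads: a flat cost per read plus a per-byte cost on the value returned.
// The numbers here are illustrative, not ZetaChain's real parameters.
type gasMeter struct{ consumed uint64 }

func (g *gasMeter) chargeRead(valueLen int) {
	const readCostFlat, readCostPerByte = 1000, 3
	g.consumed += readCostFlat + readCostPerByte*uint64(valueLen)
}

// processDeposit models handling a ZETA deposit. When supplyCheck is true
// (the hypothetical v20.0.6 behavior), it performs one extra store read to
// verify total supply, which silently adds to the gas consumed.
func processDeposit(g *gasMeter, supplyCheck bool) {
	g.chargeRead(32) // read the recipient's balance (illustrative)
	if supplyCheck {
		g.chargeRead(32) // extra supply verification read added in v20.0.6
	}
}

func main() {
	oldNode := &gasMeter{} // validator still on v20.0.5
	newNode := &gasMeter{} // validator that upgraded to v20.0.6

	processDeposit(oldNode, false)
	processDeposit(newNode, true)

	fmt.Println("v20.0.5 gas used:", oldNode.consumed)
	fmt.Println("v20.0.6 gas used:", newNode.consumed)
	// Gas used feeds into each validator's block execution result, so a
	// mismatch like this means the two groups can never agree on the same
	// block, and block production halts once neither side holds >2/3 power.
}
```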
The lengthy recovery time was largely spent resyncing and recovering the nodes that had attempted to run the v20.0.6 release. By the time those nodes had been restored from snapshots, the CometBFT consensus process had gone through 150+ rounds, and the timeout backoff built up over the previously failed rounds made it very difficult for nodes to catch back up with the majority of the network.
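The recovery dynamics follow from how CometBFT schedules its consensus timeouts: each step's timeout grows with the round number, roughly timeout_propose + round * timeout_propose_delta, and similarly for prevote and precommit. The sketch below runs that arithmetic with the stock config.toml defaults (not necessarily ZetaChain's production settings) to show why a node rejoining at round 150+ spends a long time just waiting through each failed round.

```go
package main

import (
	"fmt"
	"time"
)

// Per-step timeouts in CometBFT grow with the round number: roughly
// base + round*delta for propose, prevote, and precommit. The values below
// are the stock config.toml defaults, used here only to illustrate the
// trend; ZetaChain's production settings may differ.
var (
	timeoutPropose        = 3 * time.Second
	timeoutProposeDelta   = 500 * time.Millisecond
	timeoutPrevote        = 1 * time.Second
	timeoutPrevoteDelta   = 500 * time.Millisecond
	timeoutPrecommit      = 1 * time.Second
	timeoutPrecommitDelta = 500 * time.Millisecond
)

// roundWait estimates the worst-case time a validator can spend waiting
// through a single failed round at the given round number.
func roundWait(round int) time.Duration {
	r := time.Duration(round)
	return (timeoutPropose + r*timeoutProposeDelta) +
		(timeoutPrevote + r*timeoutPrevoteDelta) +
		(timeoutPrecommit + r*timeoutPrecommitDelta)
}

func main() {
	for _, round := range []int{0, 10, 50, 150} {
		fmt.Printf("round %3d: up to %v per round\n", round, roundWait(round))
	}
	// By round 150 a single round can take several minutes, so nodes that
	// were restored from snapshots still had to wait out this backoff
	// before they could contribute to a new >2/3 majority.
}
```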
The incident was resolved through the following steps:
Identification of the consensus-breaking change in v20.0.6
Decision to roll back to a previously stable version (v20.0.[3-5])
Coordination among validators to resync their nodes using snapshots
Technical Insights
State reading operations can unexpectedly affect gas calculations
The longer the network is halted, the longer it takes to recover, due to CometBFT's voting backoff
Optimized voting timeouts can significantly improve recovery speed when dealing with a large number of CometBFT consensus rounds.
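To put rough numbers on the last point, the same round-timeout arithmetic shows how much a tuned configuration can shorten each failed round at high round numbers. The "tuned" delta values below are hypothetical, chosen only to show the effect, not settings we recommend or used in production.

```go
package main

import (
	"fmt"
	"time"
)

// timeouts groups the CometBFT per-step consensus timeouts and their
// per-round increments (the timeout_*_delta values in config.toml).
type timeouts struct {
	propose, proposeDelta     time.Duration
	prevote, prevoteDelta     time.Duration
	precommit, precommitDelta time.Duration
}

// worstCaseRound returns the maximum time one failed round can take at the
// given round number: each step's timeout grows by its delta every round.
func (t timeouts) worstCaseRound(round int) time.Duration {
	r := time.Duration(round)
	return (t.propose + r*t.proposeDelta) +
		(t.prevote + r*t.prevoteDelta) +
		(t.precommit + r*t.precommitDelta)
}

func main() {
	defaults := timeouts{ // stock config.toml defaults
		propose: 3 * time.Second, proposeDelta: 500 * time.Millisecond,
		prevote: time.Second, prevoteDelta: 500 * time.Millisecond,
		precommit: time.Second, precommitDelta: 500 * time.Millisecond,
	}
	tuned := timeouts{ // hypothetical reduced deltas for faster recovery
		propose: 3 * time.Second, proposeDelta: 100 * time.Millisecond,
		prevote: time.Second, prevoteDelta: 100 * time.Millisecond,
		precommit: time.Second, precommitDelta: 100 * time.Millisecond,
	}

	const round = 150
	fmt.Println("default config at round 150:", defaults.worstCaseRound(round))
	fmt.Println("tuned deltas at round 150:  ", tuned.worstCaseRound(round))
	// Cutting the deltas shrinks each failed round from several minutes to
	// well under a minute, which is why tuned voting timeouts can
	// significantly speed up recovery after a long halt.
}
```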
Process Improvements
Enhanced testing procedures needed for identifying consensus-breaking changes
More robust deployment processes for patches
Improved automation needed for rapid recovery of ZetaChain-operated nodes
We are implementing several improvements to prevent similar incidents:
Technical Process Improvements
Enhancing testing procedures to identify consensus-impacting changes
Implementing stricter criteria for rapid deployments
Establishing clear processes for rapid patch deployment
Optimizing node recovery procedures
Operational Changes
Enhancing validator coordination procedures
Improving emergency response protocols
Strengthening communication channels with node operators
While this incident resulted in significant downtime, it has provided valuable insights for improving network infrastructure and processes. We remain committed to maintaining the stability and security of the ZetaChain network and will continue to implement these learnings to prevent similar incidents in the future.
We appreciate the community's patience and support during the resolution of this incident, and we thank our validator partners for their swift cooperation in restoring network operations. We are especially grateful to ITRocket and Polkachu, whose small snapshots and fast download speeds helped some operators quickly restore their nodes.