On October 31, 2024, at 4:49 PM PDT, the ZetaChain mainnet experienced a network halt that lasted for approximately 6 hours and 30 minutes. The outage was caused by a consensus failure following a partial deployment of version 20.0.6 of the node software, which contained a new supply verification check that proved to be consensus-breaking despite the initial assessment that it was safe to deploy. The network was successfully restored by rolling back the node software to version 20.0.5 and coordinating with validators who had installed v20.0.6 to resync their nodes.
Complete halt of mainnet block production
Temporary suspension of all cross-chain transactions
Network downtime of 6 hours and 30 minutes
The immediate technical cause was a consensus failure triggered by a new supply verification check introduced in v20.0.6. This feature, intended to add an extra layer of verification for ZETA deposits into ZetaChain, unexpectedly modified the gas calculations for those deposits, leading to consensus disagreement among validators.
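To make the failure mode concrete, here is a minimal, self-contained Go sketch of how an extra state read introduced in one version can change the gas charged for the same transaction, so validators on different versions commit different results for the same block and consensus stalls. This is not ZetaChain's actual code: the handler shape, the function names, and the flat and per-byte read costs are all illustrative.

```go
package main

import "fmt"

// gasMeter mimics, very roughly, how a Cosmos SDK KVStore charges gas for
// reads: a flat cost per read plus a per-byte cost on the value returned.
// The numbers here are illustrative, not ZetaChain's real parameters.
type gasMeter struct{ consumed uint64 }

func (g *gasMeter) chargeRead(valueLen int) {
	const readCostFlat, readCostPerByte = 1000, 3
	g.consumed += readCostFlat + readCostPerByte*uint64(valueLen)
}

// processDeposit models handling a ZETA deposit. When supplyCheck is true
// (the hypothetical v20.0.6 behavior), it performs one extra store read to
// verify total supply, which silently adds to the gas consumed.
func processDeposit(g *gasMeter, supplyCheck bool) {
	g.chargeRead(32) // read the recipient's balance (illustrative)
	if supplyCheck {
		g.chargeRead(32) // extra supply verification read added in v20.0.6
	}
}

func main() {
	oldNode := &gasMeter{} // validator still on v20.0.5
	newNode := &gasMeter{} // validator that upgraded to v20.0.6

	processDeposit(oldNode, false)
	processDeposit(newNode, true)

	fmt.Println("v20.0.5 gas used:", oldNode.consumed)
	fmt.Println("v20.0.6 gas used:", newNode.consumed)
	// Gas used feeds into each validator's block execution result, so a
	// mismatch like this means the two groups can never agree on the same
	// block, and block production halts once neither side holds >2/3 power.
}
```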
The lengthy recovery time was largely spent resyncing and recovering the nodes that had attempted to run the v20.0.6 release. By the time those nodes had been restored from snapshots, the CometBFT consensus process had gone through 150+ rounds, and the timeout backoff built up over the previously failed rounds made it very difficult for nodes to catch back up with the majority of the network.
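The recovery dynamics follow from how CometBFT schedules its consensus timeouts: each step's timeout grows with the round number, roughly timeout_propose + round * timeout_propose_delta, and similarly for prevote and precommit. The sketch below runs that arithmetic with the stock config.toml defaults (not necessarily ZetaChain's production settings) to show why a node rejoining at round 150+ spends a long time just waiting through each failed round.

```go
package main

import (
	"fmt"
	"time"
)

// Per-step timeouts in CometBFT grow with the round number: roughly
// base + round*delta for propose, prevote, and precommit. The values below
// are the stock config.toml defaults, used here only to illustrate the
// trend; ZetaChain's production settings may differ.
var (
	timeoutPropose        = 3 * time.Second
	timeoutProposeDelta   = 500 * time.Millisecond
	timeoutPrevote        = 1 * time.Second
	timeoutPrevoteDelta   = 500 * time.Millisecond
	timeoutPrecommit      = 1 * time.Second
	timeoutPrecommitDelta = 500 * time.Millisecond
)

// roundWait estimates the worst-case time a validator can spend waiting
// through a single failed round at the given round number.
func roundWait(round int) time.Duration {
	r := time.Duration(round)
	return (timeoutPropose + r*timeoutProposeDelta) +
		(timeoutPrevote + r*timeoutPrevoteDelta) +
		(timeoutPrecommit + r*timeoutPrecommitDelta)
}

func main() {
	for _, round := range []int{0, 10, 50, 150} {
		fmt.Printf("round %3d: up to %v per round\n", round, roundWait(round))
	}
	// By round 150 a single round can take several minutes, so nodes that
	// were restored from snapshots still had to wait out this backoff
	// before they could contribute to a new >2/3 majority.
}
```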
The incident was resolved through the following steps:
Identification of the consensus-breaking change in v20.0.6
Decision to roll back to a previously stable version (v20.0.[3-5])
Coordination among validators to resync their nodes using snapshots
Technical Insights
State reading operations can unexpectedly affect gas calculations
The longer the network is halted, the longer it takes to recover, due to CometBFT's voting backoff
Optimized voting timeouts can significantly improve recovery speed when dealing with a large number of CometBFT consensus rounds.
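To put rough numbers on the last point, the same round-timeout arithmetic shows how much a tuned configuration can shorten each failed round at high round numbers. The "tuned" delta values below are hypothetical, chosen only to show the effect, not settings we recommend or used in production.

```go
package main

import (
	"fmt"
	"time"
)

// timeouts groups the CometBFT per-step consensus timeouts and their
// per-round increments (the timeout_*_delta values in config.toml).
type timeouts struct {
	propose, proposeDelta     time.Duration
	prevote, prevoteDelta     time.Duration
	precommit, precommitDelta time.Duration
}

// worstCaseRound returns the maximum time one failed round can take at the
// given round number: each step's timeout grows by its delta every round.
func (t timeouts) worstCaseRound(round int) time.Duration {
	r := time.Duration(round)
	return (t.propose + r*t.proposeDelta) +
		(t.prevote + r*t.prevoteDelta) +
		(t.precommit + r*t.precommitDelta)
}

func main() {
	defaults := timeouts{ // stock config.toml defaults
		propose: 3 * time.Second, proposeDelta: 500 * time.Millisecond,
		prevote: time.Second, prevoteDelta: 500 * time.Millisecond,
		precommit: time.Second, precommitDelta: 500 * time.Millisecond,
	}
	tuned := timeouts{ // hypothetical reduced deltas for faster recovery
		propose: 3 * time.Second, proposeDelta: 100 * time.Millisecond,
		prevote: time.Second, prevoteDelta: 100 * time.Millisecond,
		precommit: time.Second, precommitDelta: 100 * time.Millisecond,
	}

	const round = 150
	fmt.Println("default config at round 150:", defaults.worstCaseRound(round))
	fmt.Println("tuned deltas at round 150:  ", tuned.worstCaseRound(round))
	// Cutting the deltas shrinks each failed round from several minutes to
	// well under a minute, which is why tuned voting timeouts can
	// significantly speed up recovery after a long halt.
}
```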
Process Improvements
Enhanced testing procedures needed for identifying consensus-breaking changes
More robust deployment processes for patches
Improved automation needed for rapid recovery of ZetaChain-operated nodes
We are implementing several improvements to prevent similar incidents:
Technical Process Improvements
Enhancing testing procedures to identify consensus-impacting changes
Implementing stricter criteria for rapid deployments
Establishing clear processes for rapid patch deployment
Optimizing node recovery procedures
Operational Changes
Enhancing validator coordination procedures
Improving emergency response protocols
Strengthening communication channels with node operators
While this incident resulted in significant downtime, it has provided valuable insights for improving network infrastructure and processes. We remain committed to maintaining the stability and security of the ZetaChain network and will continue to implement these learnings to prevent similar incidents in the future.
We appreciate the community's patience and support during the resolution of this incident, and we thank our validator partners for their swift cooperation in restoring network operations. We are especially grateful to ITRocket and Polkachu, whose small snapshots and fast download speeds helped some operators quickly restore their nodes.