On October 31, 2024, at 4:49 PM PDT, the ZetaChain mainnet experienced a network halt that lasted for approximately 6 hours and 30 minutes. The outage was caused by a consensus failure following a partial deployment of version 20.0.6 of the node software, which contained a new supply verification check that proved to be consensus-breaking despite the initial assessment that it was safe to deploy. The network was successfully restored by rolling back the node software to version 20.0.5 and coordinating with validators who had installed v20.0.6 to resync their nodes.
The immediate technical cause was a consensus failure triggered by a new supply verification check introduced in v20.0.6. This feature, intended to add verification of ZETA deposits into ZetaChain, unexpectedly modified the gas calculations for ZETA deposits, leading to consensus disagreement among validators.
The lengthy recovery time was due to resyncing and recovering the nodes that had attempted to run the v20.0.6 release. By the time snapshots had been restored on those nodes, the network had gone through more than 150 rounds of the CometBFT consensus process, and the exponential backoff timer from the previously failed rounds made it very difficult for nodes to catch up with the majority of the network.
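To give a rough sense of why a high round count makes catching up difficult, here is a minimal sketch of a capped doubling backoff. It is illustrative only: `baseTimeout`, `maxTimeout`, and the doubling factor are hypothetical values, and the functions `backoffTimeout` and `totalWait` are not part of CometBFT or the ZetaChain node. The actual timeout schedule is governed by CometBFT's consensus configuration and may grow differently.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical values chosen for illustration only; they are not taken from
// ZetaChain's or CometBFT's configuration.
const (
	baseTimeout = 1 * time.Second
	maxTimeout  = 10 * time.Minute
)

// backoffTimeout returns the wait for a given round under a capped doubling
// schedule: base * 2^round, never exceeding limit.
func backoffTimeout(round int, base, limit time.Duration) time.Duration {
	d := base
	for i := 0; i < round; i++ {
		// Doubling again would exceed the cap, so stop at the cap.
		if d > limit/2 {
			return limit
		}
		d *= 2
	}
	return d
}

// totalWait sums the timeouts of rounds 0 through rounds-1.
func totalWait(rounds int, base, limit time.Duration) time.Duration {
	var total time.Duration
	for r := 0; r < rounds; r++ {
		total += backoffTimeout(r, base, limit)
	}
	return total
}

func main() {
	for _, r := range []int{0, 5, 10, 150} {
		fmt.Printf("round %3d: timeout %v\n", r, backoffTimeout(r, baseTimeout, maxTimeout))
	}
	fmt.Printf("cumulative wait over 150 rounds: %v\n",
		totalWait(150, baseTimeout, maxTimeout))
}
```

Under these assumed parameters the per-round wait reaches the cap after only a handful of rounds. A node that joins late has to line up with peers that are already operating at long timeouts, which is one way to picture the difficulty described above.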
The incident was resolved through the following steps:
Technical Insights
Process Improvements
We are implementing several improvements to prevent similar incidents:
Technical Process Improvements
Operational Changes
While this incident resulted in significant downtime, it has provided valuable insights for improving network infrastructure and processes. We remain committed to maintaining the stability and security of the ZetaChain network and will continue to implement these learnings to prevent similar incidents in the future.
We appreciate the community's patience and support during the resolution of this incident, and we thank our validator partners for their swift cooperation in restoring network operations. Special thanks go to ITRocket and Polkachu, whose small snapshots and fast download speeds helped some operators restore their nodes quickly.