ZetaChain Mainnet Is Not Producing Blocks
Incident Report for ZetaChain
Postmortem

Incident Summary

On October 31, 2024, at 4:49 PM PDT, the ZetaChain mainnet experienced a network halt that lasted for approximately 6 hours and 30 minutes. The outage was caused by a consensus failure following a partial deployment of version 20.0.6 of the node software, which contained a new supply verification check that proved to be consensus-breaking despite the initial assessment that it was safe to deploy. The network was successfully restored by rolling back the node software to version 20.0.5 and coordinating with validators who had installed v20.0.6 to resync their nodes.

Impact

  • Complete halt of mainnet block production
  • Temporary suspension of all cross-chain transactions
  • Network downtime of 6 hours and 30 minutes

Root Cause

The immediate technical cause was a consensus failure triggered by a new supply verification check introduced in v20.0.6. This feature, intended to add further verification of ZETA deposits into ZetaChain, unexpectedly modified the gas calculations for ZETA deposits, leading to consensus disagreement among validators.
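To illustrate how a read-only check can still break consensus: in Cosmos SDK-style chains, store reads are charged to the transaction's gas meter, and gas used typically feeds into the results that validators must agree on. The sketch below is a toy model, not ZetaChain code. `GasMeter`, `Store`, `process_deposit`, and the key names are hypothetical, and the read costs (flat 1000 plus 3 per byte) are an assumption modeled on the Cosmos SDK default KV gas configuration.

```python
from dataclasses import dataclass, field

# Assumed per-read costs, modeled on the Cosmos SDK default KV gas config.
READ_COST_FLAT = 1000
READ_COST_PER_BYTE = 3


@dataclass
class GasMeter:
    """Toy gas meter: every store read is charged to the transaction."""
    consumed: int = 0

    def charge_read(self, key: bytes, value: bytes) -> None:
        self.consumed += READ_COST_FLAT
        self.consumed += READ_COST_PER_BYTE * (len(key) + len(value))


@dataclass
class Store:
    """Toy key-value store whose reads consume gas."""
    data: dict = field(default_factory=dict)

    def get(self, key: bytes, meter: GasMeter) -> bytes:
        value = self.data.get(key, b"")
        meter.charge_read(key, value)
        return value


def process_deposit(store: Store, with_supply_check: bool) -> int:
    """Return the gas used by a hypothetical deposit handler."""
    meter = GasMeter()
    store.get(b"balance/alice", meter)      # read performed by both versions
    if with_supply_check:
        store.get(b"supply/zeta", meter)    # extra read added by the new check
    return meter.consumed


store = Store({b"balance/alice": b"100", b"supply/zeta": b"1000000"})
gas_old = process_deposit(store, with_supply_check=False)
gas_new = process_deposit(store, with_supply_check=True)
print(gas_old, gas_new, gas_old != gas_new)
```

In this model the new check writes nothing, yet the two versions report different gas for the same deposit, so nodes running different versions would disagree on the transaction result.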

The lengthy recovery time was due to resyncing and recovering the nodes that had attempted to run the v20.0.6 release. By the time snapshots were restored on those nodes, the CometBFT consensus process had gone through 150+ rounds, and the exponential backoff timer from previously failed rounds made it very difficult for nodes to catch up with the majority of the network.
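For a rough sense of why a large number of failed rounds is costly, consider how per-round timeouts accumulate. The sketch below assumes the simplest growth model, a fixed increment per failed round (the shape of CometBFT's `timeout_propose` and `timeout_propose_delta` settings); a steeper schedule would only make the effect stronger. The 3 s base and 0.5 s increment are illustrative assumptions, not ZetaChain's actual configuration, and the function names are hypothetical.

```python
def round_timeout(round_number: int, base_s: float = 3.0, delta_s: float = 0.5) -> float:
    """Timeout for one round, assuming a fixed increment per failed round."""
    return base_s + delta_s * round_number


def cumulative_wait(rounds: int, base_s: float = 3.0, delta_s: float = 0.5) -> float:
    """Total time spent if every round from 0 to rounds - 1 times out."""
    return sum(round_timeout(r, base_s, delta_s) for r in range(rounds))


timeout_at_150 = round_timeout(150)
total_150 = cumulative_wait(150)
print(f"timeout at round 150: {timeout_at_150:.1f} s")
print(f"cumulative wait over 150 rounds: {total_150 / 60:.1f} min")
```

Under these assumed values, a node joining at round 150 faces a timeout many times the base value, and a node that falls a round behind has to wait out a long timer before it can try again.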

Resolution

The incident was resolved through the following steps:

  1. Identification of the consensus-breaking change in v20.0.6
  2. Decision to roll back to a previously stable version (v20.0.[3-5])
  3. Coordination among validators to resync their nodes using snapshots
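The coordination in step 3 matters because CometBFT can only commit blocks once validators holding more than two-thirds of the total voting power are in sync and voting on the same state. A minimal sketch of that threshold check follows; the validator names and voting powers are hypothetical, and `has_quorum` is an illustrative helper, not part of any ZetaChain tooling.

```python
def has_quorum(voting_power: dict[str, int], ready: set[str]) -> bool:
    """True if the ready validators hold strictly more than 2/3 of total power."""
    total = sum(voting_power.values())
    ready_power = sum(p for name, p in voting_power.items() if name in ready)
    # Integer arithmetic avoids floating-point rounding at the boundary.
    return 3 * ready_power > 2 * total


# Hypothetical validator set with a total voting power of 100.
powers = {"val-a": 30, "val-b": 30, "val-c": 25, "val-d": 15}

before_resync = has_quorum(powers, {"val-a", "val-b"})           # 60% ready
after_resync = has_quorum(powers, {"val-a", "val-b", "val-c"})   # 85% ready
print(before_resync, after_resync)
```

With 60% of voting power ready, the check fails; once one more validator finishes resyncing and the ready share passes two-thirds, block production can resume.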

Key Learnings

  • Technical Insights

    • State reading operations can unexpectedly affect gas calculations
    • The longer the network is halted, the longer it takes to recover, due to CometBFT's voting backoff
    • Optimized voting timeouts can significantly improve recovery speed when dealing with a large number of CometBFT consensus rounds
  • Process Improvements

    • Enhanced testing procedures needed for identifying consensus-breaking changes
    • More robust deployment processes for patches
    • Improved automation needed for rapid recovery of ZetaChain operated nodes

Preventive Measures

We are implementing several improvements to prevent similar incidents:

  • Technical Process Improvements

    • Enhancing testing procedures to identify consensus-impacting changes
    • Implementing stricter criteria for rapid deployments
    • Establishing clear processes for rapid patch deployment
    • Optimizing node recovery procedures
  • Operational Changes

    • Enhancing validator coordination procedures
    • Improving emergency response protocols
    • Strengthening communication channels with node operators
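One possible shape for the testing improvement listed above is a differential test: replay the same transactions through the current release and the candidate release, then compare a digest of the deterministic results. The sketch below is an assumption about how such a check could look, not a description of ZetaChain's test suite. The handlers, transaction types, and gas figures are all hypothetical.

```python
import hashlib
import json
from typing import Callable

Handler = Callable[[dict], dict]  # maps a transaction to its result


def results_digest(handler: Handler, txs: list[dict]) -> str:
    """Hash the ordered, deterministic results of running every transaction."""
    results = [handler(tx) for tx in txs]
    encoded = json.dumps(results, sort_keys=True).encode()
    return hashlib.sha256(encoded).hexdigest()


def is_consensus_compatible(old: Handler, new: Handler, txs: list[dict]) -> bool:
    """True if both versions produce identical results for the same input."""
    return results_digest(old, txs) == results_digest(new, txs)


def handler_old(tx: dict) -> dict:
    # Hypothetical current release: flat gas usage.
    return {"code": 0, "gas_used": 1000}


def handler_new(tx: dict) -> dict:
    # Hypothetical candidate: an extra state read on ZETA deposits.
    extra = 1000 if tx["type"] == "zeta_deposit" else 0
    return {"code": 0, "gas_used": 1000 + extra}


transfers_only = [{"type": "transfer"}]
with_deposit = [{"type": "transfer"}, {"type": "zeta_deposit"}]

print(is_consensus_compatible(handler_old, handler_new, transfers_only))
print(is_consensus_compatible(handler_old, handler_new, with_deposit))
```

The two calls give different answers: the divergence only shows up when the replayed set includes the affected transaction type, so a test like this is only as good as the coverage of the transactions it replays.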

Conclusion

While this incident resulted in significant downtime, it has provided valuable insights for improving network infrastructure and processes. We remain committed to maintaining the stability and security of the ZetaChain network and will continue to implement these learnings to prevent similar incidents in the future.

We appreciate the community's patience and support during the resolution of this incident, and we thank our validator partners for their swift cooperation in restoring network operations. We are especially grateful to ITRocket and Polkachu, whose small snapshots and fast download speeds helped some operators quickly restore their nodes.

Posted Nov 04, 2024 - 15:47 UTC

Resolved
The network is successfully producing blocks and all ZetaChain transactions are being processed as expected. We will share a postmortem tomorrow, once we have reviewed the incident and replicated the root cause in a development environment.
Posted Nov 01, 2024 - 06:00 UTC
Monitoring
The network is producing blocks again! We are monitoring for any lingering effects and reviewing the pending cross-chain transactions.

Our explorer and indexer will be back online soon.

All new ZetaChain transactions will be processed as expected.
Posted Nov 01, 2024 - 05:52 UTC
Identified
The network has been halted because of a consensus failure related to a new update that was rolled out to a small subset of validators. We believe we have identified the root cause of that consensus failure and will provide a more detailed update on the root cause after we have replicated the issue on a development network. The current priority is restoring network liveness.

We have rolled back the validators we manage to the v20.0.5 version of the node software and have asked the community to do the same. Most of the community has completed this and we're seeing ~60% participation with the correct version of the software.

A few validators were unable to participate in the vote after attempting to run the consensus-breaking version (v20.0.6) and are resyncing from snapshots. Once those remaining validators are synced, we expect the network to resume.

We will provide another update in 60 minutes and are hopeful the network will be restored before that time.
Posted Nov 01, 2024 - 04:23 UTC
Update
Engineers are engaged in troubleshooting the issue, and are actively working to restore network consensus.
Posted Nov 01, 2024 - 02:40 UTC
Investigating
We are currently investigating an issue on ZetaChain Mainnet that has stopped block production on the network.
Posted Oct 31, 2024 - 23:53 UTC
This incident affected: Mainnet (ZetaChain Mainnet).