There have been a few operational incidents in which large validators on the Cosmos Hub experienced prolonged downtime. These incidents haven’t been problematic for the Cosmos Hub, which has thus far proved resilient, despite individual validator failures. And interestingly, they’re not necessarily problematic for the individual validators themselves — at least not when it comes to rewards.
In the Cosmos Hub 2 genesis parameters, a validator in the active set still receives block rewards while offline, and only misses proposer rewards. That means an offline validator can remain in the active set and receive rewards for nearly 14 hours, according to Felix Lutsch from Chorus One.
Correction: The mechanism for downtime slashing after 9,500 of 10,000 missed blocks in a rolling window is in the protocol, whereas the number of missed blocks and the slashing percentage are in the genesis parameters. Thanks to Joe Bowman for alerting us to the need to correct this!
Every validator is going to suffer incidents from time to time, though Figment’s operations include measures to make that as unlikely as possible. With several somewhat high-profile downtime incidents happening in such quick succession, we thought it would be a good time for validators to start talking openly about their infrastructure and operations. So let’s get started!
Here’s our annotated disclosure for the Figment Cosmos validator.
Hardware and Cloud Services ⛈
We are not in favour of cloud-only Cosmos validator operations. There is currently no support for the required signing curve in any cloud-based HSM service, which means that any cloud-based validator is by necessity storing plaintext private key material on disk and in memory on their cloud instances. For Figment, this is an unacceptable risk.
Our hardware nodes are Dell PowerEdge servers all with dual power supplies and dual network interfaces, and ZFS (Zettabyte File System) RaidZ arrays on SSDs (solid-state drives).
Our cloud instances vary in configuration, depending on cloud provider. They are generally similar in capacity to an AWS m5.large. The infrastructure is spread between AWS, Google Cloud, Digital Ocean, OVH and Linode. Our cloud sentry nodes are dynamic, and can be spun up as needed. By nature, sentry nodes are vulnerable to DoS attacks, so it is important to spread nodes across cloud providers to mitigate this risk.
Operations and Uptime ⏳
In general, a very simple, insecure and high risk validator operation will look perfect until it suddenly isn’t.
Uptime is a term that is open to interpretation, and can be very misleading, particularly because the Tendermint PBFT protocol is tolerant of individual validator failures. The Cosmos Hub currently jails a validator that misses 95% of a rolling 10,000 block window. This corresponds to a failure of roughly 16 hours. While best avoided, outages shorter than the jailing window are deemed to be safe, as indicated by the Cosmos Hub genesis parameters.
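The arithmetic behind that figure can be sketched as follows. The window and threshold are the values quoted above; the 6-second average block time is our assumption for illustration and is not a genesis parameter.

```python
# Rough estimate of how long a validator can stay offline before jailing.
SIGNED_BLOCKS_WINDOW = 10_000    # rolling window, as quoted in the text
MISSED_PERCENT_TO_JAIL = 95      # jailed after missing 95% of the window
ASSUMED_BLOCK_TIME_S = 6.0       # assumption: average seconds per block


def blocks_until_jail(window: int = SIGNED_BLOCKS_WINDOW,
                      missed_percent: int = MISSED_PERCENT_TO_JAIL) -> int:
    """Number of missed blocks within the window that triggers jailing."""
    return window * missed_percent // 100


def hours_until_jail(block_time_s: float = ASSUMED_BLOCK_TIME_S) -> float:
    """Continuous downtime, in hours, needed to miss that many blocks."""
    return blocks_until_jail() * block_time_s / 3600


if __name__ == "__main__":
    print(blocks_until_jail())             # 9500
    print(round(hours_until_jail(), 1))    # 15.8
```

The result is sensitive to block time: with an assumed average of about 5.3 seconds per block, the same calculation gives roughly 14 hours.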
The multi-tier HSM:IDC:CLOUD architecture that Figment and other high quality professional validators operate introduces latency at each step, which makes missed blocks more frequent than they will be for a validator running with hot keys on an exposed cloud instance.
Several of the best validators on the Cosmos Hub have experienced outages related to their redundant key signing infrastructure, or the complexity of their sentry architectures. Figment was a genesis validator on cosmoshub-1 and cosmoshub-2, and so far we have had no significant downtime events. Though an outage of this kind has not yet happened to us, it likely will. We are making a security over liveness trade-off, which we think is the right trade to make. It is our view that short periods of downtime may actually be correlated with validator quality.
Monitoring practices
Figment employs both internal and external monitoring and alerting. We use internal agent-based monitoring and alerting, and we use Hubble, our open source Cosmos block & validator explorer for external alerting.
Beyond standard agent-based server and network monitoring, we instrument the gaiad process and generate alerts for cautionary conditions. For example, we track mempool size, number of active peer sessions, and other internal metrics in order to generate alerts for issues that could lead to impairment. We use the RPC (remote procedure call) interface to monitor block-by-block on our active validator and use statsd to send a count of signed blocks to our internal agent-based monitoring. Alerts are generated if the signed block rate falls below the expected rate, or if signed blocks are not reported.
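As a rough illustration of that block-by-block check, here is a minimal sketch, not our production code. It assumes the input is the parsed "block" object from a Tendermint RPC /block response, and the metric names, host and port are hypothetical. The counter is encoded in the plain statsd line format and sent over UDP.

```python
import socket


def validator_signed(block: dict, validator_address: str) -> bool:
    """Return True if validator_address appears in the block's last commit.

    Assumption: `block` is the parsed "block" object from a Tendermint RPC
    /block response. Older releases list votes under "precommits", newer
    ones under "signatures"; absent votes are null or carry an empty address.
    Note that the last commit in block H holds the votes for block H-1.
    """
    commit = block.get("last_commit") or {}
    votes = commit.get("precommits") or commit.get("signatures") or []
    return any(
        vote is not None and vote.get("validator_address") == validator_address
        for vote in votes
    )


def statsd_counter(metric: str, value: int = 1) -> bytes:
    """Encode a counter increment in the statsd line format: name:value|c."""
    return f"{metric}:{value}|c".encode("ascii")


def report(block: dict, validator_address: str,
           host: str = "127.0.0.1", port: int = 8125) -> str:
    """Send one counter per block; metric names here are hypothetical."""
    signed = validator_signed(block, validator_address)
    metric = "validator.blocks.signed" if signed else "validator.blocks.missed"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(statsd_counter(metric), (host, port))
    return metric
```

Because statsd is fire-and-forget over UDP, a dead monitoring agent never blocks the checker. That is also why an alert on signed blocks not being reported at all is needed alongside the rate alert.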
In addition to internal agent-based monitoring, Hubble provides an external view of validator health by ingesting the blockchain. Hubble’s operations are independent and separate from Figment’s validator infrastructure. Alerts are generated if Figment’s validator misses 150 consecutive blocks or 50 of 1,000 blocks.
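The two alert rules can be sketched as a small stateful checker. This is an illustrative sketch rather than Hubble's actual implementation; the class and rule names are hypothetical, and only the thresholds come from the text.

```python
from collections import deque


class MissedBlockAlert:
    """Tracks per-block results and reports which alert rules are firing."""

    def __init__(self, max_consecutive: int = 150,
                 max_in_window: int = 50, window: int = 1000):
        self.max_consecutive = max_consecutive
        self.max_in_window = max_in_window
        self.recent = deque(maxlen=window)   # True means the block was missed
        self.consecutive = 0                 # current run of missed blocks

    def observe(self, signed: bool) -> list:
        """Record one block and return the names of the rules that fire."""
        missed = not signed
        self.recent.append(missed)
        self.consecutive = self.consecutive + 1 if missed else 0

        fired = []
        if self.consecutive >= self.max_consecutive:
            fired.append("consecutive")
        if sum(self.recent) >= self.max_in_window:
            fired.append("window")
        return fired
```

In this sketch a continuous outage trips the windowed rule first, at the 50th missed block, and the consecutive rule follows at the 150th. The windowed rule also catches intermittent misses that never form a long run.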
On-call practices
Our ‘on call’ role is currently shared on a rotation by the Figment team. We use PagerDuty to manage SMS paging as a backup to e-mail notifications. We operate under the general philosophy that if we are being paged, there is an architectural problem that needs addressing. To date we have had one emergency page since launching our first CosmosSDK mainnet.
Figment’s server access is by SSH (secure shell) keys, with bastion servers and IP restrictions or VPNs. Our IDC facility is SSAE 18 / ISAE 3402 / CSAE 3416 / SOC 2 certified.
Access to validator private key
We use the YubiHSM2, the only HSM (hardware security module) solution that is currently supported by the CosmosSDK. Our consensus keys were generated on an HSM, and were then replicated to redundant HSMs using an air-gapped laptop. The Tendermint Key Management System (KMS) manages the HSMs.
Role-based accounts are used to manage HSM access, with least privileged account credentials on production infrastructure. Production KMS instances have credentials allowing for signatures to be generated, and no other access to the HSMs. More privileged credentials are stored offline, and are only used to create new HSMs when needed. Key backups and credential recovery materials are stored in a bank vault.
Worst-Case Scenario ☢️
Compromise of a validator’s private consensus key is a threat that cannot be well managed in the Cosmos architecture. There is no ability to rotate or revoke keys, so the toolset is limited. The worst case outcome of a key compromise event is a malicious double sign, resulting in the slashing and tombstoning of the validator.
Figment operates a mixed architecture of IDC (Internet data centre) hosted hardware, supported by cloud nodes spread across five cloud service providers. Our IDC network is interconnected to our cloud providers by a mixed network of direct connections and VPNs (virtual private networks). The validators are on RFC1918 IP networks, and peer-only with cloud-based sentry nodes that we control.
Our IDC facility is built to Tier III / IV standards, with 2N power and cooling and full A+B power and network paths.
Our sentry nodes are operating in North America, Europe and Asia. In addition to public-facing sentry nodes, we operate a private peering node via AWS (Amazon Web Services) that peers with other capable private validator nodes via VPC (virtual private cloud) peering.
Node Configuration
While cloud nodes can be dynamic, Figment’s fleet of cloud nodes does not change frequently. Thus our fleet of Cosmos sentries is not updated frequently, and does not require autoscaling or rapid redeployment. Basic tasks like snapshots and system backups are automated.
We manually configure our nodes. When Cosmos software updates are required, we first develop update processes and create update scripts on our testnet nodes, and then we manually deploy these update scripts on production nodes one at a time.
We’re in the process of bringing all of our cloud nodes into an Ansible-based automation system for maintenance. However, due to the relatively small number of nodes involved and the immaturity of the Cosmos software, we expect the manual work involved in Cosmos updates and maintenance to remain significant for the foreseeable future.
Full Disclosure
There it is! I think that’s everything.
If you’re a validator, we hope that this disclosure will help inform your operations and infrastructure decisions. If you’re a delegator, we hope that this will help inform the questions you ask the validator(s) you choose to delegate to. Overall, we hope that other validators will follow suit with full disclosure. Let’s get talking about how to make staking on Cosmos as secure as possible, and let’s be open books for delegators to make informed decisions.
Hopefully you found this useful. Feedback is always welcome!