Todd G

Building Validators for Cosmos Chains

What happens behind the scenes when Block Pane joins a Cosmos/Tendermint chain as a validator? How many nodes, what configuration, and why should someone choose to delegate to us?

Nodes

The base configuration we start with is four nodes: three with publicly accessible P2P, and one completely isolated via a WireGuard mesh VPN. This is the minimum, and we will likely build more if there is a need.

  • The validator: this is always on high-end hardware. Right now, the preferred configuration is an AMD 5950X system with 128 GiB of ECC RAM and SAS3 enterprise SSDs. This setup isn’t always possible because we distribute our nodes globally and use many different providers, so some of our nodes are still on Xeons, or we end up with enterprise SATA drives. As of mid-2021, the 5950X has one of the highest single-thread benchmark scores (more critical for EOSIO chains than Cosmos, but that’s a topic for another time) and many cores, but it isn’t as tunable as a Xeon (forcing a minimum clock speed, for example), so if you are reading this after 2021, it’s likely we’ve already moved on. The validator shares resources with other chains, but we ensure it has plenty of extra capacity. We do not add any new chains once the node reaches 1/2 of its RAM in active use or an average load equal to 1/4 of the physical cores. (For example, the 5950X has 16 cores, so an average load of 4, or 64 GiB of RAM in active use, is considered at capacity.)
  • Seed node: we prefer to place our seed nodes in the cloud. Why? Because most of our peers are also in the cloud, and we want the seed node to be as close to them and as widely connected as possible. A seed node keeps a few persistent connections open because it needs to stay synced, so it also doubles as a sentry node. Seed nodes are dedicated 1:1 per chain but are minimally provisioned because compute resources in the cloud are expensive.
  • Full node: we always dedicate one exclusive bare-metal node for each chain. It is unlikely to be a high-performance server, but it does insulate the chain from a spike in activity elsewhere.
  • Full Node/Warm Standby: the last node is also a full node and publicly accessible, but it runs on high-end systems with the same specifications as our validator node. This node is a warm standby, meaning that production failover is not automatic and a human has to rotate keys, because the penalty for equivocation is much higher than for downtime. This accessibility is a calculated risk: if there is a failure on our primary validator, producing from a publicly accessible node for a short while is acceptable.
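
The capacity rule for the shared validator host is simple enough to write down as a check. The following is only a minimal sketch of that rule in Python: the `HostLoad` type, its field names, and the `at_capacity` function are hypothetical and chosen for illustration, and treating both thresholds as "greater than or equal" is an assumption based on the 5950X example above.

```python
from dataclasses import dataclass


@dataclass
class HostLoad:
    """Snapshot of a validator host (hypothetical type, for illustration)."""
    physical_cores: int    # physical cores, not hardware threads
    total_ram_gib: float   # installed RAM
    used_ram_gib: float    # RAM in active use
    avg_load: float        # average system load


def at_capacity(host: HostLoad) -> bool:
    """Return True when no further chains should be added to this host.

    Assumed thresholds: RAM in active use at or above 1/2 of the total,
    or an average load at or above 1/4 of the physical cores.
    """
    ram_limit = host.total_ram_gib / 2
    load_limit = host.physical_cores / 4
    return host.used_ram_gib >= ram_limit or host.avg_load >= load_limit


# The 5950X example: 16 cores and 128 GiB of RAM.
busy_cpu = HostLoad(physical_cores=16, total_ram_gib=128, used_ram_gib=20, avg_load=4.0)
busy_ram = HostLoad(physical_cores=16, total_ram_gib=128, used_ram_gib=64, avg_load=1.0)
roomy = HostLoad(physical_cores=16, total_ram_gib=128, used_ram_gib=40, avg_load=2.5)
```

With these inputs, `at_capacity` returns True for `busy_cpu` and `busy_ram` and False for `roomy`. Either limit alone is enough to stop onboarding, which matches the "or" in the rule.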

Location

Monitoring

  • We rely on PagerDuty for alerting and escalation.
  • We also use a range of other services, depending on what is needed. Some examples: Zapier can monitor a git repository for new releases and alert via PagerDuty, and we use Uptime Robot, Route53 Health Checks, or serverless functions for custom monitoring jobs.
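
To give a feel for what a custom monitoring job might look like, here is a hedged sketch in Python, not a description of our actual jobs. It takes the JSON returned by a Tendermint node's `/status` RPC endpoint and, if the node is catching up or its latest block is too old, builds a trigger event in the shape the PagerDuty Events API v2 expects. The function names, the 60-second threshold, and the `critical` severity are assumptions for illustration, and fetching the status and posting the event are deliberately left out.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_block_time(value: str) -> datetime:
    """Parse a Tendermint timestamp, dropping the fractional seconds."""
    seconds_part = value.rstrip("Z").split(".")[0]
    parsed = datetime.strptime(seconds_part, "%Y-%m-%dT%H:%M:%S")
    return parsed.replace(tzinfo=timezone.utc)


def build_alert(status: dict, now: datetime, routing_key: str, source: str,
                max_lag: timedelta = timedelta(seconds=60)) -> Optional[dict]:
    """Return a PagerDuty Events API v2 trigger event, or None if healthy.

    `status` is the decoded JSON body of the node's /status response.
    The 60 second default for `max_lag` is an assumed threshold.
    """
    sync = status["result"]["sync_info"]
    lag = now - parse_block_time(sync["latest_block_time"])
    if not sync["catching_up"] and lag <= max_lag:
        return None
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": f"{source}: latest block is {int(lag.total_seconds())}s old "
                       f"(catching_up={sync['catching_up']})",
            "source": source,
            "severity": "critical",
        },
    }


# Example input: a trimmed /status response with a block that is five minutes old.
now = datetime(2021, 7, 1, 12, 5, 0, tzinfo=timezone.utc)
stale = {"result": {"sync_info": {
    "latest_block_time": "2021-07-01T12:00:00.123456789Z",
    "catching_up": False,
}}}
event = build_alert(stale, now, routing_key="EXAMPLE_KEY", source="example-node")
```

A scheduled job, such as a serverless function, could fetch `/status`, call `build_alert`, and POST any non-None result to the Events API. Keeping the decision logic in a pure function makes it easy to test without a live node.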

If you made it this far, thanks for reading. Right now, we are active on the Osmosis and Sifchain networks, with more planned. Given the care we put into building a validator, it understandably takes a while to onboard new chains. Vote for us here:

Blockchain and Security Enthusiast