Every day, in some Telegram/Discord/[insert service here] validator chat, someone asks the question: how do I become a validator? We also hear the laments that the big validators only continue to get more votes, while the little guys can’t seem to win. For the latter, there is often a good reason, and that’s reliability! New validators that haven’t run enterprise-class systems are a risky investment. To compete as a small, independent validator, we have invested significant time into ensuring reliability. Hopefully, this article can help some new validators and reassure potential stakers that Block Pane is a safe choice.
What’s the key to reliability? Monitoring! The first thing that any new validator should be thinking about is how to monitor their performance. In my opinion, there are six dimensions to take into account for maintaining a successful validation operation:
- Is the node producing blocks?
- Is everything running?
- How healthy are my systems?
- Is the system secure?
- Are all public endpoints available? **
- Governance, or … don’t miss a hard-fork!
** This goes beyond the topic of monitoring, but as a successful validator, you will likely provide several services such as public API endpoints, block explorers, or tools that your team or others have created for the community.
In short: is the operation effective, operational, continual, trustworthy, and available?
And maybe the most important: are the right people aware of any of the above? (So, really, seven dimensions, with awareness being the final key.) All of the systems described below are plugged into PagerDuty, with escalating notification policies that work through multiple contact methods when there is an alarm.
Is the node producing blocks?
It seems like the easiest question of the lot, but it’s often the hardest to answer. If you are sitting on Mintscan refreshing the page repeatedly to see if you missed blocks, you are doing it wrong. Catching missed blocks is a hard problem on every chain Block Pane operates, and some chains make a failure harder to detect, or harder to handle. For the three chains we validate (or once did, in the case of Kusama), it’s a very different problem on each. For example, on FIO (an eosio-based chain), our monitoring tool, fio-bp-standby, handles failover and notification/escalation duties. It usually succeeds in under five missed blocks, ensuring our block producer never misses an entire round. In this case, detection = remediation. And when the original producer recovers, it falls back. This is possible because double-signed blocks aren’t penalized on FIO, and failing over results in at most a single double-signed block. On other chains, however, a double-signed block can be a serious offense.
For both Polkadot and Cosmos-SDK chains, we’ve opted to handle failover manually, so detection must be swift and reliable, someone must be on call to deal with the situation, and a playbook must be available for handling the actual failover. (A new threshold-signing tool, Horcrux, may eliminate this problem.) In these cases, we’ve also resorted to writing tools that specifically detect failures. One such example is Tenderduty, a service that sends alerts to PagerDuty when blocks are missed. It accepts a list of RPC endpoints, so it isn’t reliant on a single node. Another we provide is the 1kv-exporter, a Prometheus exporter that allows alarms to be set through Grafana.
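To make the detection idea concrete, here is a minimal sketch of the approach a tool like Tenderduty takes on a Tendermint chain: poll an RPC `/block` endpoint and check whether your validator’s address appears among the precommit signatures. This is not Tenderduty’s actual code; the RPC URL, validator address, and threshold below are placeholders, and the response shape is the standard Tendermint block JSON.

```python
# Sketch of missed-block detection on a Tendermint chain (the idea behind
# tools like Tenderduty). Polls an RPC /block endpoint and checks whether our
# validator's hex address appears among the precommit signatures.
import json
import urllib.request

RPC_URL = "http://localhost:26657"   # assumed Tendermint RPC endpoint
VALIDATOR_ADDR = "AABBCC"            # placeholder hex address for our validator
ALERT_THRESHOLD = 5                  # consecutive misses before paging someone


def signed_block(block: dict, address: str) -> bool:
    """Return True if `address` signed the precommits recorded in this block."""
    sigs = block["block"]["last_commit"]["signatures"]
    return any(s.get("validator_address") == address for s in sigs)


def count_consecutive_misses(blocks: list, address: str) -> int:
    """Count misses at the tail (most recent end) of a window of blocks."""
    misses = 0
    for block in reversed(blocks):
        if signed_block(block, address):
            break
        misses += 1
    return misses


def fetch_block(height=None) -> dict:
    """Fetch one block from the RPC endpoint (latest if height is None)."""
    url = f"{RPC_URL}/block" + (f"?height={height}" if height else "")
    with urllib.request.urlopen(url, timeout=5) as resp:
        return json.loads(resp.read())["result"]
```

In a real deployment the loop would fetch recent blocks from several RPC endpoints and page PagerDuty once `count_consecutive_misses` crosses the threshold.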
Regardless of the method used, if a validator isn’t producing blocks and gets jailed, what use is the rest of the monitoring? This is the number-one priority of any validator: if precommits are missed and blocks go unproposed, you have failed, and if you aren’t even aware of it, it’s time to consider a new way of spending your time.
Is everything running?
In the beginning, with only a few systems, it’s easy to keep an eye on everything. But once an operation is running 20 or more servers, ensuring every service is running is a challenge. For example, we use Docker for running many ancillary services, such as oracles, voting on FIO, monitoring, building public reports, and an array of websites. As I write this, there are more than 60 different containers we run, and each one is important. In another case, because we primarily run on hardware, monitoring the wear leveling of drives is critical because Cosmos-SDK-based blockchains chew up drives. (It’s also important to know which drives a provider uses and to prefer those with datacenter-quality drives; not all providers are equal, but that’s a topic for another day.) Are nodes syncing? Do they have enough peers? All these (and lots more) are important questions.
Although it’s not as commonly used, we chose M/Monit for watching services for several reasons. First, writing custom checks is dead simple, even for complex health checks: put together a shell script that outputs some info and exits with a non-zero code when failing. The main reason, however, is that monit can fix things. Daemon crashed? Restart it! Node lost all peers for 6 minutes? Restart it! It also supports hierarchical relationships between services: if a blockchain node has failed, monit can be configured not to restart a dependent web service that consumes the node’s API, even though that service is also failing.
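As a rough illustration of what this looks like, here is a sketch of a monit configuration pairing a supervised process with a custom health-check script. The service name, pidfile, and script path are all hypothetical; the structure (a `check process`, a `check program` wrapping a shell script, and a `depends on` relationship) is the pattern described above.

```
# /etc/monit/conf.d/gaiad — hypothetical service names and paths
check process gaiad with pidfile /var/run/gaiad.pid
  start program = "/usr/bin/systemctl start gaiad"
  stop program  = "/usr/bin/systemctl stop gaiad"
  if does not exist then restart

# custom check: the script exits non-zero when the node has no peers
check program gaiad-peers with path "/usr/local/bin/check-peers.sh" every 2 cycles
  if status != 0 for 3 cycles then alert
  # if gaiad itself is down, don't bother alerting on the peer check
  depends on gaiad
```

The `for 3 cycles` qualifier is what prevents a single transient failure (one slow RPC response, say) from paging anyone.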
The `monit` tool itself is free and open source. M/Monit, however, is a commercial product, though quite affordable. It lacks many features of other commercial monitoring tools, and the UI is rather spartan, but it does what we need and no more, so despite the cost I have chosen it over similar systems such as alerta (which is also a great choice).
How healthy are my systems?
There is one obvious choice for monitoring system performance, capacity, and load: Prometheus and Grafana. Additionally, most blockchain nodes are built with Prometheus exporters, giving visibility into the node’s operation. There are two essential exporters: cAdvisor for Docker containers and prometheus-node-exporter for Linux hosts.
There are many things that Prometheus can monitor. Some of the most important that I track are block I/O (have I oversubscribed a node? is an SSD at risk of failing?), system temperature (has enabling the performance governor put the machine at risk of thermal throttling?), and of course all the usual information: bandwidth, CPU load, and memory consumption. It’s all important enough that I keep a dedicated second monitor with Grafana running at all times.
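To give a flavor of the disk-I/O and temperature checks mentioned above, here is a sketch of Prometheus alerting rules built on two real node-exporter metrics (`node_disk_io_time_seconds_total` and `node_hwmon_temp_celsius`). The thresholds, durations, and labels are illustrative assumptions, not recommendations; tune them to your hardware.

```yaml
# Prometheus alerting rules (sketch) — thresholds are examples only
groups:
  - name: host-health
    rules:
      - alert: DiskSaturated
        # fraction of time the disk spent busy over the last 5 minutes
        expr: rate(node_disk_io_time_seconds_total[5m]) > 0.9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Disk on {{ $labels.instance }} is saturated"
      - alert: HighTemperature
        expr: node_hwmon_temp_celsius > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "{{ $labels.instance }} may be thermal throttling"
```

The same rules can be pointed at Alertmanager’s PagerDuty receiver, or recreated as Grafana alerts if you prefer managing them in the dashboard UI.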
Is the system secure?
I’ve worked in security for more than two decades, so of course it was bound to get a mention :) But before talking about security monitoring, it’s essential to point out that all of these monitoring systems create additional attack surface. We use WireGuard on every node and bind the various listeners to the VPN’s network interface, which ensures remote access is only possible over the VPN. A handy tool for setting up a complex mesh network is wg-meshconf.
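A minimal sketch of that pattern, with placeholder addresses and keys: give each node a WireGuard address, then start each listener bound to that address rather than 0.0.0.0 so it never answers on the public interface.

```
# /etc/wireguard/wg0.conf (fragment) — keys and addresses are placeholders
[Interface]
Address = 10.10.0.2/24
ListenPort = 51820
PrivateKey = <redacted>

[Peer]
# the monitoring host
PublicKey = <monitoring-host-public-key>
AllowedIPs = 10.10.0.1/32
```

With the tunnel up, node-exporter (for example) can be started with `--web.listen-address=10.10.0.2:9100`, so metrics are reachable only over the VPN.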
Back to the monitoring discussion! There are three primary questions I want to answer:
- Are my nodes fully patched?
- Did I properly lock down network access?
- Is anything bad happening on the nodes?
There is some overlap between the tools used for these questions:
Of course, the easiest way to keep a system patched is to install the `unattended-upgrades` package for the OS, which handles most updates with no intervention. But there are a couple of issues with this. First, unattended kernel updates can fill up the `/boot` partition. Second, some services are critical (such as Docker or containerd), so restarting them at random times is a problem; these packages often get an `apt hold` and won’t be patched automatically. Knowing this matters when there is a critical kernel update or a DoS vulnerability in Docker. Also, by default, unattended-upgrades only installs packages tagged as security updates. So what to do?
Prometheus can help here: node-exporter can track pending apt packages, and an alarm in Grafana can surface non-security updates (though note it will not report pinned packages).
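One caveat worth spelling out: node-exporter does not produce apt metrics on its own. They typically come from its textfile collector, fed by a small cron script (the node-exporter community’s `apt.sh` helper emits a gauge along the lines of `apt_upgrades_pending`). Assuming that setup, an alert expression might look like:

```
# PromQL — assumes the textfile collector is populated by a script
# exporting apt_upgrades_pending (metric name depends on the script used)
sum by (instance) (apt_upgrades_pending) > 0
```

Wire that into a Grafana alert with a long enough evaluation window that routine update churn doesn’t page anyone at 3 a.m.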
The other tool I use is Wazuh, which combines an OSSEC agent, a correlation daemon, and an ELK stack with custom dashboards. One item it can track, though it has to be manually enabled, is vulnerable packages. It pulls this information from several sources, but in my experience the NVD detection module produces many false positives and is less valuable.
Wazuh can help here too, but only to a limited degree. The easiest way to tell that a system hasn’t been locked down is alerts for brute-force attacks against SSH. That’s only marginally helpful, though, and there is no substitute for a port scan.
Shodan is a service that scans the entire Internet. I’m not sure what the pricing is nowadays (I signed up for a lifetime account many years ago), but it can send notifications when there are changes to your hosts. The downside is that it only scans for popular services, so accidentally exposing an RPC service won’t be noticed.
Ultimately, there is only one real solution, and that’s manual scanning. Every once in a while, I’ll bring up a short-lived VPS somewhere and run a full 65,535-port nmap scan of all my hosts. It takes hours but is a necessary exercise; sometimes even I make mistakes (like forgetting to enable netfilter-persistent).
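For reference, a full-range scan like the one described might look like the following command-line fragment; `hosts.txt` is a hypothetical file listing one address per line, and the timing template is a judgment call.

```shell
# Full TCP port scan from an outside vantage point:
#  -Pn  skip host discovery (assume hosts are up)
#  -p-  scan all 65,535 TCP ports
#  -T4  aggressive timing (fine from a throwaway VPS)
#  -iL  read targets from a file; -oA writes results in all formats
nmap -Pn -p- -T4 -iL hosts.txt -oA fullscan
```

Diffing the `-oA` output against the previous run is an easy way to spot a port that quietly opened since the last scan.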
I have been working on a tool/web service to help with this. The idea is that it will test for various blockchain RPC/P2P endpoints and report back: send a curl from the system and get a report on whether P2P is open or RPC/LPC is exposed (and has unsafe features enabled, like the producer plugin on eosio or the ability to dial peers on Tendermint). Alas, it isn’t done yet, but I’ll be sure to post here when it is.
Finally, there are many services out there that perform scanning for a fee. Some are free for a limited number of nodes, but overall I found the pricing too high.
Host Intrusion Detection
Wazuh is primarily a HIDS, and that’s what it does best. It will catch many bad behaviors and plugs easily into PagerDuty. Tuning it is a bit painful, but the one time it catches a compromise early, all the effort will have been worth it.
Another tool that could be useful is Qualys, which is reasonably priced for server endpoint protection and which I have used before with good results. Unfortunately, when I tried to sign up for the service, it never provided login credentials, and when I contacted their support, they did not reply; I did, however, get the privilege of being added to their email marketing list. Sigh.
Are all public endpoints available?
Even when all systems are operating nominally, they may not be available to users. For example, a network provider may have peering issues that leave part of the Internet unreachable, or a CDN provider might be having trouble. The two options I use for this are AWS’s Route 53 health checks and UptimeRobot.
Route 53 is a little less convenient since it requires an AWS Lambda function to interface with PagerDuty, but it has the upside of doing checks from many regions globally.
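The Lambda glue mentioned above is small: a Route 53 health-check alarm publishes to SNS, and the function reposts the alarm to the PagerDuty Events API v2 endpoint. Here is a hedged sketch of one way to write it; the routing key is a placeholder, and the field mapping is one reasonable choice rather than a required schema.

```python
# Sketch: Lambda handler bridging a CloudWatch alarm (via SNS) to PagerDuty.
# The Events API v2 endpoint is real; ROUTING_KEY is a placeholder.
import json
import urllib.request

PAGERDUTY_URL = "https://events.pagerduty.com/v2/enqueue"
ROUTING_KEY = "your-pagerduty-integration-key"  # placeholder


def build_pd_event(alarm: dict, routing_key: str) -> dict:
    """Map a CloudWatch alarm message to a PagerDuty Events v2 payload."""
    triggered = alarm.get("NewStateValue") == "ALARM"
    return {
        "routing_key": routing_key,
        # trigger on ALARM, auto-resolve when the check recovers
        "event_action": "trigger" if triggered else "resolve",
        "dedup_key": alarm.get("AlarmName", "route53-health"),
        "payload": {
            "summary": alarm.get("NewStateReason", "Route 53 health check state change"),
            "source": alarm.get("AlarmName", "route53"),
            "severity": "critical" if triggered else "info",
        },
    }


def lambda_handler(event, context):
    # SNS wraps the CloudWatch alarm JSON inside the message body
    alarm = json.loads(event["Records"][0]["Sns"]["Message"])
    body = json.dumps(build_pd_event(alarm, ROUTING_KEY)).encode()
    req = urllib.request.Request(
        PAGERDUTY_URL, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(req, timeout=10) as resp:
        return {"status": resp.status}
```

Using the alarm name as the `dedup_key` means a flapping health check updates one PagerDuty incident instead of opening a new one on every state change.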
UptimeRobot, on the other hand, has a much simpler interface, is reasonably priced, directly integrates with PagerDuty, and has the upside of allowing the creation of public dashboards. The dashboards are handy when working with various blockchain foundations, where a paid proposal requires uptime reporting.
Governance, or … don’t miss a hard-fork!
When do most validators get slashed for downtime? Network upgrades! We can’t watch the dozens of relevant chat rooms across the half dozen or so chat services all the time. What to do?
This is a simple problem: there is one reliable indicator that an upgrade is coming, and that’s watching the git repository for new versions. There are two ways to do this. GitHub can send notifications via email, which is nice, but they’re easy to lose in the noise of pull-request comments and other messages.
The second is a service called Zapier, which makes it easy to plug multiple steps together into a “zap,” a kind of codeless microservice. I have Zapier configured to watch a project’s release feed (for Polkadot, https://github.com/paritytech/polkadot/releases.atom) and send a PagerDuty alert when a new release appears. The timing can be inconvenient when the developers are in a different time zone, but it’s still worth it on networks like Kusama, which expects a very timely response to new releases and often gives very little advance notice.
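If you would rather self-host than depend on Zapier, the zap boils down to very little code: fetch the `releases.atom` feed, pull the newest entry’s title, and compare it against the last release you saw. The sketch below assumes only the feed URL quoted above; where you persist `last_seen` and how you page out are up to you.

```python
# Self-hosted version of the release-watch "zap": poll a GitHub releases.atom
# feed and report when a new release appears.
import urllib.request
import xml.etree.ElementTree as ET

FEED = "https://github.com/paritytech/polkadot/releases.atom"
ATOM = "{http://www.w3.org/2005/Atom}"  # Atom XML namespace


def latest_release(feed_xml: str) -> str:
    """Return the title of the newest entry in an Atom feed (entries are
    ordered newest-first in GitHub's release feeds)."""
    root = ET.fromstring(feed_xml)
    entry = root.find(f"{ATOM}entry")
    return entry.find(f"{ATOM}title").text if entry is not None else ""


def is_new_release(latest: str, last_seen: str) -> bool:
    """True when the feed's newest entry differs from the last one we saw."""
    return bool(latest) and latest != last_seen


def check(last_seen: str) -> str:
    """Fetch the feed once; print (or page) on a new release, return latest."""
    with urllib.request.urlopen(FEED, timeout=10) as resp:
        latest = latest_release(resp.read().decode())
    if is_new_release(latest, last_seen):
        print(f"New release: {latest}")  # swap for a PagerDuty event
    return latest
```

Run it from cron every few minutes, persisting the return value of `check` between runs, and you have the same behavior as the zap without the third-party dependency.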
Here’s a quick set of links for various tools discussed (no affiliate links):
- AWS’s Route 53