Who crashed Junø?

Todd G
Jul 26, 2022 · 13 min read
Image credit: Agence Olloweb via Unsplash

This article is a validator's-eye view of the April 5th halt of the Juno network. I review some of the findings uncovered along the way and the process of investigating the attack. It does not cover the story of recovering the chain.

[ block pane ] operates a validator on the Juno network. We are not part of the core team, and while we frequently collaborate with Juno’s core-2 team, this is an independent analysis of the incident and does not represent any opinion other than the author’s.

No rest for the weary

It was morning. After being up late, I had finally decided to call it a day at around 5 a.m., hoping to get a few hours of sleep before the impending “Lupercalia” release on the Juno network, when my phone started screaming at me. PagerDuty duly informed me: “ALERT have not seen a new block in 10 minutes juno-1”. That alert is common during an upgrade, and at first I wondered if I had messed up my conversion from local time to UTC. However, my monitoring dashboards showed that I had not missed the upgrade; there were still a few thousand blocks to go.

Server logs showed a series of “AppHash” errors for block 2578098. This error indicates that my node disagreed with my validator peers about what the previous block (2578097) should be; in other words, I was on a fork of the chain, and we could not reach consensus. Sometimes this resolves itself, and the best course of action is to resist the urge to restart servers or, worse, restore from an archive and try to catch back up to the head block. During a consensus failure, it’s easy to make a mistake and double-sign (and get tombstoned), which results in a hefty 5% penalty for delegators.
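
When AppHash errors strike, one quick sanity check is to compare the hash your node computed against what a peer reports for the same height. Here is a minimal sketch of that check, assuming a local Tendermint RPC on port 26657 and a placeholder peer URL:

```python
# Compare the AppHash recorded at the disputed height on two nodes.
# The RPC URLs are placeholders; substitute your own node and a trusted peer.
import requests

HEIGHT = 2578098  # block whose header carries the result of executing 2578097

def app_hash(rpc_url: str, height: int) -> str:
    """Return the app_hash from the block header at `height` via Tendermint RPC."""
    resp = requests.get(f"{rpc_url}/block", params={"height": height}, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]["block"]["header"]["app_hash"]

mine = app_hash("http://localhost:26657", HEIGHT)
peer = app_hash("https://rpc.example-peer.com", HEIGHT)
print("local:", mine)
print("peer :", peer)
print("forked!" if mine != peer else "hashes match")
```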

Initial response

The Juno validators’ Discord channel was abuzz as everyone tried to figure out what was happening. The initial suspicion was that the halt was caused by validators running an incorrectly linked binary. Usually, when building a program, it will use libraries (collections of useful pre-compiled functions) already present on the system. Dynamic linking results in smaller programs and is the way most software is built. Why was linking relevant?

Terra emergency patch

Rewind to Dec. 9, 2021. Terra developers (Terra also uses cosmwasm smart contracts) announced an emergency update to validators on Discord. The upgrade was a binary-only release where the devs had checked no changes into the publicly available Git repository, an exceptional situation for an open blockchain. There was an issue where different dynamically linked libraries would result in different gas usage calculations in specific circumstances. I am not familiar with what was required to exploit this vulnerability, but the result is that validators might not agree on the contents of a block, and the chain could come to a halt. Sound familiar? The solution was to have all validators use the same “terrad” binary statically linked against a library called “musl libc”, a lightweight implementation of the C standard library for Linux.

Terra vulnerability notification in Discord channel

What is the problem with calculating different amounts of gas used? It results in non-deterministic transactions. If any single part of a block is different, validators will reject it. If 33% or more of the voting power disagrees, the chain will stop until more than 67% of the voting power concurs on the same block.
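
To make the arithmetic concrete, here is a toy sketch of the Tendermint commit rule; the voting-power numbers are invented:

```python
# A block commits only when validators holding more than 2/3 of the total
# voting power precommit the same block.
def can_commit(agreeing_power: int, total_power: int) -> bool:
    """True if the agreeing set exceeds the 2/3 voting-power threshold."""
    return 3 * agreeing_power > 2 * total_power

total = 100
print(can_commit(68, total))  # True: more than 2/3 agree, the block commits
print(can_commit(66, total))  # False: 34% computed a different AppHash, the chain halts
```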

Juno “musl libc” patch

Later in December 2021, Juno took the same approach, asking all validators to run the same muslc-based static binary. As long as more than 67% of the voting power used the correct binary, the attack could not succeed.

It’s important to point out that it wasn’t enough to run the correct version of the software (which all of the validators were doing when the chain halted); the binary also had to be statically linked against a specific version of muslc.
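
One lesson I took away is that this check is easy to automate. A minimal sketch, assuming a Linux host with `ldd` available and the binary installed at a path of your choosing:

```python
# Ask `ldd` whether the junod binary is dynamically linked. On a statically
# linked binary, ldd reports "not a dynamic executable" (or "statically
# linked" on some systems). The binary path below is an assumption.
import subprocess

def is_static(binary_path: str) -> bool:
    result = subprocess.run(["ldd", binary_path], capture_output=True, text=True)
    output = (result.stdout + result.stderr).lower()
    return "not a dynamic executable" in output or "statically linked" in output

if not is_static("/usr/local/bin/junod"):
    print("WARNING: junod appears to be dynamically linked; rebuild against muslc")
```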

This was the most plausible explanation for Juno’s chain halting: were 33% or more of the validators running a dynamically linked version? It had also been more than three months since core-2 put out the patch, and a good number of the validators had forgotten that downloading the source and compiling it themselves posed a risk to the network. There is also the case of those who use automation tools such as Ansible, Cloud-init, SaltStack, Terraform, or K8S to automate the process of building nodes. I am sad to admit that between December and April, 100% of my Juno nodes had been rebuilt at least once using Ansible playbooks that did not take the statically linked binary into account.

First attempt to salvage the chain

At this point, it was all hands on deck. The developers and validators were frantically trying to contact all of the other validators to ascertain whether the correct binaries were being used. Ultimately they couldn’t determine the exact number; the last split I remember was 65% versus 31%, making it plausible but not certain that the wrong binary was the only problem. Validators on the wrong version were urged to update their binary and restart, but the chain still could not regain consensus.

It could still be something else, and perhaps a different exploit had been used. Maybe even a 0-day.

Here is where the Juno team’s experience shines through. They changed their recommendation from “let’s fix it right away” to “everybody stop, don’t touch anything; this chain stays halted until at least tomorrow.” Why was this a wise choice? Many people had been up for a long time, it was early morning in Europe, and the chances of human error were increasing. As I mentioned before, trying to restart a chain during a consensus failure is tricky, and during one of the various Juno testnets we had run into exactly this situation. It resulted in the tombstoning of over 33% of the validators; in other words, it killed the chain. The call to step back, triage, analyze, and remediate cleanly was the best decision.

Triage continues

In the background and outside of public view, several different groups started working on various aspects of the recovery and root-cause analysis. One group was working through plausible scenarios for re-launching the chain. Should there be a hard fork? Was it possible to roll back a block (no, it would halt again after tombstoning everyone)? Should they export state and create a new genesis? The implications of each were examined and tested. Others were looking at on-chain data or reverse engineering smart contracts.

Something important to understand about Juno is that it’s very different from most chains. There is no founding company with paid employees; all of those contributing are doing so as volunteers. There’s no overtime clock running for working all night, but that’s what around 30 different people were doing anyway. Anyone who questions Juno’s viability or future would only need to witness this to have their opinion changed.

My first reaction was to extract the pending transactions from my validator’s mempool (because they are ephemeral), and I started looking for anything out of the ordinary. After looking through the first few hundred (more than needed), I determined there was nothing interesting in the queue. Around this time, I was informed that there was one specific suspicious transaction in the last block to be committed and was asked if I would like to join the group investigating it. I was happy to help and honored to have been asked.
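
For anyone curious, grabbing the mempool contents before they vanish looks roughly like the sketch below, using Tendermint’s /unconfirmed_txs RPC endpoint against a local node; the limit and output file name are arbitrary, and decoding the raw transactions into readable messages (which needs the chain’s protobuf definitions) is omitted:

```python
# Pull pending transactions out of a node's mempool before they disappear.
import base64
import json
import requests

RPC = "http://localhost:26657"  # assumed local Tendermint RPC

resp = requests.get(f"{RPC}/unconfirmed_txs", params={"limit": 100}, timeout=10)
resp.raise_for_status()
result = resp.json()["result"]

print(f"{result['n_txs']} of {result['total']} pending txs fetched")
raw_txs = [base64.b64decode(tx) for tx in result.get("txs") or []]

# Persist the raw bytes so they can be decoded later, even after a restart.
with open("mempool_dump.json", "w") as f:
    json.dump([tx.hex() for tx in raw_txs], f, indent=2)
```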

I got lucky. I usually use settings on my nodes that reduce the amount of storage by removing historical state, dropping the indexes that allow searching for specific transactions, and even pruning old blocks. But it was almost tax time, and I needed a full node to track down transactions and handle more complex searches, and the default state pruning had left the node with a few weeks of state data. This state data contains the stored values in the database at any given height, so it allowed checking the values in storage both before and after each suspicious transaction.

I started by dumping all transactions for the account that sent the suspicious transaction. In this case, it was a call to a cosmwasm contract. Looking through all of these transactions yielded some odd patterns but nothing that seemed overly suspicious. I then dumped all transactions calling the contract and found a second account interacting with it. Although the frequency was odd, it still yielded no clues. Finally, I decided to parse out the block height of each transaction and query the contract state at each height, but again found nothing. It was time to pick up the kids from school, so I took a break.
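
Roughly, that workflow looks like the sketch below: list the account’s transactions via Tendermint’s /tx_search, then read the contract’s state as of each height through the REST (LCD) endpoint with the x-cosmos-block-height header. The addresses, endpoints, and the {"get_count": {}} query message are all placeholders, and the exact encoding the gateway expects can vary between SDK versions:

```python
import base64
import json
import requests

RPC = "http://localhost:26657"          # assumed Tendermint RPC endpoint
LCD = "http://localhost:1317"           # assumed Cosmos REST (LCD) endpoint
SENDER = "juno1exampleaccountxxxxxxxx"  # placeholder account address
CONTRACT = "juno1examplecontractxxxxx"  # placeholder contract address

def txs_for_sender(sender: str, page: int = 1) -> dict:
    """List transactions sent by `sender` using Tendermint's /tx_search."""
    params = {
        "query": f"\"message.sender='{sender}'\"",
        "page": str(page),
        "per_page": "100",
        "order_by": "\"asc\"",
    }
    resp = requests.get(f"{RPC}/tx_search", params=params, timeout=30)
    resp.raise_for_status()
    return resp.json()["result"]

def contract_state_at(contract: str, query_msg: dict, height: int) -> dict:
    """Smart-query a cosmwasm contract as of a historical height via the LCD."""
    encoded = base64.urlsafe_b64encode(json.dumps(query_msg).encode()).decode()
    url = f"{LCD}/cosmwasm/wasm/v1/contract/{contract}/smart/{encoded}"
    headers = {"x-cosmos-block-height": str(height)}  # height must still exist in un-pruned state
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return resp.json()

result = txs_for_sender(SENDER)
heights = sorted({int(tx["height"]) for tx in result["txs"]})
for h in heights:
    # {"get_count": {}} is only a guess at a query this contract exposes.
    print(h, contract_state_at(CONTRACT, {"get_count": {}}, h))
```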

I realized on the drive home that I hadn’t grabbed all of the state data: I had missed pulling the last page, and with it 78 transactions. Once I got back, I finished the job, and the results looked something like this:

The state at the height where the chain halted had a new value in “name.” Finally, something interesting! Next, I checked the “name” value on a block explorer, and it differed. It seemed that this contract was the smoking gun we were seeking, and non-determinism was the cause of the chain halting. I brought this to the group handling the reverse engineering of the contract, and the devs noticed two critical things. The first question was: where did the name field come from? The contract calls had appeared to be interacting with the example “hello-world” Terra cosmwasm contract, but that field was not part of that contract. Either there was some misdirection at play, or the attacker had used hello-world as a base. The second and most important fact was that every single node they queried had a different value in the name field.

Juno had split into 125 different networks.

I immediately suspected this was one of the two security vulnerabilities patched in the Lupercalia release. At the time, there were three cosmwasm security issues affecting Juno.

Coincidentally, Halborn (one of my previous employers) had responsibly disclosed another one on the day of the attack. The patch for the third security issue was included when the chain was relaunched.

The second vulnerability

Juno’s testnets have been some of the best learning experiences of all the validator work I’ve done. In December ’21, we launched eight testnets; some imploded spectacularly, some didn’t, but it was a crucible, and the group of validators that slogged through it all came out the other side with some incredible knowledge and experience. We’re talking about some of the top validator teams in the space, like Rhino, Mercury, Lavender Five, King nodes, and a bunch that I forget (sorry). After the end of the year, the testnets had quieted down a bit. I don’t remember the exact dates, but we had recently launched the “uni-2” testnet. One morning I received a familiar sign of trouble from PagerDuty: “ALERT validator has missed 3 blocks on uni-2”.

Three or four validator nodes had dropped at the same time. My validator logs showed “AppHash” errors, indicating that the database was likely toast. I moved my keys over to another node and got back to the business of signing blocks. In my experience, AppHash errors usually occur in one of two situations: a misconfiguration (like pruning state needed for a snapshot) or a bug in the software. I started by digging through logs and found nothing. Meanwhile, I was also chatting in the testnet Discord to see if anyone else knew what had happened. I kept digging and finally came across a call to a smart contract in the block where we dropped a handful of validators.

This transaction had something suspicious in the call data: it was using an ABCI query against historical state. ABCI stands for Application BlockChain Interface and is how the various layers of the Cosmos/Tendermint software stack store or access data. First, from what I had read in the cosmwasm documentation, I didn’t think ABCI queries from a contract were even possible. Second, it’s dangerous to allow a smart contract to interact with historical state. As I mentioned before, most of my nodes are tuned to be lightweight and fast, which involves pruning state, and if I remember correctly, I was pruning very aggressively on this node (every ten blocks).
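
To see why pruning matters here, compare what an archive node and a pruned node return for the same historical query. A rough sketch via Tendermint’s /abci_query endpoint; the node URLs, store path, and key are illustrative placeholders, not the actual values involved in the exploit:

```python
# The same ABCI query at the same height can succeed on an archive node but
# fail (or return nothing) on a node that has already pruned that state --
# a recipe for non-determinism if contracts can reach it.
import requests

HEIGHT = 100                 # some height the pruned node no longer keeps
PATH = "/store/wasm/key"     # raw KV query into the wasm module store
DATA = "03abcdef"            # placeholder hex-encoded store key

def abci_query(rpc_url: str) -> dict:
    params = {"path": f"\"{PATH}\"", "data": f"0x{DATA}", "height": str(HEIGHT)}
    resp = requests.get(f"{rpc_url}/abci_query", params=params, timeout=10)
    resp.raise_for_status()
    return resp.json()["result"]["response"]

for node in ("http://archive-node:26657", "http://pruned-node:26657"):
    r = abci_query(node)
    print(node, "code:", r.get("code"), "value:", r.get("value"), "log:", r.get("log"))
```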

In the testnet Discord, I started asking the other validators who had crashed whether they were also pruning state. Sometimes I’m overly exuberant, and this was one of those cases. I brought up my theory that this was a security vulnerability and that someone was testing exploits against non-deterministic state queries on the testnet. I was asked to stop talking about the vulnerability in a public channel (“the-frey” is always very polite), and I removed my posts. Oops, sorry, guys.

This vulnerability was the primary security fix targeted in “Lupercalia,” and it now took center stage in the hunt for what had caused Juno to halt.

Pivot

The group handling the forensic investigation had some incredibly knowledgeable developers on the team.

Sometimes knowing when to back off, shut up, and cause no distractions is the best course of action for an incident handler. I was in over my head and had little more to contribute to reverse-engineering the smart-contract.

The developers working on reverse engineering the contract were able to identify exactly how the exploit worked and reliably recreate it to test any proposed solutions to prevent a recurrence. It was excellent work, and I’m not going to re-hash their findings here, but both Assaf from Secret Labs and Jack from Strange Love shone. In other words, they had this covered, and I was better suited to contribute elsewhere. It was time for me to pivot to another aspect of incident handling where my experience could make a difference: I started building a timeline of events, studying the attacker’s TTPs, and tracing their funds to try to identify them.

Building an incident timeline is a painstakingly slow process. There aren’t a lot of great forensic tools available for Cosmos chains, so every action had to be reviewed manually to construct a reasonable timeline of events.

Was it malicious? My opinion: absolutely.

The attacker put effort into hiding the source of the funds used to execute the exploit. Despite this effort, it was fairly simple to track the funds back to their source. Their obfuscation attempt involved a couple of decentralized exchanges (Uniswap and Sifchain), a non-KYC exchange (ChangeNow), and IBC transfers or bridges. The sheer number of transfers alone suggests an attempt at making the funds difficult to track (the timeline involves Terra, Anchor, Ethereum, Uniswap, Sifchain, Sif’s DEX, Juno, Akash, and Osmosis). They could have easily just used IBC from Terra to Juno, swapped roughly 20 $UST for $JUNO, and been done with it.

Excerpt from timeline showing transfer to CCN

Another indicator is an outbound transaction sending funds to the controversial CCN organization’s (aka the Juno Whale) airdrop account immediately before instantiating the malicious smart contract. It is the author’s opinion that this is either a poorly executed attempt at misdirection/false attribution or an obvious protest against the Juno Prop16 drama (perhaps a bit of both?).

It’s also possible that mimicking the hello-world contract was another attempt at misdirection, downplaying their activity to hide in plain sight, but that may be giving them too much credit.

What do we know about the attacker?

We know a lot about the attacker’s activities, but not their identity. It does not appear there was any financial benefit, and they may not have known the severity of what would happen. After analyzing their actions and the historical activity of the funding account, here are my assumptions.

The attacker is:

  • likely a developer, probably a Rust developer, and possibly familiar with cosmwasm contract development on Terra and the vulnerabilities remediated there.
  • capable of writing a novel exploit given only the description of a vulnerability, suggesting proficiency with cosmwasm.
  • an avid NFT trader, having bought and sold dozens of NFTs on RandomEarth.
  • very active in the Anchor protocol and able to take significant profits from Anchor and NFTs.
  • primarily using the Kucoin exchange as an off/on ramp for tokens, but also using a second, as-yet-unidentified exchange.
  • either offended by Prop16 or angered by the lack of progress on its implementation.
  • not experienced in obfuscating the movement of funds on the blockchain, and therefore probably not regularly engaged in criminal activity.

Had they contacted the Core-1 team at the outset of the incident, it’s likely that I wouldn’t be publishing this information. They might even have gotten a bounty and probably been able to remain anonymous. But the damage is done; they decided to sit back and watch, so I don’t feel any misgivings about posting what I was able to uncover.

Timeline

Event timeline: force graph showing events over time. Exported from Maltego.

The following list is abbreviated, but the above graphic exported from Maltego (full PDF report available here) shows the relationships and a few of the details I left out.

Note: omitted for brevity is a continuous stream of smart-contract executions calling try_increment and reset between Apr. 2 and Apr. 5; this activity was possibly scripted.

March 31:

April 2:

April 4:

  • Send 1 $JUNO from juno1hxkppd7spnvm5s86z2rfze5pndg9wwee8g9x6v to juno18vmzvz3eym7ky98uq45z02rf5j6dh00cekjlv0
  • juno18vmzvz3eym7ky98uq45z02rf5j6dh00cekjlv0 interacts with contract juno188lvtzkvjjhgzrakha6qdg3zlvps3fz6m0s984e0wrnulq4px9zqhnleye alternating roughly 2/5 with juno1hxkppd7spnvm5s86z2rfze5pndg9wwee8g9x6v

April 5:

  • Send 0.5 $JUNO from juno18vmzvz3eym7ky98uq45z02rf5j6dh00cekjlv0 back to juno1hxkppd7spnvm5s86z2rfze5pndg9wwee8g9x6v
  • Call to the reset function on contract juno188lvtzkvjjhgzrakha6qdg3zlvps3fz6m0s984e0wrnulq4px9zqhnleye from juno1hxkppd7spnvm5s86z2rfze5pndg9wwee8g9x6v results in the chain halt.
