Image from Jan Wildeboer's original article comparing the xz-utils incident and the Crowdstrike fiasco

I liked what Jan Wildeboer had to say about the xz-utils and Crowdstrike incidents so much, I asked Jan’s permission to paste it as a guest blog post right here. I made a couple of minor changes to Jan’s original post: I put headers above each slide, instead of below as in Jan’s original, and I added a couple of words to most headers to help with SEO. Enjoy Jan’s analysis. And learn, just like I did. Care and share to be prepared.

Jan is a fellow Red Hat employee from Munich, Germany. Here is how he describes himself in his original article.

That Open Guy. Transnational citizen. Red Hat’s EMEA Evangelist during the day, societal hacker in the dark. He/Him.

https://jan.wildeboer.net/2024/08/xz-v-crowdstrike-presentation


This is a kind of transcript of a presentation I did on 2024-08-08 internally at Red Hat. Some people expressed interest in that presentation, so I present (pun intended 😉) a raw transcript of the story I told to my fellow Red Hatters during a Lunch & Learn session. Enjoy!

DISCLAIMER These slides and speaker notes express MY PERSONAL OPINION and do NOT reflect in any way Red Hat’s position.

Slide 1

Hello everyone. Some preliminaries. I will not go too deep into technical details in this presentation; if you want to dive deeper, just follow the reference links on slide #11. They can give you hours of pleasure. At least they did for me!

I want to focus on more relevant topics: how these two tales unfolded, how they were reported, what REALLY happened and, finally, what we can learn from them. My name is Jan Wildeboer. 30+ years in Open Source as user, developer and community person. Let’s go, shall we?

xz-utils – Slide 2

We will start with the XZ-Utils tale. It spans many years but involves just a few players, which is what makes this complex story of open source technology, social engineering and a movie-worthy, last-minute catastrophic meltdown caused by a surprise guest so interesting and impressive.

xz-utils – Slide 3

So what is XZ-Utils? A little project that does compression and decompression of content, typically files. That’s it. It’s one of those little projects that works in the background, reliably, unobtrusively, without much drama. It started back in 2005, the first release of the xz format was in 2009, and the main (and for a long time only) developer was and is Lasse Collin.

For many years, xz-utils was rock-solid, needed little attention and Lasse was the good guy keeping it all running. A little quirk of Lasse is that he sometimes more or less disappears for a while, because computers and software aren’t the only important things in life. Sometimes you want or have to take care of other things. Which is perfectly fine, even a bit adorable, in my humble opinion.

Anyway. Things were about to change, starting around 2021.

xz-utils – Slide 4

We know a lot more today, especially because Lasse made sure he shared everything he could reconstruct from memory and code history. Again, as is typical, he took his time to make sure everything he says is correct. He refused interviews. No hype. Facts only. So.

In 2021 a person (or maybe it was more than one person? We just don’t know) called Jia Tan showed up, suggesting some little patches, nothing weird. Lasse looked at them, accepted them into the xz-utils code. Just another normal day in Open Source.

Little did Lasse know at that time that a storm was brewing. A storm that started picking up speed in 2022. Remember, Lasse had quite a laissez-faire attitude towards xz-utils. Progress was slow, and that was by (his) design. But suddenly some people (or were they part of a team? Sock puppet accounts from Jia Tan? We don’t know) started putting pressure on Lasse. Complaining about the slow progress. About the lack of support for some use cases. Pushing for adding a new maintainer to the project.

Lasse listened to the critics, admitted things could be done a bit differently and that maybe this guy who sent some patches, Jia Tan, could play a more important role. Again, still business as usual in Open Source.

So in 2023, Jia Tan became a bit of a rising star in this small community. Lasse handed him more responsibilities, gave him control of the GitHub repo (which was NOT the main repo at the time; that was still Lasse’s Tukaani one) and Jia Tan added himself to more and more parts of the project and integrated new connections that he controlled, but that was about it, seemingly. It still looked like business as usual. Jia Tan helped, Lasse did his thing, all is good.

xz-utils – Slide 5

But that was all about to change. In February 2024, so 3 years after Jia Tan showed up for the first time, weird things started to happen. But hidden very well. You see, projects like xz-utils are not just a simple collection of source code, documentation and some nice web pages; they serve the needs of distinct target groups. Mostly users. But also distro maintainers, tasked with making packages for Debian, Fedora, Red Hat and many, many more distributions out there. XZ-utils thus had a few special files and build artifacts, purely of interest to those package maintainers. They were not part of the normal distribution aimed at users, but they were still part of the project.

And exactly there is where Jia Tan did his evil things. And he (or the team, again, we will never know) hid it well. He planted innocent “test files” that actually, through a complicated set of operations, turned into the backdoor being inserted into the code. But ONLY if you used the package maintainer path. Even more precisely: only if you were building deb packages for Debian or rpm packages for Fedora or RHEL on 64-bit Intel/AMD.
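
To make that gating idea concrete, here is a minimal Python sketch of the logic. To be clear: the real attack was hidden in obfuscated shell fragments inside binary test files and a modified m4 build script, not in anything this readable, and the environment checks shown here are invented stand-ins for the real ones.

```python
# Illustrative sketch only -- NOT the actual xz-utils attack code.
# It mimics the gating that made the backdoor so hard to spot: the payload
# only activated on a very specific packaging path.
import os
import platform


def building_distro_package() -> bool:
    # Stand-in check: the real script looked for signs of a Debian (deb)
    # or RPM build environment in the configure/build machinery.
    return bool(os.environ.get("RPM_PACKAGE_NAME") or os.environ.get("DEB_BUILD_ARCH"))


def inject_payload_if_targeted() -> None:
    if platform.machine() != "x86_64":
        return  # only 64-bit Intel/AMD was targeted
    if not building_distro_package():
        return  # normal source builds by end users stayed clean
    # At this point the real attack spliced a malicious object file into the
    # liblzma build. This harmless sketch only prints a message.
    print("targeted packaging build detected -- this is where the backdoor went in")


if __name__ == "__main__":
    inject_payload_if_targeted()
```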

Package maintainers typically maintain A LOT of packages, so this wasn’t immediately obvious. But it had side effects. That were noticed. Jia Tan (I suppose) observed these side effects carefully and acted to hide his trail even better. He blamed some weird checking functions and declared it would be better to remove those tests.

But some were not convinced. And, for unrelated reasons, systemd, a possibly important vector for his attack, decided to remove the dependency on liblzma (part of xz-utils) completely. The attack window began to close. So Jia Tan began to build pressure on maintainers to include the 5.6.1 version of his (backdoored) code before the systemd change found its way into the big distributions.

Suddenly the other actors (or were they just sock puppet accounts?) joined the movement. Debian and Fedora maintainers were pummeled with requests to update to 5.6.1 really fast, because … reasons. The backdoored package found its way into debian-testing, Fedora Rawhide and Fedora 40. Testing releases. Known to have problems, because that’s what they are for: to find problems and solve them before the next official release of a distribution. If it survived this stage, it would suddenly be everywhere. Yay for Jia Tan, bad for all of us.

And then the surprise guest enters. Andres Freund. A PostgreSQL developer, employed by Microsoft. He was running some benchmarks on his debian-testing machine and wondered about a weird change in behaviour. His machine was connected directly to the internet, which, as we admins know, means you get hammered by loads of automated scripts trying to “hack” into your machine via ssh. Hundreds of attempts per hour. It’s a kind of background noise. These connection requests are normally rejected in a matter of milliseconds, with almost no impact. But not on Andres’ machine. The attempts used a lot of resources and even caused crashes. Something was wrong.

So he dived deep. Because the whole stack is Open Source, he could observe every step and analyse what the hell was going on. And he found the backdoor. Wow. Astonished as he was, he wrote down his findings and, knowing he might be on to something big, shared them on 2024-03-28 with distros@openwall, an old mailing list where all big and small Linux distros participate to share findings about possible problems. And the distros@ people understood what was happening here. They got the risk. They reacted immediately. Within 24 hours, Debian, Red Hat and others rolled back to a previous version of xz-utils. Analysed the impact. Wrote the CVE and informed the world.

Catastrophe averted. Because of Open. Because of communication. Because of expertise. Wow. Again.

So a more or less happy ending. Lasse, shocked by what happened to his project, started cleaning up the mess. Took back control. Analysed what happened. The media went into a frenzy. “Linux is hacked!” “Open Source is insecure!” and a lot more. People with no real knowledge of the facts shared their ever more radical takes. Calls for government control (and/or money).

But the important fact remains. It was solved BEFORE it hit the big stage. Catastrophe averted. Let’s keep that in mind when we switch to the second tale. Crowdstrike.

Crowdstrike – Slide 6

So xz-utils was the trainwreck that didn’t happen. But a few months later a 78-minute trainwreck happened, and it had ripple effects that went far beyond, IMHO, the Y2K fears of 24 years ago, which also, just like xz-utils, didn’t really materialise, because a LOT of people spent countless hours fixing and testing. But in July 2024, a few things went wrong, causing a global outage that no one expected. Crowdstrike.

Crowdstrike – Slide 7

Crowdstrike is a company, founded in 2011, that offers, among other services, software called the Falcon Sensor that runs on Linux, macOS and Windows. It’s a package that goes deep into the system to analyse incoming network traffic, compare it to a collection of known attack patterns and, when an attack pattern is identified, block that attack and inform the admins of the fact.

So we are talking about complex pattern matching on a very low level. Deep in the kernel of the operating system. These patterns are distributed from Crowdstrike to subscribed systems as fast as possible. As soon as a new attack pattern is identified, the rules and regexes (regular expressions) are modified, extended and, after some integration testing on Crowdstrike’s side, sent to the subscribed systems to make sure they are protected.
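
As a toy illustration of that model (not Crowdstrike’s actual channel-file format, which is proprietary and parsed in kernel mode), you can think of a content update as a list of regular expressions that a resident sensor compiles and matches against observed events:

```python
# Toy illustration of signature-style matching, in ordinary user-space Python.
import re

# Pretend this list just arrived as an automatic content update from the vendor.
channel_update = [
    r"SELECT\s+.*\s+FROM\s+.*--",             # crude SQL-injection pattern
    r"\.\./\.\./\.\./etc/passwd",             # path traversal attempt
    r"powershell\s+-enc\s+[A-Za-z0-9+/=]+",   # encoded PowerShell one-liner
]


def scan(event: str) -> list[str]:
    """Return every pattern that matches the observed event."""
    compiled = [re.compile(p, re.IGNORECASE) for p in channel_update]
    return [c.pattern for c in compiled if c.search(event)]


if __name__ == "__main__":
    event = "GET /index.php?id=1 SELECT name FROM users--"
    hits = scan(event)
    if hits:
        # A real sensor would block the connection and alert the admins.
        print("suspicious event, matched patterns:", hits)
```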

This all happens automatically. In a matter of minutes. That’s one of the unique selling points of Crowdstrike. But it’s a complex system. So the typical customer isn’t the private user. Crowdstrike focuses on big companies, with a lot of endpoints at a lot of places. It also isn’t cheap. So airlines, Fortune 500 companies use Crowdstrike. And they trust them to do things right. That trust was about to be tested.

Crowdstrike – Slide 8

Deep inside the architecture of the Crowdstrike Falcon Sensor for Windows is a method to translate templates into actual sensors. This system takes in a collection of templates with many parameters, which get translated by the sensor software on the endpoint system into code that can detect the attack described by the template. So far, so good. These templates were extended in February 2024 to have up to 21 parameters. But. The automated testing that happens before distribution of these templates could only handle 20 parameters.

Someone forgot to keep the testing in sync with the templates. But as the templates never used the 21st parameter, this oopsie went unnoticed. It also seems that Crowdstrike relied on the automated, abstracted integration testing and didn’t do a final test on a real system.

You can guess what happened next.

On the 19th of July, 2024, a template was added to channel file 291 for Windows that, for the first time, used the 21st parameter. It ran through the automated integration tests that only checked 20 parameters, all looked good, so the update was distributed to Windows machines all over the planet. And there it crashed. The 21st parameter led to a null pointer dereference after the update was installed. The Windows kernel crashed. Crowdstrike noticed the problem. Went into analysis mode. Found the problem with the 21st parameter and produced another update to fix the problem.
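
A heavily simplified sketch of that failure mode, in Python rather than the kernel-mode code of the real sensor, and with data structures invented for this example: the pre-release validator silently ignores everything past the 20th field, so a template that relies on a 21st field sails through testing and only blows up when the sensor on the endpoint actually reads it.

```python
# Heavily simplified sketch of the 20-vs-21 parameter mismatch. An uncaught
# Python exception stands in here for the null-pointer read that crashed the
# Windows kernel.

EXPECTED_FIELDS = 21    # the template format was extended in February 2024 ...
VALIDATED_FIELDS = 20   # ... but the pre-distribution validator was never updated


def validate(template: list[str]) -> bool:
    """Pre-distribution check: only ever looks at the first 20 fields."""
    return all(field is not None for field in template[:VALIDATED_FIELDS])


def sensor_load(template: list[str]) -> None:
    """On the endpoint: reads ALL fields, including the 21st."""
    for i in range(EXPECTED_FIELDS):
        value = template[i]   # the 21st read fails -- the data isn't there
        _ = value.strip()


# Channel file 291: the first template that actually relied on field 21,
# but the content that shipped only carried 20 usable values.
broken_template = ["param"] * 20

assert validate(broken_template)      # passes automated integration testing
try:
    sensor_load(broken_template)      # crashes on every endpoint that loads it
except IndexError:
    print("sensor crashed -- on Windows this meant a kernel crash and a boot loop")
```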

But it was already too late.

In the 78 minutes between publishing the first update and the second one, 8.5 million systems had used the broken update. And they couldn’t simply be fixed by a reboot and catching the new update. Because, obviously, the system would try to load the current, broken, version BEFORE trying to get an update, because security!

The ripple effects were catastrophic. Endpoint systems were stuck in a boot loop. Airlines couldn’t check in passengers. Corporate laptops of WFH (working from home) employees didn’t start up. And admins everywhere were struggling to find out what the hell was going on.

Crowdstrike inadvertently added to the chaos by hiding their analysis behind their customer paywall. Many of these endpoint systems were not managed by their direct customers (like airlines) but by outsourced, local service companies (e.g. at hundreds of airports) who simply didn’t have access to the information hiding behind Crowdstrike’s paywall.

Once the problem was identified, resolving it turned out to be a complex task. Remote updates didn’t work; you had to physically go to these machines and try to get the broken 291 update removed or replaced. Many of these systems used BitLocker hard drive encryption. So now you also needed to get the system-specific keys, go to the affected system, boot it in rescue mode, enter the BitLocker key and restore the Windows box to a working state.

This meant that you hopefully had a backup of the system-specific BitLocker key. Admins got creative. They printed out the keys as QR codes. Organised QR-code scanners with a USB connection that simulated keyboard input. So they walked around at airports, noted the identifier of the endpoint, found the correct QR code in their paperwork, connected the scanner to the system, booted it into rescue mode, waited until it asked for the key, scanned the QR code and restored the Windows box to a working state.
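
As a sketch of that improvisation: assuming the recovery keys are available as a simple two-column CSV export (machine name and 48-digit recovery key; the file name and format here are made up), generating printable QR codes with the common qrcode Python package could look roughly like this:

```python
# Turn each machine's BitLocker recovery key into a QR-code image that a USB
# scanner (acting as a keyboard) can "type" into the recovery prompt.
import csv

import qrcode  # pip install "qrcode[pil]"

# Hypothetical export: one row per endpoint, "machine_name,recovery_key".
with open("bitlocker_keys.csv", newline="") as f:
    for machine, recovery_key in csv.reader(f):
        img = qrcode.make(recovery_key)    # encode only the 48-digit key itself
        img.save(f"{machine}.png")         # print these out, one per endpoint
        print(f"wrote {machine}.png")
```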

What. A. Mess. Globally. Because of an Oopsie between the developers and the Quality Assurance teams. Immediate impact. Immediate catastrophe.

Ultimately fixed. But compared to XZ-Utils, this catastrophe happened. It couldn’t be avoided at the time. Crowdstrike has promised to fix their mistakes and they promised to make sure it can never happen again.

xz-utils and Crowdstrike – Slide 9

So. Here we are. Two tales of two different worlds. Both stories ultimately ended in a good way. Solutions were found. In both cases at a high price. In one case in a mostly immaterial way. Trust was tested, but catastrophe was averted. In the other case, the catastrophe happened. But can this be reduced to a binary story? Open is good, proprietary is bad?

My conclusion is a clear and loud NO.

What we can learn – Slide 10

Open Source isn’t a magic solution. Proprietary isn’t always a bad thing. It’s not about licenses. It’s about communication, feedback loops and making sure it is about the truth, no matter how much it hurts to document and openly admit to mistakes. Learning from mistakes is a process, not a final result. Money or authoritarian oversight by governments isn’t the simple solution some make it seem to be. Tech reporting is driven by clickbait, all too often.

We should learn in the open. Regardless of licenses and investor interests. And we will make mistakes. We will fail. But by being open about it and listening to people in the know on how to be better, we can become better and better with every mistake made. Is all. It sounds simple (and it actually is, IMHO), but we need to accept that failure will happen. That we need to be ready to fix whatever happens. That’s the hard lesson.

References – Slide 11

Here are Jan’s references as links. See his source page.