Understanding the Truck Factor
At SRECon EMEA, I came across the concept of the Truck Factor, something new to me and, judging by the crowd's reaction, new to most of the 500-strong audience as well. In simple terms, the Truck Factor is the minimum number of engineers who would have to leave (or get hit by a truck) before the loss of knowledge continuity brings software velocity to a grinding halt. A more popular term for this is the Bus Factor, though it's not clear if we should thank transit authorities for that one.
The Truck Factor isn't just a theory; it has teeth. When those key engineers vanish, projects stall, onboarding new engineers becomes a marathon, and error rates across systems skyrocket. Companies like Google and Microsoft have explored this phenomenon, but the research that really drives the point home comes from Avelino et al. In their analysis of 87 systems, they found that in 65% of cases, only one or two people held critical knowledge of the codebase. And when they surveyed developers from 67 projects, a whopping 84% confirmed that these “main authors” did indeed possess the bulk of the knowledge. Two people. Just two. No pressure.
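For the curious, the estimation idea behind those numbers is mechanical enough to sketch. Below is a simplified Python version of the greedy approach from Avelino et al., assuming a pre-extracted change log of (developer, file, created-the-file) records rather than real git mining; the degree-of-authorship weights and thresholds are the ones the paper builds on.

```python
import math
from collections import defaultdict

def doa(first_author, own_changes, other_changes):
    # Degree of authorship (Fritz et al.), the scoring function Avelino
    # et al. build on: creating the file dominates, your own changes add
    # a little, and changes by others slowly erode your authorship.
    return (3.293 + 1.098 * first_author + 0.164 * own_changes
            - 0.321 * math.log(1 + other_changes))

def truck_factor(history):
    """history: iterable of (developer, path, created) change records."""
    changes = defaultdict(int)   # (dev, path) -> changes by that dev
    total = defaultdict(int)     # path -> changes by anyone
    creator = {}                 # path -> first author
    for dev, path, created in history:
        changes[(dev, path)] += 1
        total[path] += 1
        if created:
            creator.setdefault(path, dev)

    # A developer "authors" a file when their DOA clears the paper's
    # thresholds: above 3.293 absolute, within 75% of the file's best.
    scores = {(d, p): doa(creator.get(p) == d, n, total[p] - n)
              for (d, p), n in changes.items()}
    best = defaultdict(float)
    for (d, p), s in scores.items():
        best[p] = max(best[p], s)
    authors = defaultdict(set)   # path -> current authors
    for (d, p), s in scores.items():
        if s > 3.293 and s >= 0.75 * best[p]:
            authors[p].add(d)

    # Greedily drop the most prolific author until more than half the
    # files are orphaned; the number of drops is the Truck Factor.
    files, removed = set(total), 0
    while sum(1 for p in files if not authors[p]) <= len(files) / 2:
        coverage = defaultdict(int)
        for p in files:
            for d in authors[p]:
                coverage[d] += 1
        if not coverage:
            break
        top = max(coverage, key=coverage.get)
        for p in files:
            authors[p].discard(top)
        removed += 1
    return removed
```

Feeding it real history means mining something like `git log --numstat`, and the paper applies cleanup (generated files, author aliases) that this sketch skips.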
Mostly an Engineering Problem
Is this uniquely an engineering problem? Well, yes, it mostly is. Other departments have knowledge continuity more or less figured out. Sales, for instance, has Customer Relationship Management (CRM) systems, making it relatively easy for another salesperson to step in without missing a beat. In sales, knowledge variability is tamed by structured data logged meticulously into the CRM. In software development, the knowledge silos are far more specific: "Yanish is responsible for the installer, ask Yanish. Jordan is the expert on the aggregation service, ask Jordan."
A lack of documentation is often blamed as the problem, and more documentation hailed as the solution. But let's be honest: getting developers to write comprehensive, up-to-date documentation is like trying to get a cat to take a bath. No one wants to spend 30 minutes writing docs when they could be diving into the next coding task. So documentation gets slapped together as an afterthought, scattered across Confluence, GitHub, Notion, or Google Docs, and steadily loses relevance and accuracy over time.

Observability tools? Sure, they can provide some help. But these tools are often fragmented, costly, and lack interoperability across environments and regions. Monitoring production systems can shed light on some aspects of system behavior but rarely offers the kind of holistic view needed to make knowledge truly transferable. It doesn't help that no one seems to own this problem. It's the kind of operational risk managers love to kick down the road.
When multiple key engineers leave in quick succession, it sets off a chain reaction. Those left behind shoulder the extra responsibilities, stress mounts, and soon they leave too. The R&D department ends up in a tailspin, feature delivery slows to a crawl, and customer churn creeps up. Executives scramble to spin this to investors, buying time by promising a return to form once the issue is addressed. But it never quite is. The engineer churn continues, and before long the product is an irreparable mess, ready for a corporate garage sale. Sound familiar? It should. Welcome to the new world of SaaS, a place littered with MySQL wrappers and well-intentioned broken dreams.
The silver lining? Truck Factor issues often come with early warning signs. Are your principal engineers constantly being dragged into day-to-day firefighting? Does your on-call team routinely defer to “Mary the Savior” for every P0 incident? Relying heavily on subject matter experts for routine crises is like building a house on sand. Yes, these engineers are amazing, but wouldn’t it be better if everyone on the team could grasp what’s going wrong and fix it, too? Besides, the emotional toll on these key individuals is unsustainable.
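One crude but telling check is to pull your incident history and measure how concentrated resolution actually is. A minimal sketch, assuming a hypothetical export of (severity, resolver) records; the shape is illustrative, not any tracker's real schema.

```python
from collections import Counter

def p0_concentration(incidents):
    """incidents: iterable of (severity, resolver) pairs, e.g. a CSV
    export from your incident tracker (hypothetical shape)."""
    resolvers = Counter(who for sev, who in incidents if sev == "P0")
    total = sum(resolvers.values())
    name, count = resolvers.most_common(1)[0]
    return name, count / total

# If this keeps printing the same name with a big share, your on-call
# truck factor is effectively one.
name, share = p0_concentration([("P0", "Mary"), ("P1", "Ali"), ("P0", "Mary")])
print(name, f"{share:.0%}")  # -> Mary 100%
```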
This raises a critical question: How can we make system knowledge more accessible?
Addressing the Truck Factor
The current industry favorite is the Internal Developer Portal (IDP). I agree these portals can be game-changers, but only if we lay the right foundation. The first pillar in that foundation is, surprise, knowledge itself. And not just any knowledge, but an understanding of how production systems operate, starting from the code and working outward. Traditional Application Performance Monitoring (APM) takes an infrastructure view, but it doesn't tell the whole story. We need to look at the whole service lifecycle: from initial commit, through the CI/CD pipeline, and into production. There's a wealth of context to be mined here.
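To give a taste of how cheap some of that context is to mine: a production stack frame plus `git blame` already answers "who last touched this line?". A rough sketch under that assumption; a real pipeline would cache the result and join it with commit, PR, and deploy metadata.

```python
import subprocess

def last_touched(path: str, line: int) -> str:
    """Map a file:line (say, from a production stack trace) to the
    commit and author that last changed it, via `git blame`."""
    out = subprocess.run(
        ["git", "blame", "--porcelain", "-L", f"{line},{line}", path],
        capture_output=True, text=True, check=True,
    ).stdout
    sha = out.split(maxsplit=1)[0][:10]          # abbreviated commit hash
    author = next(l.split(" ", 1)[1]             # "author Jane Doe" line
                  for l in out.splitlines() if l.startswith("author "))
    return f"{author} ({sha})"
```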
By helping developers understand the behavior of their code in a real-world setting, we make knowledge more relatable. Remember, developers spend 70% of their time understanding code and only 5% actually writing it. While IDPs can aggregate and display static documentation, connecting the dots still requires a deeper system understanding. What if we could automate this process? Imagine on-demand API documentation that updates itself with rich, production-related context. Sign me up.
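To make that concrete, here's one shape such self-updating documentation could take: static descriptions from the codebase merged with observed production figures. The record shape, field names, and sample values below are all invented for illustration.

```python
from dataclasses import dataclass

@dataclass
class EndpointDoc:
    path: str                 # static: from code or an OpenAPI spec
    description: str          # static: from docstrings
    p95_latency_ms: float     # observed: from production telemetry
    calls_per_min: float      # observed
    common_errors: list[str]  # observed

def render(doc: EndpointDoc) -> str:
    errors = ", ".join(doc.common_errors) or "none observed"
    return (
        f"{doc.path}\n{doc.description}\n"
        f"  p95 latency: {doc.p95_latency_ms:.0f} ms | "
        f"traffic: {doc.calls_per_min:.0f} req/min | "
        f"frequent errors: {errors}"
    )

print(render(EndpointDoc("/v1/invoices", "Aggregates invoice line items.",
                         412.0, 880.0, ["504 upstream timeout"])))
```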
The final piece of the puzzle is the IDP itself. While these platforms reduce cognitive load by centralizing key information, they shouldn’t be relied upon as knowledge generators. They’re there to facilitate, not solve.
In a world of microservices, we may be saddled with operational complexity, but we’re also presented with a unique opportunity to truly understand our systems, an opportunity that wasn’t as visible in the age of monoliths. Most organizations lack the luxury of unlimited resources to achieve this easily, but investing in developer-centric observability can go a long way toward democratizing production knowledge.
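In practice, "developer-centric" can start as small as naming spans after the functions people actually read. A minimal OpenTelemetry sketch in Python; the service name and attributes are made up, and a configured SDK/exporter is assumed if you want the spans to go anywhere.

```python
from opentelemetry import trace

tracer = trace.get_tracer("billing-service")  # hypothetical service name

def fetch_invoices(account_id: str) -> list:
    return []  # stand-in for real data access

def aggregate_invoices(account_id: str) -> list:
    # A span named after the function ties the production trace straight
    # back to the code a developer would open first.
    with tracer.start_as_current_span("aggregate_invoices") as span:
        span.set_attribute("account.id", account_id)
        invoices = fetch_invoices(account_id)
        span.set_attribute("invoice.count", len(invoices))
        return invoices
```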
So, keep your engineers safe and happy. Failing that, invest in making your production systems understandable from the code up.
References: G. Avelino, L. Passos, A. Hora, M. T. Valente, "A Novel Approach for Estimating Truck Factors", IEEE International Conference on Program Comprehension (ICPC), 2016.