Errors Are Not Failure

Author
Theodore Vaida
Founder/Chief Technology Officer

Errors are not failure, unless you make them behave that way. #

The Concept of Failure? #

In the past several years as Exact Assembly expanded to a team, I’ve spent a majority of my time working with people instead of just writing code or building boards. That is of course what’s expected of managers and business owners, but for someone whose education and life focus has been technology, it’s a stark reminder that complex systems are often non-linear in unexpected ways. Often this is because of differing expectations between vendors and customers; it’s not much fun to find out you’ve been depending on something your upstream suppliers never committed to, or that your client wildly over-estimated what a deliverable can do.

In our industry even the ‘simplest’ project involves multiple technologies and suppliers, and an intensive process of product ideation and requirements capture to go from an idea to something actionable. After the kickoff, every project has an element of Rumsfeldian unknown-unknowns to discover and defeat. While many of the gremlins might ultimately be rooted in math-like solvable problems, there are often multiple aspects that make the solutions non-technical in nature. If an issue comes to light because people have misread each other, made unreasonable assumptions, or steam-rolled over cherished boundaries, then no amount of code wizardry can fix it. The results range from mild annoyance up to full-scale “executive level” discussions where contracts are on the line and people can end up feeling run over. In extreme cases, as in the industry’s latest rounds of RIFs or the banking industry turmoil, it can mean jobs. All of this can be tremendously disruptive to productivity, hurt technical progress and stifle creativity.

Since these kinds of problems are, at their heart, about people, there’s no CAD tool or programming language to solve them; instead you get inputs via emotional cues, and operate through influence and trust - in essence, meta-communication. In computation this is conceptualized as “Out Of Band” management, except that with people these pathways operate in concert with, but not always subordinate to, the more obvious domain of logic. As such, the rules are malleable - how to discern meanings, how to agree on assumptions, and how to recognize and respect boundaries - these can be drastically different between people, teams and organizations.

That makes the learning environment ‘wicked’:

In wicked domains, the rules of the game are often unclear or incomplete, there may or may not be repetitive patterns and they may not be obvious, and feedback is often delayed, inaccurate, or both.

Epstein, David J.1

Wicked domains are everywhere, and they defeat all kinds of strategies that try to pre-plan and predict everything. But there is a powerful tool that programmers learn (usually the hard way) which can conquer these domains, turning mismatched expectations from crisis into opportunity. Amazingly, many highly talented technical people hardly notice it, and rarely make use of it in a determined way. In both code and in life, this essential tool is treated as failure.

I first encountered this idea well formulated in another domain - sports psychology - where great and accomplished athletes exhibit a very different reaction to a not-so-dissimilar environment. These wicked domains are where we discover the necessity of ‘Grit’, which Angela Duckworth calls the thing that separates leaders and winners from the pack. In this case, the feedback you get from people when dealing with complex technological projects, particularly negative feedback, might be the most important data you can get. If you believe Duckworth, experts become experts because they treat that feedback as a useful kind of data:

As soon as possible, experts hungrily seek feedback on how they did. Necessarily, much of that feedback is negative. This means that experts are more interested in what they did wrong—so they can fix it—than what they did right. The active processing of this feedback is as essential as its immediacy.

Duckworth, Angela. Grit2

However, what I’ve seen in industry is that almost everyone at some point has been trained to see negative feedback… negatively, discouragingly - as a sign of failure. Yet if Epstein and Duckworth are right, you gain more from negative feedback and ‘failed’ experiments than from easy success. Again, contrast this viewpoint with the sanguine relationship that exists at the top levels of another brutal and ‘wicked’ domain - sports. In recent news, an NBA star was asked whether a loss in the 2nd round of the playoffs was a failure, and his response, while controversial to fans, was a near-perfect summation of the situation.

That leads to the question: Can we in technology train to turn negative feedback to our advantage? Is it possible to change the way we think about negative events so that they are something other than failure?

Prediction is Hard #

In his 2007 book The Black Swan: The Impact of the Highly Improbable3, author Nassim Nicholas Taleb tries to illuminate a common human heuristic that can have disastrous results: expecting the future to be predictable from past events. If the only swans you’ve ever seen in your neighborhood pond are white, a black swan appearing one day would seem to be an entirely unpredictable event. In reality there is missing information - the true distribution of swan pigments - but most of the time you can take your personal experience with ponds and birds as a good model. This space, where experience can substitute for absolute knowledge, he calls ‘Mediocristan’ - where events can be said to follow some nice orderly mathematical distribution like a Gaussian model.

On the other hand, the real universe can deal some nasty surprises, from civil wars to terrorist attacks; while unusual and unlikely, they really do happen, and we rarely predict them accurately. In a world where extreme events occur (Taleb calls this ‘Extremistan’), a nice orderly little Gaussian statistical prediction can be woefully inaccurate, even dangerous. Depending on these models can have disastrous results, as in the global financial meltdown of 2008, where the financial models themselves were a culprit in bankruptcy and dysfunction.
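
To make the contrast concrete, here is a minimal numerical sketch (my own illustration, not an example from Taleb; the gaussian_tail and pareto_tail helpers and the alpha = 2 parameter are arbitrary choices, and the two scales are not strictly commensurable). The point is the shape of the curves: the Gaussian model rates a “10” event as essentially impossible, while a simple power-law model still gives it about a 1% chance.

#include <cmath>
#include <cstdio>

// Survival function of the standard normal: P(Z > x)
double gaussian_tail(double x) {
  return 0.5 * std::erfc(x / std::sqrt(2.0));
}

// Survival function of a Pareto distribution with minimum 1: P(X > x)
double pareto_tail(double x, double alpha) {
  return std::pow(1.0 / x, alpha);
}

int main() {
  const double thresholds[] = {2.0, 5.0, 10.0};
  for (double x : thresholds) {
    std::printf("x = %4.1f   Gaussian tail = %.3e   Pareto(alpha=2) tail = %.3e\n",
                x, gaussian_tail(x), pareto_tail(x, 2.0));
  }
  return 0;
}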

In a follow-on book, Antifragile: Things That Gain from Disorder4, Taleb looks at systems that can successfully survive black swans and other kinds of negative events. From biological systems to financial hedging, Taleb analyzes systems that respond to external stresses in creative ways to make themselves resilient. A very common tactic, not at all unlike Duckworth’s idea, is to create signals corresponding to stress and to incorporate those signals into preparing for tomorrow. You might recognize this in strength training: your body detects the stress of weight-lifting and signals muscular stress, which is resolved by adding muscle mass - in effect becoming stronger from the stress, or “Anti-Fragile”.
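
As a software analogy (my own sketch, not an example from the book), imagine a queue that treats every overflow error as a stress signal and grows its capacity in response, so each ‘failure’ leaves it better prepared for the next burst. The AdaptiveQueue class and its growth rule below are purely illustrative:

#include <cstddef>
#include <deque>
#include <stdexcept>

// A queue that adapts to stress: overflow errors are still reported,
// but each one also triggers a capacity increase for next time.
class AdaptiveQueue {
public:
  explicit AdaptiveQueue(std::size_t capacity) : capacity_(capacity) {}

  void push(int value) {
    if (items_.size() >= capacity_) {
      ++overflows_;
      capacity_ += capacity_ / 2 + 1;  // grow before the next burst arrives
      throw std::overflow_error("queue full; capacity increased for the future");
    }
    items_.push_back(value);
  }

  std::size_t capacity() const { return capacity_; }
  std::size_t overflows() const { return overflows_; }

private:
  std::deque<int> items_;
  std::size_t capacity_;
  std::size_t overflows_ = 0;
};

int main() {
  AdaptiveQueue q(2);
  for (int i = 0; i < 6; ++i) {
    try {
      q.push(i);
    } catch (const std::overflow_error&) {
      // The error is feedback, not failure: the queue is already bigger.
    }
  }
  return 0;  // q.capacity() has grown from 2 in response to the overflows
}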

The Concept of Error #

Take a moment to consider the humble error message.

File not found.

Errors really are just another kind of feedback. In fact they are very useful information, but like many other kinds of information, using them successfully means being prepared to process them. One design pattern all programmers are familiar with is the idea of a NULL return value signalling an error. Using that return value without checking it is unpredictable and usually catastrophic, but some careful code writing allows you to explicitly check for invalid returns:

p = open_some_file(filename);   // returns NULL on error and sets errno
if (NULL == p) {
  // ERROR!! HANDLE IT
  switch (errno) {
    case ENOENT:
      // handle the missing-file case
      break;
    default:
      // handle any error we did not forecast
      break;
  }
  exit(EXIT_FAILURE);
}

Note that there are three steps required for this to work: first, an explicit forecast of the possible kinds of errors; second, a real-time error-detection mechanism; and finally, in the error case, a non-trivial effort to deal with the consequences. In the ‘real-world’ domain this is analogous to a “post mortem”. Some companies employ this very same tactic - often in a ritualized form with checklists and meetings - though I’m skeptical that such heavy-handed processes actually make the available feedback useful.

With the benefit of many decades of systems programming, one might conclude that this “predict/detect/decode” paradigm is intrusive and overbearing. Older and wiser programmers have already seen this and enriched the world of programming languages: in modern ‘safe’ languages, mandatory prediction is replaced with something more flexible. Starting with C++, we see a much more flexible way to detect and deal with unexpected results: the ‘exception’ mechanism:

try {
  p = open_some_file(filename);
}
catch (const FileNotFoundError& e) {
  // handle the specific problem we anticipated
}
catch (const GenericException& e) {
  // gracefully handle something unexpected
}

Under the errno paradigm, errors have to be defined in advance. In our exception case, we have an opportunity to gracefully handle any non-fatal error in some way, even if we don’t know what that error type might be in advance.
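
As a concrete sketch of that catch-all behavior using only the standard library (the settings.conf filename is hypothetical, and the class names in the snippet above are illustrative), iostreams can be told to throw on failure, and a final catch of std::exception picks up whatever we did not anticipate:

#include <exception>
#include <fstream>
#include <iostream>

// Open a file with exceptions enabled: failures throw instead of
// returning a sentinel value the caller might forget to check.
std::ifstream open_some_file(const char* filename) {
  std::ifstream f;
  f.exceptions(std::ifstream::failbit | std::ifstream::badbit);
  f.open(filename);
  return f;
}

int main() {
  try {
    std::ifstream f = open_some_file("settings.conf");  // hypothetical filename
  } catch (const std::ios_base::failure& e) {
    std::cerr << "could not open file: " << e.what() << '\n';  // the case we forecast
  } catch (const std::exception& e) {
    std::cerr << "unexpected error: " << e.what() << '\n';     // everything else
  }
  return 0;
}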

Now the library or operating system author is responsible for throwing or raising some kind of “out of band” signal when something abnormal is happening. If the programmer is following good practice and catching most exceptions, then most faults are temporary and gracefully handled in proximity to the issue. Even an unexpected, unpredicted fault will result in richer and more useful output:

Unhandled exception at 0x8380FF0A in Program.exe: Microsoft C++ exception: std::bad_alloc at memory location 0x000EEEE. occurred

Even these “unhandled” exceptions can be caught at some higher level, allowing a program to gracefully close out critical resources and prevent files from being corrupted. There’s still some level of prediction required, but now we’re focused on global fault-tolerance and recovery, instead of trying to forecast every possible error at the source.
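
A minimal sketch of that higher-level safety net (run_application and flush_and_close_logs are hypothetical stand-ins, and here run_application simply throws to exercise the path): wrap the program’s real work in one last try/catch in main, report whatever arrived, and close out resources before exiting.

#include <cstdlib>
#include <exception>
#include <iostream>
#include <stdexcept>

// Hypothetical stand-ins for the real application.
void run_application() { throw std::runtime_error("disk full"); }
void flush_and_close_logs() { std::clog.flush(); }

int main() {
  try {
    run_application();
    return EXIT_SUCCESS;
  } catch (const std::exception& e) {
    // Last-chance handler: the fault wasn't predicted at its source,
    // but we can still report it and shut down cleanly.
    std::cerr << "fatal: " << e.what() << '\n';
  } catch (...) {
    std::cerr << "fatal: unknown exception\n";
  }
  flush_and_close_logs();  // close out critical resources before exiting
  return EXIT_FAILURE;
}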

Becoming Anti-Fragile #

Consider Postel’s Law, a ‘fault-tolerant’ way of interfacing between software components: be liberal in what you accept, and conservative in what you send. It’s also a good way to communicate with people: accept as much input as possible, be precise about your output, and when you cannot generate useful information for someone, throw an exception, raise an error, or otherwise signal what’s happening.
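
Here is a small sketch of that idea in code (parse_flag is a made-up function, not from any particular library): it accepts generous variations of input, emits exactly one of two precise answers, and throws when it cannot produce anything useful rather than guessing silently.

#include <algorithm>
#include <cctype>
#include <stdexcept>
#include <string>

// Postel's Law in miniature: liberal in what we accept,
// precise in what we send, loud when we can't do either.
bool parse_flag(std::string input) {
  // Liberal input: trim surrounding whitespace and ignore case.
  auto not_space = [](unsigned char c) { return !std::isspace(c); };
  input.erase(input.begin(), std::find_if(input.begin(), input.end(), not_space));
  input.erase(std::find_if(input.rbegin(), input.rend(), not_space).base(), input.end());
  std::transform(input.begin(), input.end(), input.begin(),
                 [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

  // Precise output: exactly true or false...
  if (input == "true" || input == "yes" || input == "1") return true;
  if (input == "false" || input == "no" || input == "0") return false;

  // ...and a clear signal when no useful answer can be produced.
  throw std::invalid_argument("parse_flag: cannot interpret \"" + input + "\"");
}

int main() {
  bool verbose = parse_flag("  YES \n");  // liberal: accepted as true
  try {
    parse_flag("maybe");                  // signalled, not silently guessed
  } catch (const std::invalid_argument&) {
    // handle or report the bad input
  }
  return verbose ? 0 : 1;
}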


  1. Epstein, David J. Range (p. 21). Penguin Publishing Group. ↩︎

  2. Duckworth, Angela. Grit: The Power of Passion and Perseverance (p. 122). Scribner. ↩︎

  3. Taleb, Nassim Nicholas. The Black Swan: The Impact of the Highly Improbable. ↩︎

  4. Taleb, Nassim Nicholas. Antifragile: Things That Gain from Disorder. ↩︎