from Guide to Hacking on Dec 04, 2022

How to debug black boxes

Debugging can be difficult when you're working with a black box — opaque either because its an undocumented mess or because its a wholly uninterpretable neural network.

Debugging is a pain, but it's largely a structured process. What's more — this process doesn't change a whole lot across roles. You could be a front-end engineer or an applied researcher. In any capacity, your debugging approach remains the same in principle. We'll start with some general debugging principles, then discuss common mistakes that unnecessarily prolong the debugging pain.

What you need: Expected vs. reproducible, actual behavior.

This section establishes several assumptions and basic assumptions. It may be the first time you're seeing them codified, but if you're an experienced coder, this would be also already be your bread and butter. To successfully debug, we need to satisfy three prerequisites:

Know the desired behavior. We need to understand what how the system is supposed to behave. For example, if we're debugging a web application, we should understand which page should load. If we're debugging a training algorithm, we should know what accuracy to expect.
Know the actual behavior. Be able to describe the actual behavior. For example, if we're debugging a mobile application, be able to relay the error message and code in the popup notification. If we're debugging a data augmentation, be able to describe the desired effect on the image.
Reproduce the erroneous behavior. Be able to repeatedly trigger the same erroneous behavior. For example, if we're debugging a game, we should have the right commands, or series of UI interactions that produce the same undesired behavior.

If these prerequisites² are not met, the issue can't be fully debugged¹.

For now, let's assume these prerequisites are met. With these, we can then move forward with isolating the error.

Find the "root cause".

We hear about "root cause" quite often, and it's a pretty hackneyed term. With that said, we'll break down what this means using program bugs of different levels of complexity. To find the bug, we have two options:

Isolate the erroneous input. Test varying inputs and configurations until we find the one knob that results in undesirable behavior. At that point, the problem could be user error or simply invalid input your application should catch sooner.
Isolate the erroneous lines of code. Effectively, perform a binary search to find the lines of code causing issues.

Note that this binary search, when searching for erroneous lines of code, occurs in one of three ways:

Using the error traceback, navigate directly to the line of code causing problems. Not much of a search.
Search the code. This could range from print statements, to proper debugger checkpoints, to commenting out swaths of code.
Bisect commits. If you had a working version of the codebase, find the last version of the codebase that fully worked. This happens often with production code, where version control is mostly adhered to. However, this doesn't always exist in practice — either you've made changes that aren't yet committed, or you're not working in a strictly version controlled repository. Knowing this, we won't cover commit bisection, as it's conceptually straightforward.

In short, our first steps in debugging a program involve the following:

Understand desired vs. actual behavior.
Reproducible undesirable behavior.
Isolate undesirable behavior to inputs or code.

Here are two short programs and example trains of thought, for debugging these programs.

Search Method #1: Isolate code, using traceback.

Say we have the following example program, saved in main.py:

main.py

fruit_to_price = {'apple': 1, 'banana': 2}
fruit_to_price['coconut']

Running this Python code would result in the following error.

output

Traceback (most recent call last):
  File "main.py", line 2, in <module>
    print(fruit_to_price['coconut'])
KeyError: 'coconut'

This is pretty straightforward: KeyError means that our key, 'coconut' is not a key in the provided dictionary. As we can see in the traceback above, line 2 in our file main.py tries to access the 'coconut' key, so this is the line that needs fixing. This is the first search method, where the line of code throwing an error is exactly the root cause.

We now know line 2 is the problem. We can fix this by changing the key to one that exists, such as 'apple'.

main.py

fruit_to_price = {'apple': 1, 'banana': 2}
fruit_to_price['apple']

Search Method #1: Isolate code, by following traceback.

Save we have the following program now.

main.py

fruit_to_price = {'apple': 1, 'banana': 2}

def get_price(fruit):
    return fruit_to_price[fruit]

get_price('apple')
get_price('coconut')

Running this Python code would result in the following error:

output

Traceback (most recent call last):
  File "main.py", line 7, in <module>
    find_price('coconut')
  File "main.py", line 4, in find_price
    return fruit_to_price[fruit]
KeyError: 'coconut'

Just like before, KeyError means that our key, 'coconut' is not a key in the provided dictionary. However, we now need to find out which part of the code needs fixing — in other words, which line of code starts passing around the 'coconut' key.

First, reference the last line starting with File.... This tells us line 4 in main.py is causing problems, but this doesn't give us the full story. All we see is return fruit_to_price[fruit]. What is fruit?
Reference the second-to-last (or first) line starting with File.... This tells us line 7 in main.py was the previously-executed line of code. This line find_price('coconut') tells us that fruit was 'coconut'.

Now we know that line 7 needs the fix. We change 'coconut' in that line to a key that exists, such as 'banana'.

main.py

fruit_to_price = {'apple': 1, 'banana': 2}

def get_price(fruit):
    return fruit_to_price[fruit]

get_price('apple')
get_price('banana')

Search Method #2: Search the code.

Sometimes, the traceback is not enough to isolate the error. Say we have the following program.

main.py

def maybesubtract3(x):
    if x > 5:
        return x - 3

def main():
    a = maybesubtract3(7)
    b = maybesubtract3(a)
    c = maybesubtract3(b)
    return c

main()

Running the above program will yield the following error.

output

Traceback (most recent call last):
  File "main.py", line 11, in <module>
    main()
  File "main.py", line 8, in main
    c = maybesubtract3(b)
  File "main.py", line 2, in maybesubtract3
    if x > 5:
TypeError: '>' not supported between instances of 'NoneType' and 'int'

As a start, we can attempt to repeat our process in the previous step. The error means that we incorrectly compared a None object with a number. Let's try to find out which number has magically become None by accident.

Again, starting from the last call in the traceback, we navigate to line 2. This line if x > 5: tells us that x is the offending None object.
In the second-to-last call, we navigate to line 8 and see c = maybesubtract3(b). Here, we see that b is the offending None object.
Rather than continue moving up in the traceback, we should now see where b is defined. In the previous line 7, we see b = maybesubtract3(a). Let's see what's in maybesubtract3 and why it returns None.
If we navigate to the function maybesubtract3, we find an if-else statement: If x is greater than 5, return a number. However, if x is less than 5, it returns nothing. In other words, maybesubtract3 returns None when x <= 5!

That brings us to our root cause: The problem is that maybesubtract3 returns None when x <= 5. Success! This is the "root cause" of the problem.

Now we know maybesubtract3 needs an else condition. Here's an example fix, where we return 0 in the else condition.

main.py

def maybesubtract3(x):
    if x > 5:
        return x - 3
    else:
        return 0

def main():
    a = maybesubtract3(7)
    b = maybesubtract3(a)
    c = maybesubtract3(b)
    return c

main()

Note that we could have debugged this in several different ways:

We could also also set a breakpoint at line 6, right after we define def main():. After the first line, we find that y becomes 4. Then, as we step through the program, we'd find that in the next line, y becomes None. This would bring our attention to maybesubtract3 and to the same conclusion.
We could have also inserted a butt load of print statements, printing out all of the y values. Again, that would have shown us which y was None and brought us to the same conclusion.

In this example, we successfully debugged the program by finding the offending logic in maybesubtract3. Even if the details were muddy, don't worry — the takeaway is to follow the line of clues until you arrive at a design decision in the codebase.

In the sections below, we'll talk about obstacles that make real-world debugging more challenging.

What if the traceback is obfuscated?

For a myriad of reasons, your traceback may be obfuscated or not reported at all. For example, some programs may swallow and never report errors. For any of these scenarios, follow these two steps.

Check if the exception is nested.

First, before working off of nothing, make sure the traceback isn't nested. If an exception is caught, and the handler throws its own exception, your original exception may be buried in several levels of exceptions. Here's an example. Take the following program main.py.

main.py

try:
    raise IndexError()
except IndexError:
    raise NotImplementedError()

After executing this program, you'll find a longer traceback than usual:

output

Traceback (most recent call last):
  File "main.py", line 2, in <module>
    raise IndexError()
IndexError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 4, in <module>
    raise NotImplementedError()
NotImplementedError

As you can see, there are two tracebacks. The top-most one is the root cause of the problem.

Eliminate traceback suppressors.

Second, there are several common causes of unreported tracebacks, which you can eliminate on a case-by-case basis.

In multithreaded programs, the traceback can be obfuscated, as the main thread is only aware that a thread died — not of the traceback for that thread. To fix this, force a single-threaded program, or force synchronous execution. For example, in PyTorch, to force all CUDA kernel code to run synchronously and report tracebacks normally, use CUDA_LAUNCH_BLOCKING=1.
Poor coding hygiene can lead to swallowed errors, but more often than this, backend systems can swallow errors without reporting back to the front-end. Knowing this, make sure to enable logging and check logs for all related systems. In Python specifically, you can minimize swallowed errors by ensuring that except statements are more specific than except Exception.
In the absolute worst case, perform a naive binary search code on the code or bisect commits. In code, comment out or otherwise eliminate codepaths until the error is eliminated, or until the desired behavior is achieved.

What if the program takes a long time to execute?

In many applications, there can exist a long iteration time — as in, it takes minutes, hours, or days to try a new fix or reproduce the error. For example, you could be running a large production-ready application, or you could be debugging a neural network's sudden drop in accuracy. To address this, you have two options:

Find a proxy input. For a large production-ready application, find the arguments to your subprogram, or find the inputs to your function. If you can serialize some of the input objects (e.g., pickle), use that instead of rerunning your application every time. If you can save request headers or response payloads, read from and write to JSON files accordingly. In deep learning, see if you can train on a smaller version of the dataset — fewer samples, lower resolution, or fewer features.
Find a proxy output. Again for large production-ready applications, instead of checking the final outputted webpage, check the response payload. Determine the expected and the actual responses. For deep learning applications, instead of looking at the final accuracy, check validation accuracy throughout training. Plot the training curve for a correct run and a faulty run. Instead of using a test video at the end of training, visualize validation predictions. Visualize predictions for both the correct and the faulty models.

With enough proxies for both inputs and outputs³, you should be able to significantly speed up your iteration time for debugging. Especially in black-box scenarios for production applications or deep learning, iteration time can be a major bottleneck. Get this right, and debugging will be far less painful.

Avoid deep bugs: Test incrementally.

One common source of nasty bugs lies in how we build software. Namely, don't code the entire project, then test it at the end. Then, you'll spend more time binary searching your project code for the one faulty if-else block. Instead, build the program in increments and test each piece separately. As you add pieces, test different combinations of pieces as you add each one.

Deep learning practitioners should do the same: Instead of building a massive neural network and wondering why it doesn't learn, start with a modular piece that you know should work. It doesn't need a "small" piece with few parameters per se. However, this should be a piece that should obviously work (obvious to you) — for example, start with a fully convolutional single-branch network. Or, start from a model you've previously trained.

By building incrementally, you can save yourself headaches from the get-go, saving debugging time by catching bugs early.

← back to Guide to Hacking

What if the error isn't reproducible? There isn't fantastic advice for this scenario. One possibility is there's an additional input you've missed. For example, maybe time plays a role in the codepath. Failing all else, either you guess the problem with divine insight or related errors lead you to the root cause. The most common scenario for this, is when the error is stochastic. In which case, you may have a race condition (two threads running at the same time) or time-sensitive code (one that behaves differently in am or pm). For the former, you may force a single-threaded program, and for the latter, you may fix the program time. ↩
The above is why every developer support forum asks for the same information: Describe the error, and provide a minimal reproducible example. ↩
For training neural networks in particular, make sure to always always keep tabs on training progress. This means plotting losses, validation performance, and visualizing predictions wherever possible. A long running training job should not be left to its devices; in essence, you must "babysit" your training jobs. This babysitting means that you should be able to lower iteration time when debugging or assessing models. Very rarely — after collecting the appropriate curves — do you need to wait until the end of the training job to confidently declare success or failure. ↩