from Guide to Hacking on Nov 5, 2023
How to be a "good" AI-powered debugger
One common question is: Will ChatGPT render coding obsolete? The truthful answer is — yeah, kinda. At the very least, it obsoletes coding as we know it. However, the need for code doesn't go away.
We still need coders, but all coders — even junior ones — will need to learn how to curate, edit, and review code instead of writing from scratch. In this post, we'll discuss the core elements of a good AI-powered coder and what to look out for as a code curator.
The flip side is true too: Coders can all benefit from AI-generated code, to spend less time typing and more time making higher-order decisions about the code. This is a significant productivity boost, in and of itself.
How code-less coding works
In theory, asking ChatGPT to write some code is straightforward:
- Prompt ChatGPT with the problem setup, asking it to generate a code sample with certain specifications.
- Copy the generated code back into your codebase, and run it.
- If it succeeds, we're done.
- If there's an error, copy and paste the stack trace directly into ChatGPT, and repeat the copy-run cycle until there are no more errors.
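To make the loop concrete, here's a minimal sketch of what it could look like if automated with the OpenAI Python client. The client usage, model name, and run_snippet helper are illustrative assumptions; the workflow above is described as a manual copy-paste loop in the ChatGPT web interface.
import subprocess
import tempfile

from openai import OpenAI  # assumes the openai package and an API key are set up

client = OpenAI()

def run_snippet(code: str) -> str | None:
    """Run the generated code in a subprocess; return stderr if it fails."""
    with tempfile.NamedTemporaryFile("w", suffix=".py", delete=False) as f:
        f.write(code)
        path = f.name
    result = subprocess.run(["python", path], capture_output=True, text=True)
    return None if result.returncode == 0 else result.stderr

prompt = "Write a Python function that ..."  # step 1: describe the problem setup
for _ in range(5):  # give up after a few round trips
    reply = client.chat.completions.create(
        model="gpt-4",  # the model name is an arbitrary choice for this sketch
        messages=[{"role": "user", "content": prompt}],
    )
    code = reply.choices[0].message.content
    # (In practice you'd strip markdown fences from the reply before running it.)
    error = run_snippet(code)  # step 2: run the generated code
    if error is None:
        break  # step 3: no errors, so we're done
    # step 4: feed the stack trace back and try again
    prompt = f"That code raised this error:\n{error}\nPlease fix the code."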
At a surface level, it seems coding is "solved". Unfortunately, with this approach, your prompt lacks a lot of much-needed context. Even just 500 lines of omitted code could mislead ChatGPT into producing a suboptimal code modification. Maybe an already-implemented function gets partially re-implemented elsewhere, or an abstraction barrier is torn apart.
Someone needs to have context — either you or the model. Issues like this one aren't insurmountable, but these are aspects of AI-powered coding that need human supervision and curation.
Below, we'll talk through a general coding principle, then discuss how to uphold that principle even when working with models such as ChatGPT. As the context issue above suggests, the predominant AI-powered workflow is extremely prone to poor coding practices.
Principle: Fix the cause, not the symptom
There are many ways to silence an error or exception in your codebase, and the worst of these ways can degrade — not enhance — your codebase. These "fixes" are what breed unwieldy, unpleasant code, and unfortunately, a ChatGPT prompt with limited context leads to exactly this. First, let's understand properties of desirable and undesirable bug fixes.
- "Fixes" you don't want: At one end of the spectrum, you have monkey-patch "fixes" that contribute to code rot.
  - This could be a passthrough exception handler that muddles stack traces, or a similarly surface-level patch that damages readability and extensibility. (A sketch of this style of patch follows this list.)
  - The core issue is that these "fixes" are highly localized. They address symptoms, such as a missing key, but not the root cause: why the key was missing in the first place. In short, an understanding of why the error happens is lacking.
  - This is all too easy to do: prompt ChatGPT with just the function you're writing and the key error, and the fix inevitably ends up being highly localized. The model simply doesn't have enough context to suggest anything else.
- Fixes you do want: At the other end of the spectrum, you have bug fixes that truly address the root cause.
  - In short, these code changes address data validation and sanitization as early as possible; they change core parser behavior while keeping the API unchanged; or they define an improved abstraction to enable new features.
  - The core requirement is that you understand where the error comes from and address it there. The fix could be making assumptions explicit with assertions as early as possible, or catching and handling invalid data as soon as possible.
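Here's a minimal sketch of the monkey-patch style to avoid; update_total is a hypothetical function used only for illustration. Both variants make the immediate error disappear without explaining why the data was bad in the first place.
def update_total(totals, key, amount):
    try:
        totals[key] += amount
    except TypeError:
        pass  # the symptom (a bad amount) is silenced, never explained

def update_total_quietly(totals, key, amount):
    try:
        totals[key] += amount
    except TypeError:
        # A passthrough handler: `from None` discards the original traceback,
        # replacing a specific error with a vaguer one.
        raise RuntimeError("could not update totals") from None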
In short, bug fixes should come after a thorough understanding of what's causing the issue. If you don't know why it's broken, you won't know why a fix works — if at all. And it will always come back to bite you. Even if these fixes handle errors, successive monkey patches start to eat away at your codebase's scalability and maintainability.
We talk about debugging thoroughly and quickly in How to debug black boxes. In short, you can debug faster by isolating as many components as possible — for example, by bisecting commits to isolate changes or mocking objects to isolate parts of a library.
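For the mocking half of that advice, here's a minimal sketch; fetch_results and count_games are hypothetical functions. The point is that the flaky network call is patched out, so only the counting logic runs during the test.
from unittest import mock
import urllib.request

def fetch_results(url):
    # The slow, flaky component we want to isolate away while debugging.
    with urllib.request.urlopen(url) as response:
        return response.read().decode()

def count_games(url):
    return len(fetch_results(url).splitlines())

def test_count_games():
    # Patch out the network so the test exercises only count_games itself.
    with mock.patch(f"{__name__}.fetch_results", return_value="1-0\n2-2\n"):
        assert count_games("https://example.com/results") == 2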
Tip: Curate context for AI
Let's see this misdirection from AI in action. Below, we'll plug two programs into ChatGPT and ask it to help us debug. In both scenarios, we'll show a prompt with insufficient and then sufficient context. The goal is to get a rough sense of what context both you and the AI need to correctly diagnose and fix the issue.
Example. Say you're writing a command-line program to download soccer game results. After running the program, you get the following traceback.
Traceback (most recent call last):
  ...
  File "path/to/file.py", line XX, in add_score
    stats[team] += num_goals
TypeError: unsupported operand type(s) for +=: 'int' and 'NoneType'
You find that the relevant part of the program looks like the following.
def add_score(stats, team, num_goals):
    stats[team] = stats.get(team, 0)
    stats[team] += num_goals
We clearly need to handle this incorrect type: we expected an integer but instead got None. There are two possible approaches:
Approach #1. One way is to handle the None explicitly: if num_goals is not an integer, do nothing; exit early from the function and call it a day. Alternatively, since we're incrementing subtotals by num_goals, we could set num_goals to 0 when it has an invalid data type, without altering the function's behavior. Here's an example fix by hand:
def add_score(stats, team, num_goals):
    if num_goals is None:
        return
    stats[team] = stats.get(team, 0)
    stats[team] += num_goals
Let's plug the above into ChatGPT and see what bug fix it suggests. We prompt our model with the above snippet and traceback; the returned program, according to our chat log, is the following.
def add_score(stats, team, num_goals):
    if stats.get(team) is None:
        stats[team] = 0
    stats[team] += num_goals
ChatGPT assumed stats[team] is None, whereas in our manual fix above, we assumed num_goals is None. These are both reasonable assumptions, but no matter how hard I tried, I couldn't get ChatGPT to realize that num_goals could be the problem. In any case, ChatGPT doesn't properly debug the issue for us, so let's move on to fixing it with a better prompt.
- Root cause. Upon closer inspection, this data type error should look familiar for a CLI: Python's argument parser returns None when no value is passed to an optional argument. Whether the fix above came from our own reasoning or from ChatGPT, it has the same downfall: since the parsed num_goals value is shared globally, this None masquerading as an integer may be used elsewhere.
- Downfall of addressing symptoms. By addressing only the symptom, our "fix" above causes cleanliness problems. To accommodate this rogue None, the next person to hit the same issue (possibly yourself) may duplicate that None-handling code elsewhere, as in the sketch below. In short, by not addressing the root cause, we lay the foundation for redundant "fixes".
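Here's a small sketch of how symptom-level fixes multiply; print_summary is a hypothetical second consumer of the same parsed value. Both functions end up repeating the same workaround because the rogue None was never fixed at its source.
def add_score(stats, team, num_goals):
    if num_goals is None:  # copy #1 of the workaround
        return
    stats[team] = stats.get(team, 0) + num_goals

def print_summary(team, num_goals):
    if num_goals is None:  # copy #2, added by the next person to hit the bug
        num_goals = 0
    print(f"{team}: {num_goals} goals")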
Approach #2. If we address the root cause directly, our bug fix can potentially be much simpler: Set a default integer value for that optional flag, like below.
parser.add_argument('--num-goals', type=int, default=0)  # parse as an int, defaulting to 0 when the flag is omitted
We only knew this because we dug deeper to find the original definition of num_goals. Let's prompt ChatGPT again, but this time with the additional context we used to make a more robust fix. According to the chat log, ChatGPT identified the same root cause (that an optional argument wasn't supplied) but proposed a different fix: enforce an int type and give up if the user doesn't supply a value.
parser = argparse.ArgumentParser()
parser.add_argument('--num-goals', type=int)  # Specify the type as integer
args = parser.parse_args()

# Check if args.num_goals is not None before passing it to add_score
if args.num_goals is not None:
    add_score({}, 'hello', args.num_goals)
else:
    print("Please provide a value for --num-goals.")
With that said, ChatGPT's approach effectively makes --num-goals a required argument, which is already supported by the argument parser. If --num-goals is rewritten to be mandatory, the snippet above should instead be the following, using argparse's required-argument feature.
parser = argparse.ArgumentParser()
parser.add_argument('--num-goals', type=int, required=True)
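For completeness, here's roughly what an end-to-end fix could look like with the default-value approach; the team name and surrounding scaffolding here are made up for illustration. With required=True instead, argparse itself reports the missing flag and exits, so no None ever reaches add_score.
import argparse

def add_score(stats, team, num_goals):
    stats[team] = stats.get(team, 0)
    stats[team] += num_goals

parser = argparse.ArgumentParser()
# Root-cause fix: the flag always parses to an int, defaulting to 0 when omitted.
# (Swap default=0 for required=True if the flag should be mandatory.)
parser.add_argument('--num-goals', type=int, default=0)
args = parser.parse_args()

stats = {}
add_score(stats, 'arsenal', args.num_goals)
print(stats)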
So now, ChatGPT's debugging has really failed us in two ways rather than just one:
- In the example for Approach #1, ChatGPT provided a localized fix that addresses only the symptoms, leading to potentially redundant code down the line. This wasn't really the model's fault though, because we didn't provide sufficient context.
- In the example here for Approach #2, ChatGPT actually identified the parser as the root cause but then proposed a solution that was overly verbose, re-implementing features already available in Python's standard library.
Ignoring the second failure for now, the trick is to provide context to AI. Even if the fix isn't the cleanest, it at least has a chance of working, and it identifies a set of possible root causes to investigate. If you dig through that last ChatGPT chat log, you'll find that even with the right context, the model doesn't pin down the root cause exactly right, but it at least points to argument parsing as the issue.
There's some finesse in understanding what context to provide to AI when debugging. Too much context may simply be cumbersome to collect and paste into a prompt, if it doesn't already exceed the model's context-size limit. Here are a few suggestions:
- If you encounter a type error, include code snippets where your input variable is altered or defined. Note that "getters" may sometimes provide default values, such as {}.get('a') producing None; see the short sketch after this list.
- If you encounter an index error, include code snippets where your data structure is created or updated: any slicing operations, pops, deletes, or insertions. If keys or indices are themselves variables, recurse on those variable definitions as we discussed above.
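As a tiny illustration of the first tip, here's the kind of definition chain worth pasting alongside a TypeError; load_config and the timeout value are hypothetical. Each line shows where the offending variable picked up its value.
def load_config(overrides):
    # .get() without a default returns None for a missing key -- worth including
    # in the prompt, since this is where the None is born.
    return overrides.get('timeout')

timeout = load_config({})  # None sneaks in here, far from where it blows up
# timeout + 5  -> TypeError: unsupported operand type(s) for +: 'NoneType' and 'int'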
Naturally, the above tips are useful for you as a developer to find the bug yourself. However, as we saw above, this is also important context for the model to leverage, as a part of its debugging suggestion.
Takeaway
Above, we touched on a number of different factors that influence code cleanliness — re-implementing features from a standard library, addressing type errors as soon as possible, among others — but they all share one characteristic: Clean code reduces redundancy. There are limits to this that we'll discuss later, but in short, reduce redundancy by:
- Understanding what functionality standard libraries and your own utilities already provide. Not all de-duplication is desirable, but most of it is. For example, above we used the required=True keyword argument for the argument parser, instead of manually enforcing that an optional argument is required.
- Centralizing redundant implementations in a utility, and ensuring that the utility is updated directly, instead of the utility's consumers. For example, we may find many separate lines of code that check that a score is an integer and non-negative, such as assert isinstance(score, int) and score >= 0. Let's move that into a utility, assert_valid_score, sketched below. Furthermore, if we have future checks for score validity (e.g., that it's a multiple of 15), we should add those checks to assert_valid_score.
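Here's a minimal sketch of that utility; the multiple-of-15 rule is the hypothetical future check mentioned above.
def assert_valid_score(score):
    assert isinstance(score, int) and score >= 0, f"invalid score: {score!r}"
    # Future validity checks (e.g., score % 15 == 0) go here, so every caller
    # picks them up automatically.

def add_score(stats, team, num_goals):
    assert_valid_score(num_goals)  # one shared validation path for all callers
    stats[team] = stats.get(team, 0) + num_goals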
So in conclusion, working with AI can certainly produce working code that addresses the error you presented. However, it's not always perfect, and the true value of AI lies in its ability to provide suggestions and directions. In our examples above, AI-generated code fixed the error but wasn't ideal; it's our job to ensure AI-generated code reduces redundancy, just as a quality coder would. More generally, it's your job to curate AI-generated code.