from Guide to the Job Hunt on Jul 16, 2023
How to succeed at AI design interviews
AI design interviews are similar to system design interviews, where you're given a generally broad prompt — only now, the prompt involves a model of some form, such as "Track a football".
Since this is a design-focused interview, many of the overarching principles in an open-ended AI design interview mirror those of system design interviews — in short, you're brainstorming ideas with a future colleague, and your role is to organize, distill, and drive the conversation.
Prepare for the unspoken agenda: Ask, discuss, code, test.
The unspoken agenda is very similar to the coding interview's. Let's walk through the general structure of a response in an AI design interview. Knowing the unspoken agenda will make you seem more prepared and your proposal more cogent.
- Interviewer poses question. The interviewer provides a very simple prompt, such as "Track a football".
- You ask questions. You clarify the scope and requirements for the problem.
- Simple baseline approach. Define the task — what data, model, metrics. The quicker to a prototype, the better.
- Iterate on design. Build up in complexity, possibly by addressing interviewer feedback.
- Monitor post-deploy. Discuss how to monitor, evaluate, and improve model performance in production.
Let's now break down each of these steps in detail.
Step 1. Interviewer poses question. The interviewer poses a short, simple prompt.
- They may provide a few requirements, such as the inference speed of the model, to get you started. As we'll discuss in the next section, you should be prepared to ask for these statistics — or make up some — even if they're not provided.
- As opposed to coding interviews and behavioral interviews, the design interview at this point is completely open-ended. It's easy to get caught off guard. Half the battle is knowing where to get started in your response, so this framework for your response is important to know.
Step 2. You ask questions. Ask a few standard questions to understand the scope of the challenge. For example, predicting in real-time on a mobile device is far different from maximizing throughput for a model in the cloud.
- Clarify end usage. Understand what the predictions are being used for in the end product.
- The final usage will determine what outputs your model ultimately needs to provide. For example, "tracking an athlete" could mean segmentation, detection, or keypoint estimation.
- Say you're tracking a football as above. If the end goal is to highlight the football in a live sportscast, then your model should segment the football pixel for pixel. Alternatively, the goal may be to crop the live camera feed to focus in on the ball carrier; if that's the case, then your model could simply detect bounding boxes instead.
- Clarify available sensors. Understand what your model can reasonably accept as input.
- The sensors can drastically change the difficulty of your modeling task. In some cases, you're allowed to set up the sensor system, picking both the sensors themselves (e.g., infrared vs. RGB camera) and their placement.
- For example, to track a football, one way to reduce the problem is to place a micro-pressure sensor on each athlete's palm and bicep. Although there will be some false positives for each sensor individually (e.g., the player shoving another player), there are hopefully fewer false positives for both sensors being engaged simultaneously. This isn't realistic given the size of pressure sensors today, but it's one way to (attempt to) use sensors to reduce the problem.
- Another example would be to place an RFID tag on each player and an RFID reader in the ball. The violent handling of the ball may damage the reader, but a player in possession of the ball for a period of time could possibly be identified. As silly as these ideas are, shortcuts are useful for both simplifying the problem and for collecting accurate ground truth data alike.
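To make the sensor-fusion intuition concrete, here's a back-of-the-envelope calculation, assuming (generously) that the two sensors' false positives are independent. The rates are hypothetical:

```python
# Back-of-the-envelope: requiring BOTH noisy sensors to fire at once.
# Assumes independent false positives, which is optimistic in practice.
p_palm = 0.10   # hypothetical false-positive rate, palm sensor
p_bicep = 0.15  # hypothetical false-positive rate, bicep sensor

p_both = p_palm * p_bicep  # probability both fire spuriously at the same time
print(f"Combined false-positive rate: {p_both:.3f}")  # 0.015, far below either alone
```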
- Clarify setting. Understand where and how the predictions are being used.
- The setting determines resource constraints that are imposed on your model. Will you need to predict in real-time? What resource constraints — latency, power, memory, storage, etc. — does your model need to meet? Do you have access to server-grade GPUs or just on-device compute?
- Say you're continuing the previous example and have now decided to crop a live camera feed. If you're working for NBC and they're looking to broadcast the game, you can host your model in the cloud on a beefy server. However, if you're working for a startup with a hot new mobile app that processes the user's camera feed, you likely want to run a local model for real-time updates to the camera preview.
Step 3. Simple baseline approach. Design a minimal baseline that can be understood and tested quickly. Your interviewer may egg you on toward a more thorough solution, but your goal is to touch on as many concerns as possible, to see which your interviewer wants to hear more about.
We discussed some radical ways of simplifying the problem above. Here, you're not simplifying the problem but are instead setting up the infrastructure. Touch on all aspects of the system you'll need to set up.
- Data. Discuss how you'll collect data. It could be web-crawled and human-annotated as a start.
- Are there ways to obtain annotated data for "free"? In the ideal case, there's a pre-curated dataset available, or a related dataset you can cleverly repurpose. Pseudo-labels using a pre-trained model are certainly possible, but these are noisy. Using additional sensors for ground truth as we discussed above is another way. Or, there may be unannotated data you can pretrain on.
- If "free" ground truth is not available, are there ways to make annotations more cost-efficient? For example, label every 10 frames in a video and use simple kinematics models to propagate labels across frames. Alternatively, ask annotators to refine pseudo-labels instead of annotating from scratch.
- Model. Discuss different possibilities for the model and associated architecture. Start simple, with a known and popular architecture.
- Is there a pretrained model for a related problem you can use? For example, an action detection model for football players, or a volleyball tracker. Discuss how you would fine-tune this model to detect footballs instead. For example, the action detection model would need localization information added, possibly with CoordConv.
- Is there an architecture that is tailored to this problem? Identify a property of the data or task that is unique to this problem, and discuss how the architecture is adjusted to handle it. For example, football players are often all rushing toward the ball. Our model may thus need to leverage context up to a certain distance to help determine where the ball is.
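Since CoordConv came up, here's the gist of it in a few lines: append normalized coordinate channels to a feature map so downstream convolutions can reason about absolute position (a PyTorch sketch with illustrative shapes):

```python
import torch

def add_coord_channels(x):
    """CoordConv-style trick: concatenate normalized (x, y) coordinate
    channels so downstream convolutions see absolute position."""
    b, _, h, w = x.shape
    ys = torch.linspace(-1, 1, h).view(1, 1, h, 1).expand(b, 1, h, w)
    xs = torch.linspace(-1, 1, w).view(1, 1, 1, w).expand(b, 1, h, w)
    return torch.cat([x, ys, xs], dim=1)  # (b, c + 2, h, w)

feat = torch.randn(2, 64, 32, 32)
feat_with_coords = add_coord_channels(feat)  # (2, 66, 32, 32)
```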
- Loss. Describe the loss function you would use to train the model on the provided data.
- Classification would be straightforward; use cross entropy. However, you have an array of different options available to you if you're first pretraining on unlabeled data. Additionally, you may add different regularizers to training, based on the problem.
- For example, the football is often and easily occluded by players, either partially or even fully. To handle this, we may randomly mask and infill parts of the ball during training to imitate occlusion, making the model more robust to real-world occlusions. Here, you should also note the caveat: the infill may be unrealistic, inadvertently giving the model extra signal about where the ball is — a signal present in training data but absent in test data.
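Here's a minimal sketch of that occlusion augmentation, with the caveat baked in as a comment; the patch size, fill value, and probability are arbitrary starting points:

```python
import torch

def random_occlusion(image, ball_xy, max_size=16, p=0.5):
    """Randomly mask a patch near the ball to imitate occlusion by players.

    Caveat: filling with a constant (here, zeros) is unrealistic. The mask
    itself can leak the ball's location, a train/test mismatch to watch for.
    """
    if torch.rand(()) > p:
        return image
    x, y = ball_xy
    s = int(torch.randint(4, max_size, ()))
    _, h, w = image.shape
    x0, y0 = max(0, x - s // 2), max(0, y - s // 2)
    image = image.clone()
    image[:, y0:min(h, y0 + s), x0:min(w, x0 + s)] = 0.0
    return image

img = torch.rand(3, 128, 128)
aug = random_occlusion(img, ball_xy=(64, 80))
```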
- Metric. Describe how your model would be evaluated.
- This could be interesting even for classification. As a start, you can use accuracy as a simple metric. However, you may be more interested in false positives than false negatives. Or, you may be interested in a classifier with a tunable false-positive rate, meaning you now need an ROC curve.
- For example, there's only one football on the field at any given moment. One metric would be how often the model predicts more than one football on the field; this is clearly wrong and an example of a catastrophic failure.
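A catastrophic-failure metric like this is cheap to compute. A sketch, assuming the model outputs a list of detections per frame:

```python
def multi_ball_rate(predictions):
    """Fraction of frames where the model predicts more than one football.

    predictions: list of per-frame detection lists. Since there's exactly one
    ball on the field at any moment, any frame with 2+ detections is a
    catastrophic failure.
    """
    bad = sum(1 for dets in predictions if len(dets) > 1)
    return bad / max(1, len(predictions))

# e.g., 3 frames: one clean, one missed ball, one double detection.
print(multi_ball_rate([[(10, 20)], [], [(10, 20), (300, 40)]]))  # 0.333...
```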
At this point, your interviewer may already be poking and prodding at your proposal, so your interview is most likely going to naturally segue into the next step.
Step 4. Iterate on design. Any part of your above proposal can be iterated on, either to improve quality or to better meet project requirements.
- Before iterating, do your best to pause after the initial version is proposed and explain how your proposal meets the criteria. If your model needs to be real-time on the edge, emphasize which parts of your design allow it to run at 30 frames per second within tight resource constraints.
- You can leave requirements out of the initial version, as long as you acknowledge them. This works well for a particularly challenging requirement, such as a low-power constraint. In this iteration step, you then brainstorm ways to handle it.
Step 5. Monitor post-deploy. After deploying the model, discuss how to monitor and evaluate quality of the deployed model. The goal of your monitoring should be to uncover cases your model performs poorly on.
- Monitoring is non-trivial as it's a "you don't know what you don't know" problem. It's often impossible to know what edge cases to expect. As a result, it's critical to build a pipeline that catches as many of these as possible.
- For example, you could store all low-confidence predictions. This assumes that low confidence is a reasonable proxy for out-of-distribution examples. Another possibility is to sample random frames, annotate them, and collect the incorrectly-predicted frames. Neither is particularly efficient, but both are generic approaches; see the sketch after this list.
- In the above football example, one natural way to identify out-of-distribution examples is actually to transcribe and use information from the sportscaster. Verbal cues may be enough to identify visually-confusing passes or lack thereof.
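Here's the sketch referenced above: a low-confidence filter combined with random sampling for human review. Both rates are hypothetical starting points you'd tune against your annotation budget:

```python
import random

def should_log(confidence, threshold=0.4, random_rate=0.01):
    """Flag a frame for human review. Low confidence serves as a rough
    out-of-distribution proxy; the small random sample also catches
    confident-but-wrong predictions. Both rates are hypothetical."""
    return confidence < threshold or random.random() < random_rate

# e.g., decide per frame at inference time:
# if should_log(conf): review_queue.append(frame_id)
```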
In summary, prepare for the above stages in your design interview. You'll be much better equipped in your own interviews by just knowing the sequence of steps.
Your Rubric: Background, Research, Production
There are a variety of ways interviewers can throw curveballs at you in the design interview. This is done to assess three categories of knowledge and skills. Different teams and companies may use this interview differently, but you can expect the following rubric items in some form.
Apply background knowledge: You should know your basics in back-propagation and transformers, sure. You should also know when to apply what. Although testing your knowledge isn't the point of the interview, testing your application of existing knowledge is. For a primer on transformers, see Language Intuition for Transformers.
- Your prompt is to "Classify the type of detected car" into one of several classes — sports car, family car, etc.
- You propose predicting $k$ logits, one for each class; softmax these logits to produce $k$ probabilities; then, take the highest probability and output the corresponding class. The interviewer may then ask: You discover that softmax is extremely slow to run on your deployment hardware[3]. How do you handle this?
- This involves understanding (1) what softmax itself is and (2) why softmax is applied in this case for the forward pass. See the answer left in the footnotes[1]. Without spoiling the answer: an understanding of what softmax is underpins both the best and second-best answers.
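For concreteness, the proposed forward pass looks something like this (a sketch; the class count and logits are placeholders for your backbone's output):

```python
import torch
import torch.nn.functional as F

k = 5                                # e.g., sports car, family car, ...
logits = torch.randn(1, k)           # stand-in for your backbone's output
probs = F.softmax(logits, dim=-1)    # k probabilities summing to 1
pred = probs.argmax(dim=-1).item()   # index of the most likely class
```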
- Your prompt is to "Speed up Large Language Model inference". Say you're applying for a senior role, in which case you should be familiar with why different methods are being applied — don't just memorize keywords.
- Say you propose applying Flash Attention to reduce latency. One question an interviewer may ask you is: How does Flash Attention reduce latency? Critically, why and when should you use Flash Attention, and when should you not? The answer is left in the footnotes[2].
- This involves both (1) understanding how tiling works for matrix multiplication and (2) how Flash Attention applies tiling to Large Language Models. We covered these two topics in How to tile matrix multiplication and When to tile two matrix multiplies. This is important, because without this knowledge, your "optimization" may actually hurt latency.
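If tiling is unfamiliar, the core idea fits in a few lines: compute the output in blocks so each block's operands can stay in fast memory. A simplified numpy illustration (not how Flash Attention is actually implemented on GPU):

```python
import numpy as np

def tiled_matmul(a, b, tile=64):
    """Blocked matrix multiply: process tile-sized chunks so each chunk's
    operands can stay resident in fast (e.g., shared) memory."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n), dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            for p in range(0, k, tile):
                out[i:i+tile, j:j+tile] += a[i:i+tile, p:p+tile] @ b[p:p+tile, j:j+tile]
    return out

a, b = np.random.randn(128, 96), np.random.randn(96, 64)
assert np.allclose(tiled_matmul(a, b), a @ b)
```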
Conduct research efficiently: You should understand how to break down a complex problem into small, answerable hypotheses. Those hypotheses should then be answered as directly and quickly as possible. For more details, see What defines a "good" researcher?
- Say you're still working on "Classify the type of detected car". This time, you're applying for a senior role, in which case you should be able to debug a black box, e.g., a model that is underperforming. You can learn more in How to debug black boxes.
- You propose using ViT-L as a backbone, then fine-tuning adapters[4] (a.k.a. a few extra fully-connected layers that you inserted) to classify the detected car type. The interviewer posits that your model underperforms tragically, achieving ~40% accuracy for a 10-class problem, and asks how you would handle this low accuracy.
- As we discussed in the debugging post above, the interviewer is looking for an attempt to diagnose the problem before suggesting fixes. You could list 20 ideas to fix the issue, but if there's no hypothesis for the underlying cause, you'll run out of ideas before seeing accuracy improvements. Then, you need a quick and simple way to test this hypothesis.
- One simple hypothesis is that the model is overfitting. Bonus points if you can justify why: In practice, transformer-based vision models require heavy regularization, and on top of that, ViT-L is large, potentially overly so for such a simple 10-class classification problem. There's a simple and quick way to check: Look at the train-val gap.
- A more advanced hypothesis is based on the data. For example, you suggest visualizing misclassified vehicles to uncover a trend. One plausible error would be misclassifying all blue cars, grouping them all under one class. Say your system is vision-only; then color may be a signal your model relies heavily on. You may then suggest other ideas, such as incorporating the vehicle's estimated speed or other sensory input. This would earn extra points for thinking beyond just the model itself.
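Checking the overfitting hypothesis really is this cheap; a sketch with hypothetical numbers:

```python
def diagnose(train_acc, val_acc):
    """Crude read from the train-val gap alone; thresholds are illustrative."""
    gap = train_acc - val_acc
    if gap > 0.2:
        return "large gap: likely overfitting; regularize or shrink the model"
    return "small gap: the model underfits, or the data/labels are the problem"

print(diagnose(0.98, 0.41))  # the ~40% scenario above: overfitting is plausible
print(diagnose(0.45, 0.41))  # same val accuracy, but now look at the data instead
```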
Aware of production needs: You should be aware of real-world deployment concerns for a model — how practical it is to obtain a certain kind of data, how finicky or stable a model is to train, and possible concerns with a model's performance. In short, be able to anticipate and plan for obstacles.
- Say you're still working on "Speed up Large Language Model inference" and that you're again applying for a senior role, in which case you should be able to draw on experience to think bigger picture and connect research to reality.
- You report a particular tokens-per-second latency, measured by initializing the model, starting the timer, running 10 prompts, and stopping the timer. The interviewer reports that you achieve 30 tokens-per-second with this approach and yet, users are finding that the LLM is very slow to respond. Why might this be?
- Your experience should tell you that metrics aren't always trustworthy and may need redefining. In this case, the latency measurement is missing two key parts.
- First, the latency measurement doesn't include cold-start time, which may make the model appear unresponsive while it's being initialized. The user may open your application, then navigate away to a compute-intensive application or game, in which case your model is evicted from cache or possibly kicked from DRAM. This means that when the user navigates back to your application, the model is reinitialized — causing a cold start.
- Second, in a decoder-only model, latency for the first token and for subsequent tokens differs drastically. These two latency measurements are lumped together above but perhaps shouldn't be. In short, the user's perception of unresponsiveness is based on the time it takes for (1) the model to initialize and (2) the model to generate the first token. The 30 tokens-per-second number above doesn't tell us how unresponsive the model may appear.
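A better harness separates cold start and time-to-first-token from steady-state decoding. A sketch, where `init_model` and `generate_stream` are hypothetical stand-ins for your stack:

```python
import time

def profile_latency(init_model, generate_stream, prompt):
    """Split latency into the pieces users actually perceive: cold start,
    time-to-first-token, and steady-state decode speed. init_model and
    generate_stream are hypothetical stand-ins for your stack."""
    t0 = time.perf_counter()
    model = init_model()                      # cold start, often dominant
    t1 = time.perf_counter()

    n, t_first = 0, t1
    for _ in generate_stream(model, prompt):  # assumed to yield one token at a time
        n += 1
        if n == 1:
            t_first = time.perf_counter()     # time-to-first-token
    t_end = time.perf_counter()

    return {
        "cold_start_s": t1 - t0,
        "first_token_s": t_first - t1,
        "decode_tok_per_s": (n - 1) / (t_end - t_first) if n > 1 else 0.0,
    }
```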
The above rubric isn't exhaustive, but this is the general gist of the evaluation.
Practice, practice, practice.
Throughout this post, we introduce a large number of examples and tips. The top tip, however, is none of those. Instead, it's to practice. AI design is so unlike any other interview and job that any amount of related practice is better than none.
- Practice thinking aloud. As we discussed in How to succeed at coding interviews, practice talking and thinking at the same time. This is a lot harder than it sounds, but it's also critical. The entire interview is one large brainstorming session.
- Practice organizing brainstorming. The difficulty is in (1) producing new ideas during the brainstorming process and (2) simultaneously keeping each suggestion in context: how it addresses one of your requirements. However, this is an important skill. You can practice it by incessantly re-summarizing takeaways throughout the brainstorming session.
- Practice finding shortcuts. Continuously find ways to simplify the problem. You'll need shortcuts to obtain cheap ground truth data, solve the problem in a more elegant way, or to improve the model's performance. Many times, these shortcuts may be rejected by your interviewer; perhaps they're looking to discuss a specific topic, such as inference optimization. However, a clever shortcut can work wonders for the elegance of your solution.
- Practice discussing tradeoffs. With any design choice you make, discuss the tradeoffs for that choice. Your new grouped convolution may, for example, reduce latency at the cost of quality. You may also expand the sensor suite to collect more information, but this comes with a higher risk of sensor miscalibration in your collected data.
Here are AI design prompts that I encountered in actual interviews:
- Track a tennis ball. You determine what hardware to use — what sensors to use, what compute you have access to for training and inference.
- Your self-driving car has 4 long-range LiDAR sensors. You're now adding 2 short-range LiDARs that can detect closer objects. How do you train an object detector for these new LiDAR sensors?
- You're designing a deep learning framework for inference optimization. How would you organize your library? What is the API for using your library?
Now, you know what to practice and how to practice. Grab a colleague or a friend, and practice brainstorming together. Even if you can't find a friend to practice with, practice the tips above on your own.
[1] There is a correct answer. You may have noticed: the softmax is completely unnecessary during inference. Softmax is a monotonically increasing function, meaning that the relative ordering of inputs is unchanged. This means the argmax of the softmax'ed inputs is the same as the argmax of the raw inputs. The solution here is to simply drop softmax. With that said, any clever approximation of softmax is a reasonable, but not the ideal, answer.
[2] As we discussed in the linked posts, Flash Attention reduces latency by jointly tiling two matrix multiplies. Ultimately, this allows us to skip reads and writes for the intermediate output of the first matrix multiplication. At a high level, the number of reads and writes we save for that intermediate output must exceed the number of additional reads and writes we incur for our weights. We fully derived an explicit expression for this tradeoff in When to tile two matrix multiplies.
[3] I'm actually not making this up. Softmax is known to be inefficient, as Grave et al. discuss in their paper "Efficient softmax approximation for GPUs".
[4] This notion of adapters is taken from Houlsby et al. in "Parameter-Efficient Transfer Learning for NLP". In short: insert an "adapter" — a few fully-connected layers — after the attention but before the MLP in a transformer. Freeze the rest of the model, and fine-tune just these adapters.