from Guide to Hacking on Sep 24, 2023
How to use Apple GPUs from Python
Most code samples for using Apple GPUs require Xcode, Objective C, or a combination of both. This post distills the process down to a single file, to demonstrate how you can leverage Apple GPUs from the comfort of Python.
Say you want to write a custom matrix multiply. This matrix multiply could leverage your custom sparse matrix format, operate on a custom data type, or fuse other operations in. Whatever the optimization, you now have a reason to write a custom kernel.
On Nvidia GPUs, it's rather "straightforward"1 to write custom kernels — namely, use CUDA. The toolset for developing and optimizing CUDA kernels is well known, and there are a large number of projects that build and integrate kernels of their own.
However, on Apple GPUs, it's much less straightforward to do this. There are of course plenty of tutorials online, but these demos usually require Xcode, Objective C, or a combination of both. What if I'm working in Python, and don't want to depend on a GUI like Xcode just to run a simple script? In short, what is the simplest "Hello world" for interfacing with custom kernels on Apple GPUs?
Getting set up
Create a new directory to house your project. I will create one on my desktop.
mkdir ~/Desktop/metal
cd ~/Desktop/metal
In this new directory, create a new virtual environment.
python -m venv env
source env/bin/activate
We now need to install pyobjc, a Python binding for many of macOS's built-in frameworks, including the Metal API that we'll use to run custom kernels on Apple GPUs.
pip install pyobjc
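Before moving on, you can sanity-check the install with a one-liner. If the bindings work, this should print a description of your GPU (the exact device name is machine-dependent):
python -c "import Metal; print(Metal.MTLCreateSystemDefaultDevice())"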
Let's now dive right into the code.
Step 0: Write a Metal kernel
We won't dive too deeply into kernel writing, but we'll scratch the surface here to give you an idea. Kernels are written in the Metal Shading Language2 (MSL), which is very similar to C++.
Here's a kernel that takes in an array of integers and adds 2 to every item in that array, in-place. Create a file called add2.metal and write the following.
/* import metal library, like `from metal import *` */
#include <metal_stdlib>
using namespace metal;

/**
 Define a function called `add2_kernel` that doesn't return
 a value. Instead, we modify the input array in-place.

 :param in uint8_t*: An array of integers, as input.
 :param id uint: The index of the current thread. We use this
     to index into the input array. In other words, we
     parallelize by assigning each thread to a different
     element of the input array.
 */
kernel void add2_kernel(device uint8_t *in [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2; /* add 2 in-place */
}
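For intuition, the kernel body is just the inside of a loop; the GPU supplies the loop. Here's a rough CPU-side sketch, in plain Python, of what the whole dispatch computes. This is for intuition only, not code we'll run anywhere.
def add2_reference(arr):
    # Rough CPU equivalent of add2_kernel: the GPU runs this
    # loop body once per thread, in parallel.
    for i in range(len(arr)):  # i plays the role of `id`
        arr[i] = (arr[i] + 2) % 256  # uint8_t addition wraps modulo 256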
Here's another kernel that takes in a single array of floats and computes the log of every element. This time, the operation does not happen in-place. Instead, results are written to an output array. Create a file called log.metal and write the following.
#include <metal_stdlib>
using namespace metal;

/**
 Define a function called `log_kernel` that doesn't return
 a value. Instead, we write results to an output array.

 :param in float*: An array of floats, as input.
 :param out float*: An array of floats, which will contain
     our output.
 :param id uint: The index of the current thread. We use this
     to index into the input array. In other words, we
     parallelize by assigning each thread to a different
     element of the input array.
 */
kernel void log_kernel(device float *in [[ buffer(0) ]],
                       device float *out [[ buffer(1) ]],
                       uint id [[ thread_position_in_grid ]]) {
    out[id] = log(in[id]); /* log each element, *not* in-place */
}
For this step, we'll import and use this log kernel from the comfort of Python. To do so, start by downloading a file with utilities you'll need.
wget https://raw.githubusercontent.com/alvinwan/guide-to-hacking/main/metal/v0-hello-world/utils.py -O utils.py
Now, create a short and simple script to load the kernel and run it. Create a file demo.py with the following contents.
from utils import load
import numpy as np
# create an array, with a single random float value
input_array = np.random.random(1).astype(np.float32)
# load kernel from file, as a runnable python function
log = load('log.metal', function_name='log_kernel')
# run kernel on input array above
output_array = log(input_array)
# check output is correct
error = np.abs(output_array - np.log(input_array)).max()
assert error < 1e-5, "❌ Output does not match reference!"
print("✅ Reference matches output!")
Now, run your hello world script.
python demo.py
The above script will load and execute the Metal kernel, then compare the kernel output to a numpy reference. The comparison should succeed and give the following success message.
✅ Reference matches output!
This completes your very first "Hello world" kernel on an Apple GPU.
How to run Metal from Python
There is no official Python API for accessing and running Metal kernels. However, staying in Python-land is still possible by jumping through a few hoops.
In short, we use Objective C's Metal API, which is exposed via Python bindings by the pyobjc library. The bindings are generated rather than manually defined, so you can expect a one-to-one translation between the Python bindings and the original Objective C API.
For example, take the first few lines of sample Objective C code, found in the official "Performing Calculations on a GPU" tutorial.
id<MTLDevice> device = MTLCreateSystemDefaultDevice();
id<MTLLibrary> defaultLibrary = [device newDefaultLibrary];
id<MTLFunction> addFunction = [defaultLibrary newFunctionWithName:@"add_arrays"];
Translating this into Python is fairly straightforward — we can read off each line and write the same code with Python syntax.
from Metal import *
device = MTLCreateSystemDefaultDevice()
defaultLibrary = device.newDefaultLibrary()
addFunction = defaultLibrary.newFunctionWithName_("add_arrays")
However, notice the last method name wasn't a perfect translation: the Python method newFunctionWithName_ features a trailing underscore. This is because pyobjc replaces every colon in the method name from the official documentation (newFunctionWithName:) with an underscore. Here's another example. In Objective C, we have the following.
_mAddFunctionPSO = [device newComputePipelineStateWithFunction: addFunction error:&error];
The corresponding method name in the official documentation is newComputePipelineStateWithFunction:error:, so the corresponding Python binding is called newComputePipelineStateWithFunction_error_. This gives us the following Python translation.
_mAddFunctionPSO = device.newComputePipelineStateWithFunction_error_(addFunction, None)
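If you ever need to double-check a generated name, Python's built-in introspection works on these proxy objects. As a quick, optional check (the exact output will vary by OS version), list all bindings matching a prefix:
import Metal

device = Metal.MTLCreateSystemDefaultDevice()
# dir() lists the generated bindings; filtering by prefix shows the
# colon-to-underscore convention in action, with names like
# 'newBufferWithLength_options_' and 'newBufferWithBytes_length_options_'.
print([name for name in dir(device) if name.startswith("newBuffer")])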
Using this rule, you should now be able to translate any Metal API resource written in Objective C into Python. In this post, we'll cover the basics so you don't need to go digging and translating yourself.
Step 1: Write a launcher from scratch
Above, we used some scaffolding code to get up and running. Let's now remove that crutch and dive into how we load and execute Metal kernels from Python. Create a new file called run.py with the following contents. Start by importing the Metal API and C datatypes.
import Metal
import ctypes
Then, access the default GPU available to the device, and load the Metal kernel.
# Load the Metal kernel. Kernel adds 2 to input, in-place.
dev = Metal.MTLCreateSystemDefaultDevice() # Get GPU
src = open('add2.metal').read() # Load the kernel source code
lib, _ = dev.newLibraryWithSource_options_error_(src, None, None)
func = lib.newFunctionWithName_("add2_kernel")
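One optional improvement: newLibraryWithSource_options_error_ returns a (library, error) pair, and above we discard the error. If your kernel source has a typo, lib will be None, so surfacing the error makes debugging much easier:
# Optional: surface compile errors instead of discarding them
lib, err = dev.newLibraryWithSource_options_error_(src, None, None)
if lib is None:
    raise RuntimeError(f"Kernel failed to compile: {err}")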
Now, we have the Python representation of our kernel in func. There are three steps we need to follow to execute the kernel.
- Prepare input arguments.
- Construct a command that determines how to run the kernel.
- Execute the command.
Let's start with the first step, where we prepare a buffer for the first and only input argument to the kernel. Our array in this case holds a single value; since each uint8_t is 1 byte, the whole array occupies just 1 byte.
# Create input buffer. Initialized to all zeros.
storage = Metal.MTLResourceStorageModeShared
input_buffer = dev.newBufferWithLength_options_(1, storage)
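As the comment notes, newBufferWithLength_options_ returns a zero-filled buffer, and shared storage makes it visible to the CPU. You can confirm both facts with an optional one-line check:
# Optional sanity check: the new shared buffer is zero-filled
assert input_buffer.contents().as_buffer(1)[0] == 0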
For the second step, we follow a few sub-steps:
- Define a queue of commands.
- Create a command.
- Set up the command to run our kernel.
- Set up the command with the right input arguments.
- Specify how to parallelize the kernel across inputs. We have only one item in our array, so we use just one thread.
- End defining the command.
# Define a 'command' that specifies how to run the kernel
commandQueue = dev.newCommandQueue() # queue of commands
commandBuffer = commandQueue.commandBuffer()
computeEncoder = commandBuffer.computeCommandEncoder() # start
pso = dev.newComputePipelineStateWithFunction_error_(func, None)[0]
computeEncoder.setComputePipelineState_(pso) # set kernel to call
computeEncoder.setBuffer_offset_atIndex_(input_buffer, 0, 0) # arg1
grd = grp = Metal.MTLSizeMake(1, 1, 1) # 1 thread globally
computeEncoder.dispatchThreads_threadsPerThreadgroup_(grd, grp)
computeEncoder.endEncoding() # end
This completes our command definition. For our third step, we now execute the command.
# Execute the 'command' we defined above
commandBuffer.commit()
commandBuffer.waitUntilCompleted()
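At this point the kernel has run. If anything went wrong on the GPU side, the command buffer records it, so an optional check against the command buffer's error (None on success) fails loudly instead of silently producing garbage:
# Optional: fail loudly if the GPU reported an error
if commandBuffer.error() is not None:
    raise RuntimeError(commandBuffer.error())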
This completes our kernel execution. Let's now add a few lines of code to verify the results of our kernel execution.
# Check output. Input was 0, kernel adds 2, so output is 2.
buffer = input_buffer.contents().as_buffer(1) # get buffer
item = ctypes.c_uint8.from_buffer(buffer) # cast to uint8
assert item.value == 2, "❌ Output does not match reference!"
print("✅ Reference matches output!")
This completes our launcher from scratch. Now, run your file to load, execute, and verify the Metal kernel.
python run.py
The comparison should succeed and give the following success message.
✅ Reference matches output!
This now completes your very first kernel launcher from scratch. For our next step, we'll initialize the input argument to more useful values than a single zero.
Step 2: Populate inputs
Above, we left the input unpopulated: an array of length one, holding its default value of zero. Let's now initialize the input more reasonably, with more values, and non-zero ones at that. Start by making our buffer longer. In particular, we'll set up a 1024-byte buffer for 1024 1-byte values.
N = 1024 # our input will have 1024 integers
input_buffer = dev.newBufferWithLength_options_(N, storage)
Next, grab a modifiable view of the input buffer's contents, and populate it with random integers. This uses Python's random module, so add import random to the top of the file.
input_contents = input_buffer.contents().as_buffer(N)
random_integers = [random.randint(0, 253) for _ in range(N)]  # at most 253, so +2 still fits in a uint8
for i in range(N):  # copy random values into buffer
    input_contents[i] = random_integers[i]
Now that our input is much longer, also update how our kernel parallelizes over the input. In particular, we previously asked the kernel to work with just one thread. Now, with 1024 elements, there are 1024 total "jobs" that need to be executed. This is the number of threads per grid.
grp = Metal.MTLSizeMake(32, 1, 1) # 32 threads per group
grd = Metal.MTLSizeMake(1024, 1, 1) # 1024 threads per grid
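We hard-coded 32 threads per group, which evenly divides our 1024 jobs. As an optional refinement, you could instead ask the pipeline state (the pso from Step 1) for its maximum group size; treat this as a sketch, since good group sizes are hardware- and kernel-dependent:
# Optional: size the group from the pipeline's reported maximum
w = min(pso.maxTotalThreadsPerThreadgroup(), N)
grp = Metal.MTLSizeMake(w, 1, 1)
grd = Metal.MTLSizeMake(N, 1, 1)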
Finally, update our verification code at the end of the file to check all elements in the input array, instead of just the first item.
output_metal = list(input_contents)
output_python = [x + 2 for x in random_integers]
assert output_metal == output_python, "❌ Output does not match reference!"
This concludes a more practical kernel, which takes in randomized integer values, and produces a verifiably correct output. For our next step, we'll update our input data type to be floating point instead of integer.
Step 3: Working with floating point
Our kernel above operates only on integers, but real-world use cases for custom GPU kernels involve floating point numbers as well. Let's update our kernel to use floating point. Update the method signature, and change the integer 2 into a float 2.0.
kernel void add2_kernel(device float *in [[ buffer(0) ]],
                        uint id [[ thread_position_in_grid ]]) {
    in[id] = in[id] + 2.0; /* add 2 in-place */
}
When switching from integers to floating point, there are two core issues:
- Floating point numbers generally take several bytes. FP16, the smallest widely supported format, requires 2 bytes.
- Floating point can't be "read directly". This is in contrast to integers, where you can quite easily read off the value of your base-2 number. With floating point, some arithmetic is needed to translate the mantissa and exponent bits into a value.
These two challenges above mean that we need to be careful. Previously, we had a number of simplifications, since each value was exactly a byte long. Now, we need to keep the notions of array length and byte length separate.
N = 1024 # our input will have 1024 floats
B = N * 4 # fp32 is 4 bytes per value
Next, recreate the input buffer to span B bytes, and cast its contents to floating point; for this example, we'll use FP32. Like before, we then generate random values and populate the input buffer with them.
input_buffer = dev.newBufferWithLength_options_(B, storage)
input_contents = input_buffer.contents().as_buffer(B)
input_array = (ctypes.c_float * N).from_buffer(input_contents)
input_list = [random.random() for _ in range(N)] # generate
input_array[:] = input_list # copy random values into buffer
Since our ground truth and kernel-computed outputs are both now floating point, we also need to update the verification step. Read the results back through the ctypes view, and instead of checking for exact matches, allow a tiny epsilon of error, such as 1e-5.
output_metal = list(input_array) # read floats back via the ctypes view
output_python = [x + 2.0 for x in input_list]
assert all(abs(a - b) < 1e-5 for a, b in zip(output_metal, output_python)), "❌ Output does not match reference!"
This completes our modifications for floating point. Now, run your file.
python run.py
Just like every time before, this should give you the following message, indicating success.
✅ Reference matches output!
This completes our latest iteration of the Metal kernel. You now have a working Python integration with Metal, via Python bindings for the Objective C Metal API.
Bonus: Interoperating with Numpy
Most of you reading this use NumPy or PyTorch, or some other library for linear algebra and vector math. Let's see a quick and dirty example of using Metal with NumPy. To start, create a random array of values using NumPy (add import numpy as np to the top of the file), then define a new buffer initialized with the generated array's bytes. Note that newBufferWithBytes_length_options_ copies the data into the buffer, so the original input_array is left untouched, which is exactly what lets us use it as the reference below.
input_array = np.random.random(1024).astype(np.float32)
input_buffer = dev.newBufferWithBytes_length_options_(
    input_array, input_array.nbytes, storage)
Next, after your kernel has finished executing, grab the same readable view of the buffer's contents, then interpret that buffer as a floating point NumPy array.
output = input_buffer.contents().as_buffer(input_array.nbytes)
output_array = np.frombuffer(output, dtype=np.float32)
Finally, in your verification code, use vector math to bound all the errors, instead of manually looping over both vectors.
error = np.abs((input_array + 2.0) - output_array).max()
assert error < 1e-5, "❌ Output does not match reference!"
This finally completes your NumPy-friendly Metal kernel. Let's run your script.
python run.py
Just like every time before, this should give you the following message, indicating success.
✅ Reference matches output!
This completes our final version of the Metal kernel, fully interoperable with NumPy. You can use a similar technique to interoperate with PyTorch as well.
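For example, here is a rough, untested sketch of the PyTorch analogue, assuming a CPU tensor; a CPU tensor's .numpy() view exposes the same buffer protocol that the NumPy path above relies on:
import torch

# Untested sketch: reuse the NumPy path via a CPU tensor's view
tensor = torch.rand(1024, dtype=torch.float32)
input_array = tensor.numpy()  # zero-copy view of the tensor's storage
input_buffer = dev.newBufferWithBytes_length_options_(
    input_array, input_array.nbytes, storage)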
Conclusion
In short, executing Metal kernels directly from Python is possible because Python bindings for Objective C's Metal API exist. We used these bindings liberally in this post, building up our launcher incrementally until we ended up with a generic, NumPy-compatible script.
You can now use this script as boilerplate to accelerate your computational workloads and test actual, on-device inference speeds with custom Metal kernels on Apple GPUs.
1. Of course, writing and optimizing CUDA kernels is no easy task. This is just to say that the pipeline for doing so is well established. ↩
2. You can learn more about the Metal Shading Language (MSL) in the official Apple Metal specification. This document includes all the information you should need for writing kernels. The Wikipedia article on the Metal API may also be helpful. ↩